Akshay Dubey

Bangalore

Summary

Seasoned Staff Engineer experienced in developing applications, databases and cloud computing solutions, with proven troubleshooting and debugging capabilities for resolving complex technical issues. Detail-oriented, organized and meticulous; works at a fast pace to meet tight deadlines and handles multiple projects simultaneously with a high degree of accuracy. Respectful self-motivator gifted at finding reliable solutions for software issues. Experienced in team building, with skills in Python and Terraform. Fluent in English and accustomed to working with cross-cultural, global teams. Enthusiastic team player with a positive attitude, willing to take on added responsibilities to meet team goals.

Overview

13 years of professional experience

Work History

Staff Software Engineer

Coupang
07.2024 - Current


Working with the Site Reliability Engineering (SRE) and Observability teams at Coupang, focusing on the design, deployment, and management of the Observability Ecosystem (OE) Stack, which includes Grafana Mimir (metrics), Loki (logs), and Tempo (traces).

Key responsibilities include:

  • Managing large-scale observability infrastructure deployed on Kubernetes, ensuring high availability and performance across multi-AZ environments.
  • Implementing and tuning SLIs/SLOs, alerting rules, and dashboards to provide actionable insights and proactively detect issues.
  • Building and maintaining custom exporters and integrations to bridge gaps in native observability tooling.
  • Automating operational workflows and reliability checks using Prometheus, Alloy, Alertmanager, and ArgoCD.
  • Leading SRE initiatives around incident response, capacity planning, and performance optimization.
  • Ensuring compliance and governance by maintaining audit trails, runtime overrides, and config synchronization mechanisms.

Actively collaborating with cross-functional teams to promote observability best practices, enhance developer experience, and drive a culture of reliability across the organization.


Cloud Engineer

Onehouse (Startup)
03.2024 - Current
  • Contributed to product improvement initiatives by providing valuable insights based on hands-on experience with various cloud engineering tools and frameworks.
  • Implemented security best practices across all cloud infrastructure elements, safeguarding sensitive data from unauthorized access.
  • Mentored junior engineers on industry best practices, fostering a culture of knowledge sharing within the team.
  • Established effective monitoring strategies using advanced analytics tools for real-time visibility into infrastructure performance metrics.
  • Enhanced cloud infrastructure efficiency by implementing advanced automation techniques and tools.
  • Managed disaster recovery plans and procedures, enabling rapid system restoration during unexpected outages or failures.
  • Reviewed existing systems and made recommendations for improvements.

Staff Software Engineer

Walmart
10.2022 - 03.2024
  • Led a team of SRE engineers and played a key role in implementing Observability, Telemetry, and SRE strategies across Walmart's infrastructure, ensuring real-time monitoring, debugging, and performance optimization.
  • Successfully integrated open-source CNCF observability projects into Walmart's technology stack, contributing to the broader open-source community while enhancing system reliability and efficiency.
  • Oversaw the development and maintenance of telemetry systems, leveraging Python, Kubernetes, Docker, Ansible, and Terraform, to collect and analyze data for identifying performance bottlenecks and optimizing resource allocation.
  • Implemented best practices for incident management, root cause analysis, and post-mortem reviews, resulting in reduced downtime and improved response times.
  • Collaborated with cross-functional teams to establish and meet Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring the availability, latency, and error rates of critical services.
  • Introduced automation and standardization, utilizing Ansible and Terraform, to streamline SRE practices and enhance system resilience, resulting in reduced mean time to recovery (MTTR) and increased operational efficiency.
  • Applied expertise in Azure DevOps and AKS (Azure Kubernetes Service) to significantly improve deployment workflows, infrastructure management, and container orchestration within an Azure-centric environment.
  • Led the strategic implementation of a unified observability platform by leveraging OpenTelemetry (OTel) for comprehensive tracing, Prometheus for metrics collection, Grafana for data visualization, and Splunk for advanced logging and analysis. This multifaceted approach ensured end-to-end visibility across Walmart's infrastructure, enhancing real-time monitoring, debugging, and performance optimization.
  • Orchestrated the integration of OpenTelemetry across Walmart's services, enabling standardized telemetry data collection (metrics, logs, and traces) that facilitated deep insights into system behavior and performance issues. This foundational work was crucial for the seamless aggregation and correlation of data across different observability tools.
  • Championed the adoption of Prometheus for its powerful metrics collection capabilities, configuring and managing exporters to monitor a vast array of microservices and Kubernetes clusters. Developed comprehensive dashboards in Grafana that provided real-time insights into system health, application performance, and user experience, driving data-driven decision-making.
  • Advanced Walmart's logging infrastructure by integrating Splunk, enhancing log aggregation, searchability, and analysis. Implemented structured logging practices and optimized log ingestion pipelines, which significantly improved the troubleshooting speed and efficiency of identifying root causes during incidents.
  • Developed and implemented cost optimization strategies for cloud resources, particularly in Azure environments, ensuring efficient use of resources without compromising on system performance or reliability.
  • Provided technical leadership and guidance to the engineering team, fostering a culture of continuous learning and innovation.
  • Acted as a technical advisor to senior leadership, providing insights on industry trends, emerging technologies, and strategies to optimize system performance and user experience.

Senior Software Engineer

Walmart
12.2017 - 10.2022
  • Developed and maintained observability solutions, including real-time monitoring and log analysis, using Python, Kubernetes, Docker, and other relevant technologies.
  • Contributed to open-source CNCF observability projects, actively participating in the community and collaborating with other industry experts.
  • Designed and implemented telemetry systems to collect and analyze performance data, proactively identifying and resolving issues to improve system reliability and performance.
  • Implemented scalable infrastructure using Kubernetes and Docker, enabling efficient deployment and management of services.
  • Automated infrastructure provisioning and configuration management using Ansible, resulting in increased operational efficiency and reduced human error.

Technical Lead

Nokia
06.2016 - 12.2017
  • Kept development cycles on schedule and met delivery deadlines.
  • Tested, automated and whitelisted software to be used in secure environments.
  • Maintained, debugged and optimized automation programs.

Advanced Systems Engineer

Cognizant
08.2014 - 05.2016
  • Scripting with Python and Perl: Used Python and Perl extensively to develop automation scripts, tools, and utilities, contributing to the overall efficiency and reliability of database operations.
  • Credit Suisse Support: Provided critical support to Credit Suisse's IT infrastructure, ensuring that database systems ran smoothly with minimal downtime and met performance and security requirements.
  • Continuous Improvement: Actively participated in identifying areas for improvement, optimizing existing processes, and staying up to date with the latest technologies and best practices in database management and automation.

Systems Engineer

TCS
03.2012 - 08.2014
  • Developed custom software solutions for various business needs, resulting in improved productivity.
  • Monitored system health using various performance metrics, quickly addressing any potential bottlenecks or resource limitations.
  • Created detailed project plans outlining scope, timeline, resources required, and potential risks associated with each initiative undertaken by the engineering team.
  • Contributed towards continuous improvement initiatives aimed at streamlining internal processes and enhancing overall service delivery quality.

Education

Bachelor of Engineering - Electrical Engineering

Gyan Ganga Institute of Technology And Sciences
05.2011

Skills

  • Site reliability engineering expertise
  • Observability and Telemetry
  • Python
  • Kubernetes
  • Google Cloud Platform
  • Docker
  • Ansible
  • Terraform
  • Incident Management
  • Root Cause Analysis
  • Splunk
  • Automation
  • Cross-Functional Collaboration
  • Leadership and Team Management
  • OpenTelemetry (OTel)

Personal Information

Languages

  • English
  • Hindi
