Akash Bhatia

New Delhi

Summary

Site Reliability Engineer / DevOps Engineer with 6+ years of experience building, operating, and scaling reliable systems on AWS and Kubernetes. Expert in observability (Datadog, Prometheus, Grafana), incident response, performance tuning, and Infrastructure-as-Code with Terraform. Delivered multiple zero-downtime migrations (Redis.io, CloudAMQP) and engineered DR for EKS to uphold strict SLAs and improve MTTA/MTTR.

Work History

Site Reliability Engineer / DevOps Engineer

Monotype Imaging

10.2021 - Current

Owned reliability for multiple high-traffic websites; defined SLAs/SLOs and implemented proactive, signal-rich alerting for 24×7 availability.
Designed DR architecture for EKS and surrounding services; authored runbooks and led periodic recovery drills.
Built and maintained Terraform modules to provision and scale AWS resources (EC2, EBS, EFS, ALB/ELB, API Gateway, CloudWatch, VPC).
Established end-to-end observability with Datadog, Site24x7, Prometheus, Grafana, and VictorOps; reduced noise and improved time-to-detect.
Led and mentored an 8-member L1 team; guided RCA creation and drove actions that improved MTTA and MTTR.
Investigated high-latency issues using APM; collaborated with developers on code-level fixes across microservices to improve throughput and p95/p99 latency.
Contributed to building an Internal Developer Portal to streamline workflows and improve developer productivity and operational excellence.
Built a centralized Grafana and Prometheus stack for pre-prod monitoring, reducing SaaS costs by replacing Datadog in non-prod environments.
Regularly contributed to architectural reviews from project inception, ensuring scalability, cost-efficiency, reliability, and high performance.
Leveraged AI-powered anomaly detection in observability platforms (Datadog, Prometheus) to proactively identify performance degradation before it impacted users.
Implemented AI-assisted log analysis for OpenSearch to accelerate root cause identification and reduce MTTR.
Partnered with QA, leveraging RUM insights to reproduce and resolve user-impacting issues; improved customer experience and stability.

Infrastructure Engineer

Publicis Sapient (Sapient Consulting Pvt. Ltd.)

05.2019 - 10.2021

Managed infrastructure on AWS and VMware; automated provisioning and patching with Ansible.
Worked hands-on with AWS services (EC2, EBS, ALB/ELB, CloudWatch) and with GitHub and Jenkins for CI/CD.
Set up monitoring/alerting for infra and apps; ensured accurate signals and timely escalations.
Worked with Apache, Nginx, and Tomcat; supported application teams and reduced manual toil via scripting.
Performed DR testing with Veeam; executed repository and server migrations (Bitbucket, vCenter).

Education

B.Tech - Electronics & Communication

MDU

CBSE

VV DAV Public School

Skills

Cloud: AWS (EC2, EBS, EFS, ALB/ELB, S3, Route 53, VPC, CloudWatch, API Gateway, ElastiCache), CloudAMQP

IaC & CI/CD: Terraform, Ansible, Jenkins, Git, GitHub, Bitbucket

Containers & Orchestration: Docker, Kubernetes (EKS)

Observability: Datadog, Prometheus, Grafana, Site24x7, New Relic, Kibana, VictorOps (Splunk On-Call)