Experienced Platform Site Reliability Engineer (SRE) with strong experience in DevOps, MLOps, system design, and data engineering. Skilled in platform engineering, infrastructure design, CI/CD automation, and production troubleshooting. Proven expertise in building scalable observability solutions, capacity planning, and cloud cost optimisation. Hands-on experience managing large-scale GPU-intensive workloads (>1000 GPUs), LLM-based systems, and AIOps-driven environments. Strong communication and collaboration skills, with experience mentoring and leading a team to deliver projects faster. Experienced with multi-cloud environments (AWS, GCP, Azure, OpenStack).
Overview
7 years of professional experience
8 certifications
Work History
Lead Site Reliability Engineer
Ultimate Kronos Group (UKG)
Noida, India
04.2024 - Present
Leading hybrid SRE for AI services and SaaS infrastructure on a distributed system, owning end-to-end application delivery, system reliability, scalability, observability, and incident management, and architecting the platform.
Designed AI/ML infrastructure platforms and implemented AI-driven anomaly detection and auto-healing workflows, reducing mean time to detect (MTTD) and mean time to recovery (MTTR).
Writing internal tools using Python and Go libraries.
Using AI-assisted development tools such as Claude to accelerate coding, debugging, and design while exploring agent-based workflows to improve prototyping and delivery speed.
Built and optimised CI/CD pipelines for UKG Pro AI services using GitHub Actions and Jenkins, enabling zero/low-downtime releases through automation on GCP (Vertex AI). Minimised service latency using HPA and rolling update techniques.
Implemented Datadog Watchdog and ML-based anomaly detection, improving proactive issue detection using Golden Signals and RED metrics.
Leveraged Datadog APM to analyse traces, spans, and latency, reducing API response time and improving end-user experience.
Led the migration of core applications from on-prem OpenStack to Kubernetes (GKE), improving system scalability and reducing infrastructure costs.
Designed end-to-end infrastructure, including networking, SSL/TLS, ingress, and load balancing, with auto-scaling that avoids performance degradation.
Developing a strategy to deploy services such as Istio and Envoy using Helm charts and Harbor. Administering the Kubernetes environment, including scaling nodes in and out and cluster management.
Defined and implemented SLOs/SLIs, error budgets, and alerting strategies for a SaaS application and AI services in collaboration with product teams, improving service availability to 99.9% and reducing alert noise.
Drove cloud cost optimisation initiatives through right-sizing and logging optimisation, reducing overall cloud spend by $100K per month.
Led incident management and on-call operations, resolving P1/P2 incidents within SLA and improving system stability across load balancers (F5, HAProxy, NGINX) and API latency, thus improving user traffic flow into the system.
Built and integrated OpenTelemetry (OTel) pipelines for logs, metrics, and traces into Datadog and Grafana for .NET, Python, and Go applications, standardising observability.
Integrated GenAI/LLM (Claude Sonnet) into workflows for incident summarisation, log analysis, and trace correlation, reducing incident analysis time.
Automated infrastructure deployments using Terraform, Ansible, and Chef, orchestrating 200+ provisioning runs across hybrid environments while eliminating drift in production and non-production systems.
Implemented automated certificate lifecycle management (SSL, Kafka, Vault), eliminating manual renewals and reducing certificate-related incidents by 95%.
Managed and optimised distributed systems (RabbitMQ, Kafka, Elasticsearch, MongoDB), improving throughput and ensuring high availability.
Led disaster recovery (DR) and business continuity planning (BCP) initiatives, reducing recovery time objectives (RTO) and ensuring compliance with uptime SLAs.
Implemented synthetic monitoring, browser testing, and RUM, increasing issue detection coverage by 99% and improving frontend performance visibility.
Developed a centralised automation platform using Python and Ansible to perform pre-checks, API validations (Swagger), and infra health checks, reducing deployment failures.
Managing production environments across Linux, Windows, on-premises, and AWS/GCP.
DevOps Engineer II
Teradata India Private Limited
Pune, India
03.2023 - 04.2024
Defined observability standards for metrics, logs, and traces across multi-cloud environments (AWS, Azure, GCP) using Datadog and Prometheus.
Writing internal tools using Python and Go libraries.
Built and optimised CI/CD pipelines using Jenkins and GitLab, implementing blue-green deployments and automated host downtime management to ensure zero or low downtime releases.
Deployed and maintained Dockerised microservices on AKS clusters with Helm charts, improving container orchestration reliability and supporting consistent delivery across 15+ applications.
Deployed containerised ML workloads on AKS clusters using Kubeflow pipelines, reducing average training-to-deployment time from 3 days to 6 hours.
Containerised Python and Go-based services using Docker, and orchestrated deployments on Kubernetes, improving scalability and automation.
Developed Terraform-based infrastructure provisioning and automated deployment workflows for cloud-native services.
Wrote SLIs/SLOs and error budgets for database services, ML workloads, and AI agents using Datadog and Grafana.
Implemented end-to-end observability across VMs, containers, Kubernetes, and serverless environments using Datadog and OpenTelemetry.
Created and managed monitors and alerts (APM, infrastructure, network, RUM, synthetic, USM) to provide full-stack visibility and proactive issue detection.
Built a Python and Jenkins-based automation framework to deploy services across 100+ hosts in a single click, significantly improving operational efficiency.
Optimised observability costs by reducing log ingestion, tuning metrics collection, enabling flex logs, and defining SLIs.
Established monitoring from frontend to infrastructure layers, enabling comprehensive visibility into errors, latency, and system health.
Handled ML-heavy workloads for the Teradata recommendation system (1000 GPUs) and automated the end-to-end ML model lifecycle with MLflow and Azure ML, covering training, packaging, and deployment of 120+ models across NLP and recommendation systems.
Integrated AI/ML-based anomaly detection and automated alerting workflows in Datadog, with integrations to PagerDuty, Slack, and MS Teams.
Managed production incidents across monolithic and microservices architectures, performed RCA, and implemented preventive measures.
Participated in on-call rotations, resolving high-severity incidents within SLA, and improving system reliability.
Led platform system design for newly onboarded services, covering platform building, observability, and application release pipelines.
Sr. Analyst/Software Engineer
Capgemini Technology
Bengaluru, India
01.2021 - 03.2023
Worked in a DevOps team managing Linux-based infrastructure (RHEL, Unix, HP-UX, IBM AIX) and Azure cloud services, including AKS, Azure DevOps pipelines, and PaaS offerings.
Implemented end-to-end CI/CD pipelines for complex microservices applications using Jenkins, enabling automated and efficient deployments.
Developed Jenkins pipelines to deploy applications on Kubernetes clusters, ensuring smooth and consistent releases.
Ensured code quality through static analysis using SonarQube and managed artifacts via JFrog repository.
Monitored infrastructure and application performance using Datadog, Splunk, and Azure monitoring tools.
Built Docker images for multiple application modules (Java and Python-based services).
Performed daily Linux/Unix administration and managed large-scale infrastructure across on-premises and Azure environments.
Automated operational tasks using Ansible (playbooks and ad-hoc commands), managing 1000+ production servers.
Led security vulnerability patching across Dev, QA, and Production environments, ensuring systems remained secure and up to date.
Conducted RCA for major incidents and implemented mitigation strategies to prevent recurrence.
Managed Kubernetes clusters and handled application deployments, troubleshooting, and performance issues.
Contributed to the setup of Kubernetes clusters on Azure (AKS) for application migration and testing.
Owned the end-to-end DevOps lifecycle, including monitoring setup and alerting using Splunk, Datadog, and Azure Application Monitoring.
Developed Python scripts to generate large-scale user data reports from production systems.
Managed Docker and Kubernetes platforms, handling deployments and resolving production issues.
Wrote Bash scripts for automation tasks such as pre-checks, health checks, and sanity validations.
Used ServiceNow for incident, change, and problem management as part of ITSM processes.
Executive
Mindlance
Noida, India
06.2019 - 01.2021
Engaged with clients, stakeholders, and project managers to understand requirements and deliver accordingly.
Participated in technical activities, including sourcing in software engineering.
Education
Bachelor of Technology - Electronics and Communication Engineering
Delhi Institute of Technology & Management
Delhi
05-2019
Skills
Linux
Jenkins, GitHub Actions, and TeamCity
Docker
Kubernetes
Datadog
Splunk
Terraform
Ansible
Git/GitHub
Azure/AWS/GCP
Bash/Python
ServiceNow/Jira
Certification
Applied Generative AI Specialisation - Purdue University (In Progress)
GCP Professional Cloud DevOps Engineer Certification - Jan 2028, 17594b32-2785-4ce7-ad1d-7f65e3161180
Master Program – DevOps Engineer, 07/17/23, 77828740
Docker Certified Associate, 01/06/22
Microsoft Azure DevOps Expert, 07/01/22, I336-9914
Microsoft Azure Administrator, 06/31/22, I327-7892
DevOps Training Certification, 01/01/22, 3175121
Oracle Certified Foundation Associate, 53571550CIBF2021
Personal Information
Title: Lead SRE
Hobbies and Interests
Swimming
Sports
Travelling