MRIGANK BHASKAR

Delhi

Summary

Experienced Platform Site Reliability Engineer (SRE) with strong experience in DevOps, MLOps, system design, and data engineering. Skilled in platform engineering, infrastructure design, CI/CD automation, and production troubleshooting. Proven expertise in building scalable observability solutions, capacity planning, and cloud cost optimisation. Hands-on experience managing large-scale GPU-intensive workloads (>1000 GPUs), LLM-based systems, and AIOps-driven environments. Strong communication and collaboration skills, with experience mentoring and leading a team to deliver projects faster. Experienced with multi-cloud environments (AWS, GCP, Azure, OpenStack).

Overview

7 years of professional experience
8 certifications

Work History

Lead Site Reliability Engineer

Ultimate Kronos Group (UKG)
Noida, India
04.2024 - Present
  • Leading a hybrid SRE function for AI services and SaaS infrastructure on a distributed system, owning end-to-end application delivery, improving system reliability, scalability, observability, and incident management, and architecting the platform as Lead SRE.
  • Designed AI/ML infrastructure platforms and implemented AI-driven anomaly detection and auto-healing workflows, reducing mean time to detect (MTTD) and mean time to recovery (MTTR).
  • Writing internal tools using Python and Go.
  • Using AI-assisted development tools such as Claude to accelerate coding, debugging, and design while exploring agent-based workflows to improve prototyping and delivery speed.
  • Built and optimised CI/CD pipelines for UKG Pro AI services using GitHub Actions and Jenkins, enabling zero/low-downtime releases through automation on GCP (Vertex AI). Minimised service latency using HPA and rolling deployment techniques.
  • Implemented Datadog Watchdog and ML-based anomaly detection, improving proactive issue detection using Golden Signals and RED metrics.
  • Leveraged Datadog APM to analyse traces, spans, and latency, reducing API response time and improving end-user experience.
  • Led the migration of core applications from on-prem OpenStack to Kubernetes (GKE), improving system scalability and reducing infrastructure costs.
  • Designed end-to-end infrastructure, including networking, SSL/TLS, ingress, and load balancing, with auto-scaling that avoids performance degradation.
  • Developing a deployment strategy for services such as Istio and Envoy using Helm charts and Harbor. Administering Kubernetes environments, including cluster management and scaling nodes in and out.
  • Defined and implemented SLOs/SLIs, error budgets, and alerting strategies for a SaaS application and AI services in collaboration with product teams, improving service availability to 99.9% and reducing alert noise.
  • Drove cloud cost optimisation initiatives through right-sizing and logging optimisation, reducing overall cloud spend by $100K per month.
  • Led incident management and on-call operations, resolving P1/P2 incidents within SLA, improving system stability across F5 load balancers, HAProxy, and NGINX, and improving API latency, thus improving user traffic flow into the system.
  • Built and integrated OpenTelemetry (OTel) pipelines for logs, metrics, and traces into Datadog and Grafana for .NET, Python, and Go applications, standardising observability.
  • Integrated GenAI/LLM (Claude Sonnet) into workflows for incident summarisation, log analysis, and trace correlation, reducing incident analysis time.
  • Automated infrastructure deployments using Terraform, Ansible, and Chef, orchestrating 200+ provisioning runs across hybrid environments while eliminating drift in production and non-production systems.
  • Implemented automated certificate lifecycle management (SSL, Kafka, Vault), eliminating manual renewals and reducing certificate-related incidents by 95%.
  • Managed and optimised distributed systems (RabbitMQ, Kafka, Elasticsearch, MongoDB), improving throughput and ensuring high availability.
  • Led disaster recovery (DR) and business continuity planning (BCP) initiatives, reducing recovery time objectives (RTO) and ensuring compliance with uptime SLAs.
  • Implemented synthetic monitoring, browser testing, and RUM, increasing issue detection coverage by 99% and improving frontend performance visibility.
  • Developed a centralised automation platform using Python and Ansible to perform pre-checks, API validations (Swagger), and infra health checks, reducing deployment failures.
  • Managing production environments across Linux, Windows, on-premises, and AWS/GCP.

DevOps Engineer II

Teradata India Private Limited
Pune, India
03.2023 - 04.2024
  • Defined observability standards for metrics, logs, and traces across multi-cloud environments (AWS, Azure, GCP) using Datadog and Prometheus.
  • Writing internal tools using Python and Go libraries.
  • Built and optimised CI/CD pipelines using Jenkins and GitLab, implementing blue-green deployments and automated host downtime management to ensure zero or low downtime releases.
  • Deployed and maintained Dockerised microservices on AKS clusters with Helm charts, improving container orchestration reliability and supporting consistent delivery across 15+ applications.
  • Deployed containerised ML workloads on AKS clusters using Kubeflow pipelines, reducing average training-to-deployment time from 3 days to 6 hours.
  • Containerised Python and Go-based services using Docker, and orchestrated deployments on Kubernetes, improving scalability and automation.
  • Developed Terraform-based infrastructure provisioning and automated deployment workflows for cloud-native services.
  • Defined SLIs/SLOs and error budgets for database services, ML workloads, and AI agents using Datadog and Grafana.
  • Implemented end-to-end observability across VMs, containers, Kubernetes, and serverless environments using Datadog and OpenTelemetry.
  • Created and managed monitors and alerts (APM, infrastructure, network, RUM, synthetic, USM) to provide full-stack visibility and proactive issue detection.
  • Built a Python and Jenkins-based automation framework to deploy services across 100+ hosts in a single click, significantly improving operational efficiency.
  • Optimised observability costs by reducing log ingestion, tuning metrics collection, enabling flex logs, and defining SLIs.
  • Established monitoring from frontend to infrastructure layers, enabling comprehensive visibility into errors, latency, and system health.
  • Handled ML-heavy workloads for the Teradata recommendation system (1000 GPUs) and automated the end-to-end ML model lifecycle with MLflow and Azure ML, covering training, packaging, and deployment of 120+ models across NLP and recommendation systems.
  • Integrated AI/ML-based anomaly detection and automated alerting workflows in Datadog, with integrations to PagerDuty, Slack, and MS Teams.
  • Managed production incidents across monolithic and microservices architectures, performed RCA, and implemented preventive measures.
  • Participated in on-call rotations, resolving high-severity incidents within SLA, and improving system reliability.
  • Led platform system design for newly onboarded services, covering platform building, observability, and application release pipelines.

Sr. Analyst/Software Engineer

Capgemini Technology
Bengaluru, India
01.2021 - 03.2023
  • Worked in a DevOps team managing Linux-based infrastructure (RHEL, Unix, HP-UX, IBM AIX) and Azure cloud services, including AKS, Azure DevOps pipelines, and PaaS offerings.
  • Implemented end-to-end CI/CD pipelines for complex microservices applications using Jenkins, enabling automated and efficient deployments.
  • Developed Jenkins pipelines to deploy applications on Kubernetes clusters, ensuring smooth and consistent releases.
  • Ensured code quality through static analysis using SonarQube and managed artifacts via JFrog repository.
  • Monitored infrastructure and application performance using Datadog, Splunk, and Azure monitoring tools.
  • Built Docker images for multiple application modules (Java and Python-based services).
  • Performed daily Linux/Unix administration and managed large-scale infrastructure across on-premises and Azure environments.
  • Automated operational tasks using Ansible (playbooks and ad-hoc commands), managing 1000+ production servers.
  • Led security vulnerability patching across Dev, QA, and Production environments, ensuring systems remained secure and up to date.
  • Conducted RCA for major incidents and implemented mitigation strategies to prevent recurrence.
  • Managed Kubernetes clusters and handled application deployments, troubleshooting, and performance issues.
  • Contributed to the setup of Kubernetes clusters on Azure (AKS) for application migration and testing.
  • Owned the end-to-end DevOps lifecycle, including monitoring setup and alerting using Splunk, Datadog, and Azure Application Monitoring.
  • Developed Python scripts to generate large-scale user data reports from production systems.
  • Managed Docker and Kubernetes platforms, handling deployments and resolving production issues.
  • Wrote Bash scripts for automation tasks such as pre-checks, health checks, and sanity validations.
  • Used ServiceNow for incident, change, and problem management as part of ITSM processes.

Executive

Mindlance
Noida, India
06.2019 - 01.2021
  • Engaged with clients, stakeholders, and project managers to understand requirements and deliver accordingly.
  • Participated in various technical activities, including sourcing in software engineering.

Education

Bachelor of Technology - Electronics and Communication Engineering

Delhi Institute of Technology & Management
Delhi
05-2019

Skills

  • Linux
  • Jenkins, GitHub Actions, and TeamCity
  • Docker
  • Kubernetes
  • Datadog
  • Splunk
  • Terraform
  • Ansible
  • Git/GitHub
  • Azure/AWS/GCP
  • Bash/Python
  • SNOW/JIRA

Certification

  • Applied Generative AI Specialisation - Purdue University (In Progress)
  • GCP Professional Cloud DevOps Engineer Certification - Jan 2028, 17594b32-2785-4ce7-ad1d-7f65e3161180
  • Master Program – DevOps Engineer, 07/17/23, 77828740
  • Docker Certified Associate, 01/06/22
  • Microsoft Azure DevOps Expert, 07/01/22, I336-9914
  • Microsoft Azure Administrator, 06/31/22, I327-7892
  • DevOps Training Certification, 01/01/22, 3175121
  • Oracle Certified Foundation Associate, 53571550CIBF2021

Personal Information

Title: Lead SRE

Hobbies and Interests

  • Swimming
  • Sports
  • Travelling
