MRIGANK BHASKAR

Delhi

Summary

Experienced Platform Site Reliability Engineer (SRE) with strong experience in DevOps, MLOps, system design, and data engineering. Skilled in platform engineering, infrastructure design, CI/CD automation, and production troubleshooting. Proven expertise in building scalable observability solutions, capacity planning, and cloud cost optimisation. Hands-on experience managing large-scale GPU-intensive workloads (>1000 GPUs), LLM-based systems, and AIOps-driven environments. Strong communication and collaboration skills, with experience mentoring and leading a team to deliver projects faster. Experienced with multi-cloud environments (AWS, GCP, Azure, OpenStack).

Work History

Lead Site Reliability Engineer

Ultimate Kronos Group (UKG)

Noida, India

04.2024 - Present

Leading hybrid SRE work for AI services and SaaS infrastructure in a distributed system as Lead SRE: owning end-to-end application delivery, improving system reliability, scalability, observability, and incident management, and architecting the platform.
Designed AI/ML infrastructure platforms and implemented AI-driven anomaly detection and auto-healing workflows, reducing mean time to detect (MTTD) and mean time to recovery (MTTR).
Writing internal tools in Python and Go.
Using AI-assisted development tools such as Claude to accelerate coding, debugging, and design while exploring agent-based workflows to improve prototyping and delivery speed.
Built and optimised CI/CD pipelines for UKG Pro AI services using GitHub Actions and Jenkins, enabling zero/low-downtime releases through automation on GCP (Vertex AI). Minimised service latency using the Horizontal Pod Autoscaler (HPA) and rolling updates.
Implemented Datadog Watchdog and ML-based anomaly detection, improving proactive issue detection using the Golden Signals and RED metrics.
Leveraged Datadog APM to analyse traces, spans, and latency, reducing API response time and improving end-user experience.
Led the migration of core applications from on-prem OpenStack to Kubernetes (GKE), improving system scalability and reducing infrastructure costs.
Designed end-to-end infrastructure including networking, SSL/TLS, ingress, and load balancing, with auto-scaling without performance degradation.
Developing a strategy to deploy services such as Istio and Envoy using Helm charts and Harbor. Administering the Kubernetes environment, including scaling nodes in and out and cluster management.
Defined and implemented SLOs/SLIs, error budgets, and alerting strategies for a SaaS application and AI services in collaboration with product teams, improving service availability to 99.9% and reducing alert noise.
Drove cloud cost optimisation initiatives through right-sizing and logging optimisation, reducing overall cloud spend by $100K per month.
Led incident management and on-call operations, resolving P1/P2 incidents within SLA and improving system stability and API latency across load balancers (F5, HAProxy, NGINX), thus improving user flow into the system.
Built and integrated OpenTelemetry (OTel) pipelines for logs, metrics, and traces into Datadog and Grafana for .NET, Python, and Go applications, standardising observability.
Integrated GenAI/LLM models (Claude Sonnet) into workflows for incident summarisation, log analysis, and trace correlation, reducing incident analysis time.
Automated infrastructure deployments using Terraform, Ansible, and Chef, orchestrating 200+ provisioning runs across hybrid environments while eliminating drift in production and non-production systems.
Implemented automated certificate lifecycle management (SSL, Kafka, Vault), eliminating manual renewals and reducing certificate-related incidents by 95%.
Managed and optimised distributed systems (RabbitMQ, Kafka, Elasticsearch, MongoDB), improving throughput and ensuring high availability.
Led disaster recovery (DR) and business continuity planning (BCP) initiatives, reducing recovery time objectives (RTO) and ensuring compliance with uptime SLAs.
Implemented synthetic monitoring, browser testing, and RUM, increasing issue detection coverage by 99% and improving frontend performance visibility.
Developed a centralised automation platform using Python and Ansible to perform pre-checks, API validations (Swagger), and infra health checks, reducing deployment failures.
Managing production environments across Linux, Windows, on-premises, and AWS/GCP.

DevOps Engineer II

Teradata India Private Limited

Pune, India

03.2023 - 04.2024

Defined observability standards for metrics, logs, and traces across multi-cloud environments (AWS, Azure, GCP) using Datadog and Prometheus.
Wrote internal tools in Python and Go.
Built and optimised CI/CD pipelines using Jenkins and GitLab, implementing blue-green deployments and automated host downtime management to ensure zero or low downtime releases.
Deployed and maintained Dockerised microservices on AKS clusters with Helm charts, improving container orchestration reliability and supporting consistent delivery across 15+ applications.
Deployed containerised ML workloads on AKS clusters using Kubeflow pipelines, reducing average training-to-deployment time from 3 days to 6 hours.
Containerised Python and Go-based services using Docker, and orchestrated deployments on Kubernetes, improving scalability and automation.
Developed Terraform-based infrastructure provisioning and automated deployment workflows for cloud-native services.
Wrote SLIs/SLOs and error budgets for database services, ML workloads, and AI agents using Datadog and Grafana.
Implemented end-to-end observability across VMs, containers, Kubernetes, and serverless environments using Datadog and OpenTelemetry.
Created and managed monitors and alerts (APM, infrastructure, network, RUM, synthetic, USM) to provide full-stack visibility and proactive issue detection.
Built a Python and Jenkins-based automation framework to deploy services across 100+ hosts in a single click, significantly improving operational efficiency.
Optimised observability costs by reducing log ingestion, tuning metrics collection, enabling flex logs, and defining SLIs.
Established monitoring from frontend to infrastructure layers, enabling comprehensive visibility into errors, latency, and system health.
Handled ML-heavy workloads for the Teradata recommendation system (1000 GPUs) and automated the end-to-end ML model lifecycle with MLflow and Azure ML, covering training, packaging, and deployment of 120+ models across NLP and recommendation systems.
Integrated AI/ML-based anomaly detection and automated alerting workflows in Datadog, with integrations to PagerDuty, Slack, and MS Teams.
Managed production incidents across monolithic and microservices architectures, performed RCA, and implemented preventive measures.
Participated in on-call rotations, resolving high-severity incidents within SLA, and improving system reliability.
Led platform system design for newly onboarded services, covering platform building, observability, and application release pipelines.

Sr. Analyst/Software Engineer

Capgemini Technology

Bengaluru, India

01.2021 - 03.2023

Worked in a DevOps team managing Linux-based infrastructure (RHEL, Unix, HP-UX, IBM AIX) and Azure cloud services, including AKS, Azure DevOps pipelines, and PaaS offerings.
Implemented end-to-end CI/CD pipelines for complex microservices applications using Jenkins, enabling automated and efficient deployments.
Developed Jenkins pipelines to deploy applications on Kubernetes clusters, ensuring smooth and consistent releases.
Ensured code quality through static analysis using SonarQube and managed artifacts via JFrog repository.
Monitored infrastructure and application performance using Datadog, Splunk, and Azure monitoring tools.
Built Docker images for multiple application modules (Java and Python-based services).
Performed daily Linux/Unix administration and managed large-scale infrastructure across on-premises and Azure environments.
Automated operational tasks using Ansible (playbooks and ad-hoc commands), managing 1000+ production servers.
Led security vulnerability patching across Dev, QA, and Production environments, ensuring systems remained secure and up to date.
Conducted RCA for major incidents and implemented mitigation strategies to prevent recurrence.
Managed Kubernetes clusters and handled application deployments, troubleshooting, and performance issues.
Contributed to the setup of Kubernetes clusters on Azure (AKS) for application migration and testing.
Owned the end-to-end DevOps lifecycle, including monitoring setup and alerting using Splunk, Datadog, and Azure Application Monitoring.
Developed Python scripts to generate large-scale user data reports from production systems.
Managed Docker and Kubernetes platforms, handling deployments and resolving production issues.
Wrote Bash scripts for automation tasks such as pre-checks, health checks, and sanity validations.
Used ServiceNow for incident, change, and problem management as part of ITSM processes.

Executive

Mindlance

Noida, India

06.2019 - 01.2021

Engaged with clients, stakeholders, and project managers to understand requirements and deliver accordingly.
Participated in technical activities, including sourcing in software engineering.

Education

Bachelor of Technology - Electronics and Communication Engineering

Delhi Institute of Technology & Management

Delhi

05-2019

Skills

Linux
Jenkins, GitHub Actions, and TeamCity
Docker
Kubernetes
Datadog
Splunk
Terraform
Ansible
Git/GitHub
Azure/AWS/GCP
Bash/Python
ServiceNow/Jira

Certification

Applied Generative AI Specialisation - Purdue University (In Progress)
GCP Professional Cloud DevOps Engineer Certification - Jan 2028, 17594b32-2785-4ce7-ad1d-7f65e3161180
Master Program – DevOps Engineer, 07/17/23, 77828740
Docker Certified Associate, 01/06/22
Microsoft Azure DevOps Expert, 07/01/22, I336-9914
Microsoft Azure Administrator, 06/31/22, I327-7892
DevOps Training Certification, 01/01/22, 3175121
Oracle Certified Foundation Associate, 53571550CIBF2021

Personal Information

Title: Lead SRE/

Hobbies and Interests

Swimming
Sports
Travelling
