ANIERUTAN BALAJI

Chennai

Summary

Product Support and Site Reliability Engineering Manager with 13+ years of experience leading 24/7 production operations, incident management, and reliability engineering for large-scale cloud platforms. Reduced MTTR, led Sev-1/Sev-2 incident response, and built globally distributed follow-the-sun support teams for cloud-based distributed systems. Strong experience across AWS infrastructure, observability, incident tooling, and cross-functional execution to deliver highly available and resilient digital platforms. Managed production incident operations for a fleet of 95,000 servers across hybrid cloud infrastructure.

Overview

11 years of professional experience
1 Certification

Work History

Senior Engineering Manager - Site Reliability & CX

Celigo
Hyderabad, India
02.2024 - Current

Lead 24/7 production operations and incident management for a global SaaS integration platform.

  • Reduced MTTR from 48 days to under 7 days (Feb 2024–Oct 2025) through cross-training, refined communication protocols, and diagnostic automation across all issue types.
  • Served as Incident Commander for Sev-1/Sev-2 incidents, cutting Sev-1 resolution time from 36 hours to 5 hours and Sev-2 from 36 hours to 8 hours, improving customer experience and SLA compliance.
  • Designed and implemented an org-wide on-call model (July 2025) integrating Splunk P0 alerts with PagerDuty across 35 alerts and 9 physical services — cutting Sev-1 response time from 23 hours to under 3 minutes and ack time from 11 hours to under 3 minutes.
  • Built observability coverage across 68 logical services with 66 Splunk P0 alerts and 226 New Relic P0 alerts; currently developing customer-level observability dashboards using Splunk, Snowflake, and anomaly detection.
  • Developed an RCA/COE automation framework (March 2024), reducing incident turnaround by 7 days and saving 100+ developer hours per cycle; established a monthly COE governance cadence (June 2025) to track heat maps, root causes, and closure accountability.
  • Contributed to AI/ML-based anomaly detection (July 2024) that proactively identified critical anomalies pre-deployment, preventing outages during microservices migration.
  • Delivered AI-powered Slack and Zendesk operational bots (Oct 2024) automating runbook guidance and ticket summaries, saving 102 developer hours and 252 CSM hours monthly.
  • Integrated Cursor AI (Aug 2025) and pioneered Jira MCP integration, reducing debugging and root-cause analysis time from 1 day to ~2 hours — saving ~126 hours per quarter.
  • Built executive ThoughtSpot dashboards (Sep 2025) providing real-time visibility into SLA adherence, MTTR, backlog aging, and throughput, now used by leadership for performance tracking.
  • Improved independent ticket resolution rate from 75% (Q1 2024) to 87%, reducing engineering dependency; served as primary escalation owner for Tier-1 clients during critical incidents.
  • Architected DiagnostIQ, an AI agent using LangChain and Knowledge Base grounding to provide L1/L2 teams with automated diagnostic access to Splunk — reducing investigation time from 60 minutes to under 5 minutes with zero-hallucination accuracy.
  • Mentored team members into product advocates, driving 15–20 product improvements and 175+ documentation enhancements; authored 90+ runbooks to accelerate L2 technical ramp-up and self-service adoption.

Support Engineering Manager

Amazon
Chennai
10.2021 - 10.2023
  • Mentored 2 managers and 12 engineers, interviewing 130+ candidates and building high-performing Kindle support teams.
  • Achieved 85% self-service resolution across 8 applications, reducing engineering dependency.
  • Delivered 87% SLA compliance through improved incident handling and escalation management.
  • Led migration of 38 data pipelines during FDP to Cradle streaming transformation.
  • Managed migration of a legacy security system to CloudAuth + AAA architecture.
  • Co-authored Correction of Error (COE) frameworks to improve operational standards.
  • Led development of an ML-based auditing tool to categorize and prioritize support tickets, reducing ticket volume by 10k+ annually.
  • Established standardized ticket management practices across Kindle support.
  • Led learning and capability programs improving communication, problem-solving, and technical leadership skills.

Senior Software Support Engineer

Amazon
01.2020 - 10.2021
  • Reduced the KDP Authors Reporting Service (KARS) ticket queue from 150 to fewer than 80 within a month, and brought it to a steady state of 20 within a year by providing technical expertise.
  • Developed a Python tool of approximately 8,000 lines of code to migrate data from a production AWS account to a beta AWS account, delivering it in under a month through strategic planning of this technically challenging project.
  • Served as project manager for the GDPR compliance feature release using agile methods, and wrote Python code to obfuscate data in 38 DynamoDB tables with nested primary key values.
  • Facilitated process improvement and standardization teams, reducing incoming tickets to KARS from the Customer Service & Support (CSS) team by 50%.
  • Developed Python tools to automate re-driving messages from Dead Letter Queues (DLQ) to Simple Queue Service (SQS), saving 720 tickets per year.
  • Analyzed fleet utilization, improving CPU utilization from 40% to 60%, resulting in annual savings of $7,568.
  • Obtained security certifications for KARS applications in collaboration with legal, privacy, and security teams.
  • Architected and prototyped a POC of upstream teams' messages to test the backend input and aggregation process for the new reporting architecture.

Service Engineer L2/L3

Microsoft
Redmond
08.2015 - 06.2018
  • Managed incident operations across 95,000 servers in hybrid cloud environments.
  • Administered Windows Server environments (2008R2, 2012, 2016) using SCOM clusters.
  • Supported infrastructure components including Active Directory, DNS, DHCP, and Hyper-V.
  • Managed patching, connectivity, VLAN configuration, and cluster operations.
  • Ensured SOX and CISM compliance across infrastructure services.

Award:
Received Star Certificate of Excellence for automating 700 monthly tickets using PowerShell, saving 20,000 man-hours annually.

Technical Business Analyst

CHI
Englewood
07.2019
  • Configured and provisioned compute and storage for 120 Windows and Linux servers, and managed a project to virtualize medical applications on cloud-based infrastructure (ESXi and Azure) using Git, Jenkins, Docker, Chef, and Nagios, supporting business integration and sales.

Associate RF QA Engineer

Alcatel-Lucent Enterprise
Irvine
12.2014
  • Coordinated teams for LTE and UMTS drive testing, optimized network KPIs, and delivered reports.

Escalation Support / Application Layer QA

Cisco
Chennai
11.2012
  • Ensured strong communication and QA for Cisco's Service Control Engine (a deep packet inspection module).

Education

Master of Science (MS), Electrical Engineering

University of South Florida
Tampa
01.2015

Skills

  • Scripting & Coding
  • Python
  • PowerShell
  • Shell scripting
  • Cloud exposure
  • AWS
  • Azure
  • Hyper-V
  • vSphere
  • Practices
  • Servant Leadership
  • IT Infrastructure Management
  • CI/CD DevOps
  • Splunk
  • PagerDuty
  • ServiceNow/Jira
  • KPI reporting (ThoughtSpot, Snowflake, Sigma)
  • Incident tooling

Certification

  • PMI-ACP®
  • CSPO
