ANIERUTAN BALAJI

Chennai

Summary

Product Support and Site Reliability Engineering manager with 13+ years of experience leading 24/7 production operations, incident management, and reliability engineering for large-scale cloud platforms. Reduced MTTR, led Sev-1/Sev-2 incident response, and built globally distributed follow-the-sun support teams for distributed cloud systems. Strong experience across AWS infrastructure, observability, incident tooling, and cross-functional execution to deliver highly available and resilient digital platforms. Managed production incident operations for a fleet of 95,000 servers across hybrid cloud infrastructure.

Work History

Senior Engineering Manager, Site Reliability & CX

Celigo

Hyderabad, India

02.2024 - Current

Lead 24/7 production operations and incident management for a global SaaS integration platform.

Reduced MTTR from 48 days to under 7 days (Feb 2024–Oct 2025) through cross-training, refined communication protocols, and diagnostic automation across all issue types.
Served as Incident Commander for Sev-1/Sev-2 incidents, cutting Sev-1 resolution from 36 hours to 5 hours and Sev-2 from 36 hours to 8 hours, which significantly improved customer experience and SLA compliance.
Designed and implemented an org-wide on-call model (July 2025) integrating Splunk P0 alerts with PagerDuty across 35 alerts and 9 physical services — cutting Sev-1 response time from 23 hours to under 3 minutes and ack time from 11 hours to under 3 minutes.
Built observability coverage across 68 logical services with 66 Splunk P0 alerts and 226 New Relic P0 alerts; developing customer-level observability dashboards using Splunk, Snowflake, and anomaly detection.
Developed RCA/COE automation framework (March 2024), reducing incident turnaround by 7 days and saving 100+ developer hours per cycle; established monthly COE governance cadence (June 2025) to track heat maps, root causes, and closure accountability.
Contributed to AI/ML-based anomaly detection (July 2024) that proactively identified critical anomalies pre-deployment, preventing outages during microservices migration.
Delivered AI-powered Slack and Zendesk operational bots (Oct 2024) automating runbook guidance and ticket summaries, saving 102 developer hours and 252 CSM hours monthly.
Integrated Cursor AI (Aug 2025) and pioneered Jira MCP integration, reducing debugging and root-cause analysis time from 1 day to ~2 hours — saving ~126 hours per quarter.
Built executive ThoughtSpot dashboards (Sep 2025) providing real-time visibility into SLA adherence, MTTR, backlog aging, and throughput, now used by leadership for performance tracking.
Improved independent ticket resolution rate from 75% (Q1 2024) to 87%, reducing engineering dependency; served as primary escalation owner for Tier-1 clients during critical incidents.
Architected DiagnostIQ, an AI agent using LangChain and Knowledge Base grounding to provide L1/L2 teams with automated diagnostic access to Splunk — reducing investigation time from 60 minutes to under 5 minutes with zero-hallucination accuracy.
Mentored team members into product advocates, driving 15–20 product improvements and 175+ documentation enhancements; authored 90+ runbooks to accelerate L2 technical ramp-up and self-service adoption.

Support Engineering Manager

Amazon

Chennai

10.2021 - 10.2023

Mentored 2 managers and 12 engineers, interviewed 130+ candidates, and built high-performing Kindle support teams.
Achieved 85% self-service resolution across 8 applications, reducing engineering dependency.
Delivered 87% SLA compliance through improved incident handling and escalation management.
Led migration of 38 data pipelines during FDP to Cradle streaming transformation.
Managed migration of a legacy security system to CloudAuth + AAA architecture.
Co-authored Correction of Error (COE) frameworks to improve operational standards.
Led development of an ML-based auditing tool to categorize and prioritize support tickets, reducing ticket volume by 10k+ annually.
Established standardized ticket management practices across Kindle support.
Led learning and capability programs improving communication, problem-solving, and technical leadership skills.

Senior Software Support Engineer

Amazon

01.2020 - 10.2021

Reduced the KDP Authors Reporting Service (KARS) initial ticket queue from 150 to fewer than 80 within a month, and brought it to a steady state of 20 within a year, by providing technical expertise.
Developed a Python tool of approximately 8,000 lines of code to migrate data from a production AWS account to a beta AWS account, delivering this technically challenging project in under a month through strategic planning.
Managed the feature release for GDPR compliance code delivery using agile methods, and wrote Python code to obfuscate data in 38 DynamoDB tables with nested primary key values.
Facilitated process improvement and standardization teams, reducing incoming tickets to KARS from the Customer Service & Support (CSS) team by 50%.
Developed Python tools to automate the re-driving of messages from Dead Letter Queues (DLQ) to Simple Queue Service (SQS), saving 720 tickets per year.
Analyzed fleet utilization, improving CPU utilization to 60% from 40%, resulting in annual savings of $7,568
Obtained security certifications for KARS applications in collaboration with legal, privacy, and security teams
Architected and prototyped a POC for upstream teams' messages to test the backend input and aggregation process for the new reporting architecture.

Service Engineer L2/L3

Microsoft

Redmond

08.2015 - 06.2018

Managed incident operations across 95,000 servers in hybrid cloud environments.
Administered Windows Server environments (2008R2, 2012, 2016) using SCOM clusters.
Supported infrastructure components including Active Directory, DNS, DHCP, and Hyper-V.
Managed patching, connectivity, VLAN configuration, and cluster operations.
Ensured SOX and CISM compliance across infrastructure services.

Award:
Received Star Certificate of Excellence for automating 700 monthly tickets using PowerShell, saving 20,000 man-hours annually.

Technical Business Analyst

CHI

Englewood

07.2019

Configured and provisioned compute and storage for 120 Windows and Linux servers, and managed a project to virtualize medical applications on cloud-based infrastructure (ESXi and Azure) using Git, Jenkins, Docker, Chef, and Nagios, supporting business integration and sales.

Associate RF QA Engineer

Alcatel-Lucent Enterprise

Irvine

12.2014

Coordinated teams for LTE and UMTS drive testing, optimized network KPIs, and delivered reports.

Escalation Support / Application Layer QA

Cisco

Chennai

11.2012

Ensured strong communication and QA for Cisco's Service Control Engine (a deep packet inspection module).

Education

Master of Science (MS), Electrical Engineering

University of South Florida

Tampa

01.2015

Skills

Scripting & Coding
Python
PowerShell
Shell scripting

Cloud exposure
AWS
Azure
Hyper-V
vSphere

Practices
Servant Leadership
IT Infrastructure Management
CI/CD and DevOps
Splunk
PagerDuty
ServiceNow/Jira
KPI reporting (ThoughtSpot, Snowflake, Sigma)
Incident tooling

Certification

PMI-ACP®
CSPO
