KORUPOLU MANOJ KUMAR

Lead Site Reliability Engineer | Lead Platform & Production Operations
Hyderabad

Summary

Accomplished Lead Site Reliability Engineer with more than a decade of experience in Production Support Operations, SRE best practices, and large-scale application transitions. Proven ability to build and lead high-performing global teams, drive incident, problem, and change management, and ensure seamless platform reliability across cloud, hybrid, and on-prem environments. Expert in observability, automation, performance tuning, and database management, optimizing system uptime, security, and operational efficiency. Passionate about solving complex technical challenges, enhancing system resilience, and driving continuous improvement in production environments.

Overview

12 years of professional experience
6 years of post-secondary education
3 languages

Work History

Lead Site Reliability Engineer

EPAM Systems, Inc
Hyderabad
02.2024 - Current
  • Built and Scaled High-Performing Teams: Established and led a team of 15+ engineers from the ground up, driving excellence in Site Reliability Engineering (SRE) and operational best practices.
  • End-to-End Application Transition & Ownership: Spearheaded the seamless transition of critical supply chain applications from external vendors, ensuring zero disruption for one of Canada's largest retail giants. Developed strategic frameworks for knowledge transfer, process optimization, and automation, enhancing system efficiency and maintainability.
  • Driving SRE & Observability Excellence: Led the observability strategy and SRE operations end-to-end, integrating monitoring, alerting, and incident response to improve system uptime, reliability, and performance.
  • Multi-Platform System Management: Oversaw and optimized diverse application landscapes running on Linux, AIX, Windows, Azure, and SaaS platforms, ensuring high availability and compliance with industry best practices.
  • Database Administration & Optimization: Managed Oracle, Azure SQL, MS SQL, MongoDB, and PostgreSQL instances, ensuring optimal performance, scalability, and security for mission-critical applications.
  • Robust IT Service & Incident Management: Led Change Management, Problem Management, Incident Management, and Major Incident Management with a proactive approach, minimizing downtime and driving faster resolution times.
  • DevOps & Automation Expertise: Worked with Azure DevOps (ADO), Jenkins, Chef, Bitbucket, KeePass, and Vault to streamline CI/CD pipelines, automate deployments, and enhance security.
  • Enterprise-Grade Monitoring & Logging: Managed Sumo Logic, New Relic, and Tivoli to enhance system observability, proactively detect issues, and optimize application performance.
  • Agile & ITSM Leadership: Spearheaded ITSM processes in ServiceNow, ensuring efficient ticket resolution, change tracking, and SLA adherence. Strategized sprint planning and workload distribution in Jira, enhancing productivity and project execution.
  • Operational Excellence in Supply Chain Systems: Directed and maintained critical supply chain functions, including Order Management Systems (OMS), Warehouse Management Systems (WMS), Distribution Center (DC) Operations, Time Management Operations, AGV (Automated Guided Vehicle) Operations, Carrier Operations, and Transportation Management Systems (TMS), ensuring seamless logistics and order fulfillment.
  • Evaluated new technologies and tools to enhance overall system performance, stability, and security.
  • Developed custom scripts/tools as needed to automate routine tasks, increasing overall team productivity and efficiency.
  • Collaborated with cross-functional teams to develop, test, and deploy scalable software solutions.

Lead Site Reliability Engineer

Petco
Hyderabad
08.2023 - 02.2024
  • Built and Led a High-Impact SRE Team from Scratch: Successfully established and managed a high-performing SRE team, driving best practices in site reliability, automation, and operational excellence to support Petco's critical e-commerce platform.
  • Seamless Transition of Applications & Infrastructure: Spearheaded the end-to-end transition of e-commerce applications and infrastructure from third-party vendors, ensuring zero downtime, enhanced scalability, and long-term sustainability.
  • Proactive Monitoring & Incident Resolution: Led real-time monitoring, alerting, and incident response strategies, ensuring the platform’s high availability, security, and performance while minimizing customer impact.
  • Infrastructure Optimization & High Availability: Managed and optimized servers, databases, and network resources, ensuring scalability, reliability, and fault tolerance of the e-commerce site during high-traffic events.
  • End-to-End Automation & DevOps Implementation: Developed and maintained automation scripts and tools for configuration management, deployment, and resource provisioning, reducing manual effort and increasing efficiency.
  • Capacity Planning & Peak Readiness: Analyzed traffic patterns and system usage to forecast demand, scaling resources dynamically to handle seasonal spikes and promotional events.
  • Performance Tuning & Optimization: Identified and resolved application performance bottlenecks, implementing optimizations that enhanced site speed, transaction processing, and user experience.
  • Security & Compliance Leadership: Collaborated with security teams to implement robust security practices, conduct vulnerability assessments, and ensure PCI DSS compliance, safeguarding customer data and transactions.
  • Disaster Recovery & Business Continuity: Designed and maintained disaster recovery plans and backup strategies, ensuring rapid recovery and data integrity in case of system failures.
  • CI/CD & Agile Release Management: Led CI/CD pipeline implementation, streamlining deployments and ensuring smooth feature rollouts with minimal downtime.
  • Cross-Functional Collaboration & Documentation: Worked closely with developers, QA engineers, and product managers, documenting system configurations, best practices, and incident response protocols to enhance team knowledge and onboarding.
  • 24/7 Operational Support & On-Call Management: Played a key role in the on-call rotation, providing round-the-clock support for critical incidents and ensuring business continuity for Petco's e-commerce operations.

Lead Operations Engineer

Gap Inc
Hyderabad
12.2021 - 08.2023

  • Built & Led a High-Performing Operations Team: Established and managed a technical team from the ground up, ensuring end-to-end support for critical enterprise applications, including CA Workload Automation and other infrastructure components.
  • Seamless Transition & Migration Leadership: Spearheaded the transition of applications across multiple environments, ensuring zero disruption while improving scalability, automation, and long-term sustainability.
  • Enterprise Workload Automation & Scheduling: Led the monitoring, scheduling, and administration of the CA Workload Automation tool, optimizing job execution and ensuring on-time batch processing.
  • Disaster Recovery & Business Continuity: Designed and maintained robust disaster recovery (DR) and business continuity (BCP) plans, ensuring minimal downtime and data integrity during outages.
  • Infrastructure & Security Management: Managed system upgrades, patches, and security controls, implementing firewalls, access controls, and antivirus solutions to protect critical enterprise systems.
  • Performance Optimization & Proactive Monitoring: Conducted performance tuning and system optimization for CA Workload Automation, improving efficiency and reducing failures. Monitored batch jobs, alerts, and system health using Nagios to ensure high availability.
  • Incident & Change Management Expertise: Led incident response (P1-P4, G-1, G-2), change management, and risk mitigation strategies, collaborating with onshore, offshore, and client teams to ensure smooth operations.
  • Cross-Functional Collaboration & Agile Leadership: Worked closely with cross-functional teams to align technical requirements, improve operational processes, and drive continuous service improvements (CSI) through retrospectives and post-mortems.
  • Automation & DevOps Implementation: Designed batch automation scripts, optimized job scheduling, and streamlined non-prod environment cleanups to enhance system stability and reduce manual effort.
  • Comprehensive Reporting & Dashboarding: Created business management reviews (BMRs), dashboards, and incident reports to track system health, identify trends, and proactively address potential risks.
  • Technical Mentorship & Team Development: Mentored and trained junior engineers, fostering skill development and knowledge sharing to enhance overall team efficiency.

Additional Impactful Contributions

✔ Led migration activities across environments, ensuring seamless transitions.
✔ Conducted regular health checks (weekly/monthly) to maintain system stability.
✔ Managed patching activities and DST (Daylight Saving Time) adjustments.
✔ Handled XML conversions, data analysis, and system integrations.
✔ Ensured adherence to ITIL best practices, handling story-based and ServiceNow-based requests.
✔ Facilitated risk calls with client, onshore, and offshore teams to assess and mitigate potential threats.

Senior Technical Support Engineer

ServiceNow
Hyderabad
07.2021 - 12.2021
  • Expert-Level L3 Technical Support & Customer Success: Served as an L3 Senior Technical Support Engineer, diagnosing, troubleshooting, and resolving complex ServiceNow platform issues to ensure high availability, performance, and reliability.
  • High-Severity Incident Management: Led the resolution of critical performance issues (P1/P2 cases), including tool slowness, unavailability, and database bottlenecks, ensuring swift mitigation and minimal business disruption.
  • Performance Tuning & Optimization: Analyzed and optimized JVM performance, CPU utilization, and memory management using MAT (Memory Analyzer Tool), reducing crashes and improving stability.
  • Database Administration & Performance Enhancement: Provided DBA solutions, resolving database sync replica issues, indexing problems, fragmentation, and bad query performance, significantly improving query response times and database efficiency.
  • Proactive Debugging & Root Cause Analysis: Conducted in-depth Java thread analysis and stack trace debugging, resolving thread contention, deadlocks, and performance bottlenecks to enhance platform responsiveness.
  • Customer-Centric Issue Resolution: Took complete ownership of customer-reported issues, ensuring timely resolution and providing prompt, accurate feedback to drive customer satisfaction and retention.
  • Knowledge Base & Documentation Leadership: Developed comprehensive tech notes, troubleshooting guides, and knowledge articles, contributing to the ServiceNow knowledge base to enhance team-wide efficiency.
  • SLA-Driven Support & Service Excellence: Ensured strict adherence to SLAs, prioritizing issues based on severity, and implementing proactive monitoring and maintenance strategies to prevent recurring problems.
  • Custom Application Support & Maintenance: Provided ongoing support for custom applications, optimizing configurations and guiding customers through best practices for performance stability and scalability.

Senior Operations Engineer

Wells Fargo
Hyderabad
06.2020 - 07.2021
  • L2 & L3 Platform Support for Critical Banking Applications: Provided end-to-end platform support for 7+ commercial card banking applications, ensuring high availability, optimal performance, and security.
  • Multi-Technology Expertise & Middleware Support: Managed applications running on Unix, Linux, Java, WebLogic, and WebSphere, ensuring seamless integration and efficient middleware performance.
  • Comprehensive Patching & System Maintenance: Led OS patching, database patching, and middleware services updates, strengthening system security, stability, and compliance.
  • Advanced Application & Infrastructure Monitoring: Utilized AppDynamics, Geneos, and Splunk for proactive performance monitoring, log analysis, and troubleshooting, reducing downtime and enhancing user experience.
  • Database & ETL Optimization: Provided extensive Oracle Exadata DB support, troubleshooting database-related performance issues and optimizing ETL (Informatica) processes for efficient data transformation.
  • Automation & Scheduling for Efficiency: Leveraged SSP, HPOO, and Autosys to automate workflows, manage batch job scheduling, and optimize system performance.
  • Incident, Change, & Release Management Leadership: Applied ITIL best practices, ensuring swift resolution of incidents (P1-P4), seamless change rollouts, and structured problem management, minimizing service disruptions.
  • Code Deployment & Application Maintenance: Assisted with defect fixes, bug releases, application patching, upgrades, and migrations, collaborating closely with development teams to enhance system reliability.
  • Business Continuity & Disaster Recovery Support: Managed network switch replacements, traffic routing, and production-to-BCP environment transitions, ensuring uninterrupted business operations.
  • Data Quality & Compliance Excellence: Conducted data analysis, data quality checks, and investigations, resolving client queries while maintaining compliance with banking industry standards.
  • Cross-Functional Communication & Stakeholder Collaboration: Effectively coordinated with management, development teams, and end users, ensuring transparency and efficiency in support operations.
  • Network & Traffic Management Expertise: Utilized F5 tools and AppViewX to monitor and optimize network traffic flow and application performance across production and business continuity environments.

Senior Operations Engineer

Gap Inc
Hyderabad
11.2017 - 06.2020
  • Enterprise Operations & Incident Management: Led L2/L3 application support, efficiently managing incidents, changes, and problem resolutions while ensuring high system availability and minimal disruptions.
  • End-to-End System Monitoring & Automation: Implemented Nagios-based alert monitoring, automated batch processes using JavaScript, and developed Unix scripts for proactive health monitoring of databases and servers across production and non-production environments.
  • Seamless Application Maintenance & Patch Management: Oversaw patching, re-installation, and configuration of critical applications, ensuring optimal performance, security compliance, and system stability.
  • Database & Infrastructure Migration Leadership: Managed database migrations, batch flow monitoring, and system cutovers, ensuring seamless transitions with minimal downtime while enhancing performance.
  • Risk Mitigation & Stakeholder Collaboration: Conducted risk calls with onshore, offshore, and client teams, analyzing migration impacts and comparing R12 vs. R11 feature enhancements to drive data-driven decisions.
  • Incident Response & Process Optimization: Integrated PagerDuty for real-time incident management, improving alerting efficiency and response times for critical issues.
  • Oracle Exadata Expertise & Database Synchronization: Provided Exadata support, including archival, purge processes, database synchronization for reporting, and retention period analysis, optimizing data management.
  • ServiceNow Reporting & Business Insights: Designed ServiceNow reports, dashboards, and Business Modification Requests (BMRs) for real-time incident tracking and operational insights, ensuring data-driven decision-making.
  • Cross-Functional Collaboration & Client Support: Partnered with management, developers, and end-users to conduct data quality checks, investigate client queries, and resolve system discrepancies, strengthening business operations.
  • Vendor & Tool Evaluation: Performed market analysis and tool comparisons, recommending best-fit solutions to enhance operational efficiency and system capabilities.
  • Implemented continuous improvement initiatives for increased overall operational effectiveness and reduced waste.

Senior Analyst

TCS Synergy Park
Hyderabad
08.2013 - 11.2017
  • ETL & Data Management for Credit Card Systems: Worked on Bacardi, the central credit card data repository, ensuring seamless transformation and storage of consumer credit card data from TSYS into Vertica & DB2 databases.
  • End-to-End Data Processing & Transmission: Developed ETL (Datastage) pipelines to process and transmit critical financial data to downstream applications based on business requirements.
  • Performance Optimization & Analytics: Utilized Bacardi data for analytics, improving data accessibility, reporting, and compliance.
  • Software Development Lifecycle (SDLC) Management: Designed, tested, and deployed Datastage ETL jobs, collaborating with onsite teams for production releases and resolving end-user issues.
  • Database & Unix Scripting Expertise: Developed and optimized UNIX shell scripts, ensuring efficient data processing and automated task execution in ETL workflows.
  • 24/7 Production & Batch Job Monitoring: Ensured batch jobs adhered to schedules, minimizing delays and system downtime through proactive monitoring.
  • Incident, Change & Problem Management: Led critical incident resolution (P1-P4 issues), ensuring rapid troubleshooting and mitigation of infrastructure, batch, and application issues.
  • Cross-Team Risk & Issue Resolution: Organized and led risk calls with onshore, client, and offshore teams, proactively mitigating risks and ensuring service continuity.
  • System Health & Maintenance Oversight: Provided round-the-clock support in a shift-based model, handling maintenance, release management, and system stability activities.
  • Data Quality Assurance & Client Issue Resolution: Conducted data analysis, quality checks, and investigations, swiftly addressing client queries and discrepancies.
  • Stakeholder Communication & Problem Reporting: Maintained clear and effective communication with business leaders, developers, and end-users, ensuring transparency in problem resolution and system updates.
  • Enhanced team productivity by streamlining workflow processes and implementing time-saving strategies.

Education

MBA - Business Analytics

SRM
Chennai
01.2023 - 01.2025

Bachelor of Technology - Electrical, Electronics and Communications Engineering

GITAM University
Visakhapatnam, India
06.2009 - 05.2013

Skills

Log analysis

Security best practices

Performance tuning

Network troubleshooting

Incident management

System monitoring

ITIL framework

Operations management

Database administration

Scheduling and planning

Infrastructure automation

Problem-solving

Software

Unix/Linux

Windows

Oracle

Microsoft Azure

Chef

Ansible

GitHub

Jenkins

ADO

Bitbucket

Terraform

Shell Scripting

Python

DataStage

Informatica

CA Workload Automation ESP

Control-M

Autosys

MAT

Splunk

New Relic

Sumo Logic

AppDynamics

Geneos

Nagios

ServiceNow

Jira

WebSphere

WCS (Web Commerce Server)

Cloudflare (CDN)

Web Servers (Apache Tomcat, IIS)

JBoss

Developer Tools

SQL Developer

TOAD

Kubernetes

Grafana

O365 Applications

IBM Db2

PostgreSQL

MySQL

Azure SQL

MongoDB

Timeline

Lead Site Reliability Engineer

EPAM Systems, Inc
02.2024 - Current

Lead Site Reliability Engineer

Petco
08.2023 - 02.2024

MBA - Business Analytics

SRM
01.2023 - 01.2025

Lead Operations Engineer

Gap Inc
12.2021 - 08.2023

Senior Technical Support Engineer

ServiceNow
07.2021 - 12.2021

Senior Operations Engineer

Wells Fargo
06.2020 - 07.2021

Senior Operations Engineer

Gap Inc
11.2017 - 06.2020

Senior Analyst

TCS Synergy Park
08.2013 - 11.2017

Bachelor of Technology - Electrical, Electronics and Communications Engineering

GITAM University
06.2009 - 05.2013