T.H. Florance

Hyderabad

Summary

Site Reliability Engineer (SRE) and Major Incident Manager (MIM) with 8+ years of expertise in incident response, infrastructure monitoring, and operational reliability. Proven track record in establishing comprehensive major incident management processes, delivering training, and leading teams. Proficient in utilizing monitoring tools like PagerDuty, LogicMonitor, and New Relic to optimize observability, onboard applications and teams, and streamline alerting workflows.

Overview

8 years of professional experience

Work History

Sr. Site Reliability Engineer / Major Incident Manager

Zelis Healthcare
04.2024 - Current
  • Built and implemented the Major Incident Management (MIM) process from scratch, establishing streamlined workflows for incident resolution
  • Formed and managed a team of 6 associates, providing mentorship and leadership in handling critical incidents
  • Trained the Global Operations Center team and introduced Level 1 Incident Analyst roles
  • Owned and optimized monitoring tools like PagerDuty, LogicMonitor, and New Relic, onboarding alerts, applications, and teams to ensure comprehensive observability and proactive issue resolution
  • Automated repetitive tasks in the incident management workflow using Power Automate, enhancing team productivity and response times
  • Collaborated closely with cloud, compute, network, database, IT operations, DevOps, and application development teams to resolve incidents and maintain operational stability
  • Managed P1/P2 incidents, conducted root cause analyses (RCA), and ensured timely follow-up on action items to prevent recurrence
  • Presented incident insights to stakeholders at monthly operations meetings to highlight trends and drive corrective actions

Major Incident Manager

Milestone Technologies (Client: Microsoft)
09.2023 - 03.2024
  • Served as the Major Incident Manager for Microsoft One Store, leading the resolution of P1 and P2 issues impacting global operations
  • Conducted RCA sessions for critical incidents and ensured proper tracking and completion of follow-up action items
  • Facilitated incident bridges, ensuring clear communication among engineering, IT, and operations teams
  • Worked extensively with Azure and Microsoft 365 environments, addressing cloud infrastructure and application-level issues

Release Manager - Operations & DevOps

JD Sports Fashion India PLC
10.2022 - 09.2023
  • Managed end-to-end release processes for an e-commerce platform, minimizing downtime and ensuring seamless deployments
  • Worked with tools like New Relic, LogicMonitor, and Grafana for proactive monitoring and alert configuration
  • Resolved infrastructure and application issues by collaborating with development and operations teams
  • Improved observability by onboarding key metrics and configuring dashboards to monitor performance and availability

Product Support Engineer

Amazon Development Center
08.2021 - 09.2022
  • Managed and mentored a team of 15 associates for 3 months as part of a buddy-up program, fostering skill development and improving team efficiency
  • Provided Tier-2 technical support for Amazon's digital devices and games, ensuring timely resolution of hardware, software, and network issues
  • Used tools like internal ticketing systems (Omni tool) and APM tools to troubleshoot and resolve performance issues
  • Documented resolutions and escalated complex issues to product development teams for further investigation

Technical Support Engineer

Induct Solutions Pvt Ltd
03.2017 - 08.2021
  • Delivered comprehensive technical support for enterprise applications, systems, and hardware
  • Implemented incident management processes to streamline ticket resolution workflows and reduce downtime
  • Worked with tools like ServiceNow to manage and monitor incidents effectively
  • Conducted system maintenance, optimized configurations, and provided user training to enhance operational efficiency

Education

Bachelor of Science - Electronics and Communications Engineering

St. Mary’s Women’s Engineering College

Skills

    Incident Management: Incident response process design, team leadership, RCA, and action follow-ups

    Infrastructure Monitoring: Tool ownership and management (PagerDuty, LogicMonitor, New Relic)

    Observability: Application and infrastructure monitoring, alert onboarding, team onboarding

    Cross-Functional Collaboration: Cloud, compute, network, database (DB), IT operations, DevOps, and application development teams

    Workflow Automation: Streamlined repetitive tasks using Power Automate

    Monitoring Tools: Grafana, Kibana, PagerDuty, LogicMonitor, New Relic

    Soft Skills: Stakeholder Management, Team Leadership, Communication
