Florance T. H

Hyderabad

Summary

Service Reliability Engineer with 8+ years of experience supporting large-scale enterprise production systems. Proven expertise in incident response, Site Reliability Engineering (SRE) principles, service monitoring, alerting, root cause analysis (RCA), and operational excellence. Strong background in SLI, SLO, and SLA governance, automation to reduce operational toil, and cross-functional collaboration with Engineering, Cloud, Network, and Database teams. Known for calm execution during high-severity incidents and for driving long-term reliability improvements.

Overview

9 years of professional experience

Work History

Sr. SRE/Incident and Problem Management Lead

Zelis Healthcare
Hyderabad
04.2024 - Current
  • Own end-to-end incident and problem management for business-critical production services, leading P1/P2 incidents across cloud, network, database, and application layers.
  • Act as Incident Commander during high-impact outages, ensuring rapid mitigation, clear stakeholder communication, and strict SLA adherence.
  • Lead Problem Management initiatives, identifying recurring and chronic issues through trend analysis, RCA correlation, and incident pattern review.
  • Apply SRE principles to improve service reliability, availability, operational readiness, and long-term stability.
  • Own and optimize monitoring and alerting systems (PagerDuty, New Relic, LogicMonitor), improving signal quality and reducing MTTA/MTTR.
  • Conduct 100+ structured RCAs, driving corrective and preventive actions (CAPA) to eliminate repeat failures and systemic reliability gaps.
  • Define, track, and report SLIs, SLOs, and reliability metrics, delivering regular service health and performance insights to leadership.
  • Automate incident and problem workflows, reporting, and follow-ups to reduce operational toil and improve response efficiency.
  • Partner with the Engineering, Cloud, Network, Database, and Product teams to improve runbooks, escalation paths, service operability, and resilience.
  • Lead and mentor a global 24/7 incident and problem response team, supporting on-call operations and reliability standards.

Major Incident Manager (Escalations Lead – Microsoft OneStore)

Milestone Technologies
Hyderabad
09.2023 - 03.2024
  • Led incident response for large-scale, distributed cloud services supporting Microsoft OneStore.
  • Coordinated global engineering, operations, and product teams across Azure and Microsoft 365 platforms.
  • Drove high-severity incident mitigation, RCA execution, and closure of action items.
  • Delivered concise, real-time updates to senior stakeholders during critical production incidents.
  • Translated incident learnings into problem management initiatives to improve long-term service reliability.

Release Manager – Operations & DevOps

JD Sports Fashion India PLC
Hyderabad
10.2022 - 09.2023
  • Managed production releases with a focus on service stability, reliability, and downtime reduction.
  • Strengthened observability and monitoring using Grafana, LogicMonitor, and New Relic.
  • Collaborated with DevOps and Engineering teams to resolve infrastructure and application performance issues.
  • Supported incident response and post-release stability improvements for enterprise systems.

Technical Support Engineer

Amazon Development Center
Hyderabad
08.2021 - 09.2022
  • Provided Tier-2 production support for enterprise-scale services.
  • Investigated hardware, software, and network issues impacting customer-facing systems.
  • Escalated defects to engineering and improved operational documentation.
  • Collaborated with engineering teams to resolve complex technical problems effectively.
  • Assisted in training new team members on company systems and procedures.
  • Resolved complex technical problems through root cause analysis techniques.

Technical Support Analyst

Induct Solutions Pvt Ltd
Hyderabad
03.2017 - 08.2021
  • Delivered enterprise technical support and incident triage.
  • Improved operational efficiency through documentation and user training.
  • Analyzed client feedback to improve support services and user experience.
  • Managed ticketing system and prioritized support requests effectively.

Education

MTech -

Skills

  • Site reliability engineering
  • Service availability engineering
  • Incident response management
  • Root cause analysis
  • SLA and SLO management
  • Monitoring and alerting
  • Performance observability
  • On-call operations
  • Automation strategies
  • ITIL management
  • Distributed systems support
  • Cross-functional collaboration
  • Operational documentation

Accomplishments

  • Built and scaled a Major Incident Management and Problem Management program from scratch.
  • Resolved 250+ P1/P2/P3 incidents with strong SLA adherence.
  • Conducted 100+ RCAs, driving measurable reliability improvements.
  • Reduced MTTA/MTTR through automation and alert optimization.
  • Led global on-call and escalation operations for mission-critical services.
