
Shrey Mehrotra

Gurugram, Haryana

Summary

Incident, Change, Problem, and Service Reliability Engineer with 6+ years of experience improving service uptime, operational stability, and reliability in enterprise environments. Proven track record of reducing MTTR by driving structured incident resolution, optimizing monitoring/alerting pipelines, and ensuring adherence to SLOs and error budgets. Adept at leading cross-functional bridges, performing RCA/PIRs, and implementing preventive measures that cut repeat incidents by 25%+. Strong communicator skilled in stakeholder management, change governance, and post-incident reviews.

Overview

6 years of professional experience

Work History

INCIDENT RESPONSE ANALYST (Incident, Problem and Change Manager)

Cosm Technologies India Pvt Ltd
04.2024 - Current
  • Led end-to-end incident lifecycle for P1/P2 outages across AWS, GCP and on-prem infrastructure, coordinating resolver teams and ensuring SLA compliance.
  • Implemented and optimized real-time telemetry and alerting pipelines for SaaS platforms using Prometheus, CloudWatch, Grafana, and related tools, achieving a 40% reduction in MTTD.
  • Acted as the primary point of contact during critical incidents, performed first-level troubleshooting, and ensured >99.95% service uptime by swiftly resolving P1/P2 incidents via ServiceNow workflows and leading cross-functional bridges for incident resolution.
  • Delivered accurate, timely communication to stakeholders including customers and vendors.
  • Conducted root cause analysis (RCA) and post-incident reviews (PIR) with stakeholders from Product, Engineering, and Support, leading to 25% fewer repeat incidents.
  • Ensured strict adherence to escalation protocols while maintaining high-quality incident documentation, SOPs, and technical guides.
  • Reduced resolution time by 30% through streamlined processes and effective leadership during outages.
  • Integrated SLO and error budget tracking into monitoring systems to guide operational priorities.

Senior Incident & Problem Management

British Telecom
09.2021 - 04.2024
  • Led end-to-end lifecycle of P1/P2 incidents, achieving >95% SLA adherence and ensuring rapid mitigation and resolution.
  • Participated in on-call rotations for high-severity events, performing first-line fixes and escalating with complete technical context.
  • Managed 24×7 monitoring of enterprise LAN/WAN, VPNs, firewalls, and cloud workloads.
  • Delivered real-time updates and periodic reports to key stakeholders during major outages.
  • Conducted impact and risk assessments to prioritize incidents based on urgency, business impact, and SLAs.
  • Facilitated Post-Incident Reviews (PIRs), collaborated with cross-functional teams on preventive measures, and ensured resolution steps were captured in knowledge bases.
  • Created Major Incident Reports (MIRs), RCA documentation, and maintained detailed records in ServiceNow.
  • Developed proactive alerts in LogicMonitor, ThousandEyes, and GCP Monitoring to catch degradations before impact.
  • Assisted in service failover testing for DR readiness.
  • Delivered training and knowledge-sharing sessions to teams and new joiners to reduce recurrence of major incidents.
  • Drove trend analysis and proactive strategies to enhance service reliability and reduce incident volume.

Service Reliability Engineer (SRE)

British Telecom
10.2020 - 09.2021
  • Troubleshot open incidents related to network issues, including firewalls, VPNs, routers, and switches.
  • Optimized router settings for customer-specific and failover requirements, while creating RCAs, SOPs, and runbooks to reduce incident recurrence.
  • Troubleshot network performance issues involving QoS, bandwidth policing, traffic shaping, latency, jitter, and bandwidth utilisation.
  • Implemented end-to-end service monitoring for cloud applications using Prometheus, CloudWatch, and Grafana, improving MTTD by 35%.
  • Defined and tracked SLOs, SLIs, and error budgets for critical services, providing actionable insights to development and product teams.
  • Tracked transmission alarms in NMS, diagnosed affected circuits, and executed remediation plans in coordination with MPLS core and transmission teams.
  • Supported customer products including MPLS, Internet, ISDN, ADSL/DSL, and VoIP; handled escalated cases and ensured timely resolution through SLA management and incident quality management.
  • Performed technical troubleshooting with circuit providers and hardware (router and switch) vendors on data circuit connectivity, such as serial and Ethernet.

Technical Content Developer

PrepInsta Technologies Pvt Ltd
04.2019 - 09.2020
  • Prepared SEO-focused website content related to the company's product.
  • Prepared content and questions on logical reasoning, aptitude, and English.
  • Mentored summer interns on the company’s products.
  • Awarded for the best SEO page.

Education

B.Tech - Electronics and Communications

Dr. A. P. J. Abdul Kalam Technical University
Raj Kumar Goel Institute Of Technology
06.2019

Skills

  • Major Incident Management (MIM)
  • Problem Management (RCA & PIR)
  • ITIL Framework (Incident, Change, Problem)
  • Stakeholder Communication
  • Team Collaboration and Leadership
  • ServiceNow, Jira
  • Experienced in managing AWS cloud solutions (EC2, IAM, RDS, ELB, etc.)
  • Tools: SolarWinds, Grafana, Prometheus, SevOne, ThousandEyes, CloudWatch, NPMD, PuTTY, Cisco Meraki, Palo Alto
  • SLA Management & Escalation Handling
  • CAB & Change Coordination
  • Automated Escalation Workflows
  • On-Call Coordination / 24x7 Support
  • Protocols: TCP/IP, OSPF, BGP, EIGRP, STP, DTP, VTP, VLAN, VPN, MPLS, NAT, TACACS
  • Network and Security Concepts (CCNA, CCNP, PCNSE)
  • Operating Systems: Linux, Windows
  • Languages: Java, Python (basic)
  • Good understanding of REST APIs
  • Good knowledge of Microsoft Excel, PowerPoint, and database management systems (MySQL, PostgreSQL, Oracle DB)

Hobbies and Interests

Traveling, gym, gaming, watching movies and series, learning new skills