Site Reliability Engineer (SRE) and Major Incident Manager (MIM) with 8+ years of experience in incident response, infrastructure monitoring, and operational reliability. Proven track record of establishing comprehensive major incident management processes, delivering training, and leading teams. Proficient with monitoring tools such as PagerDuty, LogicMonitor, and New Relic, using them to improve observability, onboard applications and teams, and streamline alerting workflows.
Incident Management: Incident response process design, team leadership, root cause analysis (RCA), and action-item follow-up
Infrastructure Monitoring: Tool ownership and management (PagerDuty, LogicMonitor, New Relic)
Observability: Application and infrastructure monitoring, alert onboarding, team onboarding
Cross-Functional Collaboration: Cloud, compute, network, database (DB), IT operations, DevOps, and application development teams
Workflow Automation: Automation of repetitive tasks with Power Automate
Monitoring Tools: Grafana, Kibana, PagerDuty, LogicMonitor, New Relic
Soft Skills: Stakeholder management, team leadership, communication