
SRE Manager with 21+ years of global experience across the UAE, USA, and India, leading large-scale reliability, cloud, and platform engineering initiatives for high-availability enterprise systems. Proven track record of building and mentoring SRE teams, owning 24×7 incident and security on-call operations, and acting as Incident Commander for major outages and security events.
Deep expertise in designing and operating resilient, scalable platforms across Azure, AWS, VMware, and Kubernetes/OpenShift environments, with strong emphasis on SLO-driven engineering, error budgets, and automation-first operations. Successfully implemented self-healing systems, automated remediation, and proactive observability to significantly reduce MTTR and operational toil.
Strong background in DevSecOps and compliance-driven environments (PCI-DSS, GDPR), embedding security controls and automated checks into CI/CD pipelines. Experienced in performance optimization, disaster recovery, and enterprise-scale traffic management using CDN and WAF platforms. Known for aligning reliability engineering with business goals, customer experience, and sustainable platform growth.
Enterprise Observability Platform Implementation: Led the design and rollout of a centralized observability platform spanning metrics, logs, traces, and alerts across hundreds of services. Standardized SLIs, SLOs, dashboards, and alerting policies, enabling proactive issue detection and significantly improving incident response and operational visibility across teams.
SLO-Driven Reliability Governance & Error Budgets: Introduced SLO- and error-budget–driven reliability governance to enable data-driven trade-offs between feature velocity and system stability. Embedded error budgets into release decisions, reducing repeat production incidents and improving overall service reliability.
Incident Management, On-Call Health & Alert Quality Optimization: Designed and governed 24×7 incident management and on-call practices, defining clear escalation paths, incident command roles, and RCA standards. Improved alert quality and reduced on-call fatigue, strengthening response effectiveness during major production incidents.
AI-Assisted Alert Noise Reduction & On-Call Optimization: Led adoption of AI-assisted anomaly detection and alert correlation to reduce false positives and alert noise. Improved signal-to-noise ratio, accelerated incident triage, and enabled engineers to focus on high-impact reliability issues.
DevSecOps & Shift-Left Reliability Enablement: Owned the integration of automated security and quality controls into CI/CD pipelines, embedding reliability and compliance checks early in the delivery lifecycle. Reduced production risk while maintaining delivery velocity in regulated environments.