
Results-driven Site Reliability Engineer with over 16 years of IT experience, including more than 7 years dedicated to leading DevOps, Cloud Infrastructure, and SRE initiatives. Expertise in architecting highly available, scalable, and secure systems utilizing AWS and Kubernetes. Proficient in designing SLO-driven operations, automating incident response processes, and spearheading large-scale reliability programs across diverse global teams. Committed to enhancing system performance and driving operational excellence through innovative solutions and collaborative teamwork.
• Spearheaded an SRE transformation, introducing SLO-based operations and reducing incident MTTR by 35%.
• Architected a secure AWS multi-account setup using private VPCs, VPC Endpoints, and EKS clusters.
• Built Terraform-based automation pipelines, improving infrastructure provisioning time by 60%.
• Implemented a Prometheus and Grafana monitoring stack, integrated with CloudWatch and Kiali, for full observability.
• Led a 6-member SRE/DevOps team; established on-call rotations and automated runbook remediation.
• Implemented CI/CD pipelines using Jenkins and GitLab CI for microservices deployed on OpenShift and EKS.
• Developed IaC modules in Terraform Cloud, enabling consistent infra deployments across environments.
• Deployed centralized logging and alerting systems using CloudWatch, Fluentd, and Grafana.
• Introduced automated backup, scaling, and health-check systems, improving uptime by 25%.
• Enhanced Jenkins build pipelines with parameterized deployments and Groovy DSL.
• Supported global-scale EKS and EC2 deployments, ensuring HA and DR readiness.
• Reduced infrastructure drift by introducing Terraform modules with CI-based validation.
• Partnered with SRE teams to implement postmortem reviews and reliability metrics tracking.
CI/CD implementation expertise