
AWS-focused Site Reliability Engineer with 6+ years of experience managing scalable cloud infrastructure. Expertise in SRE practices, infrastructure automation (Terraform), Kubernetes (EKS), and observability (Prometheus, Grafana, CloudWatch). Proven ability to maintain 99.9% uptime, reduce deployment time by 40%, and optimize cloud costs through FinOps strategies.
• Managed large-scale AWS infrastructure (EC2, S3, VPC, IAM, EKS) across environments, ensuring high availability and fault tolerance
• Maintained 99.9% uptime SLA through proactive monitoring, alerting, and incident response
• Implemented SRE practices (SLA/SLO monitoring, observability, incident management) to improve system reliability
• Designed and built CI/CD pipelines using Jenkins and Azure DevOps, reducing deployment time by 40%
• Automated infrastructure provisioning using Terraform (modular approach) and Ansible, reducing manual effort by 60%
• Deployed and managed Kubernetes clusters (EKS) to support containerized microservices with auto-scaling and self-healing
• Built centralized monitoring and observability systems using Prometheus, Grafana, and AWS CloudWatch
• Configured real-time alerts and dashboards to track system health, performance, and SLA compliance
• Led incident response and root cause analysis (RCA), significantly reducing mean time to recovery (MTTR)
• Developed Python and shell scripts to automate log analysis, operational tasks, and resource optimization
• Migrated on-premises workloads to AWS, improving scalability, reliability, and operational efficiency
• Optimized AWS cloud costs using FinOps strategies, including resource rightsizing and auto-scaling
• Implemented IAM roles and policies following the principle of least privilege to strengthen security posture
• Built and maintained Docker images and deployed workloads into Kubernetes environments
• Managed GitHub repositories, implementing branching strategies, code reviews, and CI/CD integrations
• Designed Grafana dashboards integrated with Prometheus for real-time observability and performance insights
• Supported production systems with 24/7 monitoring, ensuring minimal downtime and quick issue resolution