Senior Site Reliability Engineer and Infrastructure Architect with 12+ years of experience designing and implementing enterprise-scale solutions across Fortune 500 companies, including Oracle Cloud, Paytm, and Cisco. Proven expertise in leading cloud transformation initiatives, managing hybrid data center environments, and building high-performance engineering teams. Architected mission-critical systems across 50+ data centers serving 50M+ global users and maintained 99.99% uptime, while reducing operational costs by 40% through strategic automation and data center consolidation. Deep expertise in AWS, OCI, and Azure cloud platforms, on-premises data center operations, colocation management, and AI-driven infrastructure optimization. Recently pioneered AI-powered incident response automation using machine learning, transforming traditional on-call operations across distributed data center environments and establishing new industry standards for autonomous system reliability.
• Developed AI-powered incident response automation using Python, LangGraph, OpenAI GPT-4, Flask, and Slack SDK, integrating Prometheus monitoring and Splunk SIEM for real-time alert investigation, reducing MTTR by 75% and eliminating manual on-call escalations
• Implemented infrastructure-as-code (IaC) automation using SaltStack configuration management, Jinja2 templating, and YAML state files, reducing deployment cycles by 70% and achieving zero configuration drift across 200+ production servers
• Orchestrated enterprise AWS cloud migration using EC2, RDS, ELB, Auto Scaling Groups, and CloudFormation, implementing immutable infrastructure patterns that improved system reliability by 20% and achieved a 99.99% uptime SLA
• Engineered CI/CD pipeline automation using Jenkins and GitOps workflows, enabling zero-downtime blue-green deployments across on-premises data centers and cloud, reducing release cycles from 72 hours to 7 hours
• Led cloud-native transformation initiative migrating monolithic applications to microservices architecture using Docker, Kubernetes, AWS EKS, and service mesh technologies, reducing operational costs by 25% and improving horizontal scalability
• Implemented CI/CD automation using Jenkins and GitLab for order management microservices, enabling 20+ daily releases and supporting 50,000+ concurrent users during peak shopping events
• Designed scalable cart architecture using AWS ECS, Redis clusters, and MySQL databases, processing 100,000+ orders per hour with a 99.95% success rate and sub-500ms response times
• Built a monitoring platform using Prometheus, Grafana, and ELK stack for e-commerce workflows, achieving 85% proactive issue detection and reducing cart abandonment latency by 40%
• Established multi-region disaster recovery using automated failover and daily backups, maintaining 99.9% availability during high-traffic sale events
• Deployed centralized logging infrastructure using an Elasticsearch cluster, Logstash data processing pipelines, Kibana visualization dashboards, and Filebeat agents across 150+ microservices, achieving near real-time log ingestion with latency of ≤2 seconds
• Administered 200+ Linux servers (RHEL, Ubuntu, CentOS) and enterprise applications using Ansible configuration management, Puppet automation, and Nagios monitoring, maintaining 99.8% system availability across distributed data center environments
• Developed Python and Bash automation scripts for server provisioning, configuration management, and system monitoring using Terraform, Salt, and Jenkins pipelines, reducing manual deployment tasks by 75% and operational overhead by 50%
• Created comprehensive technical documentation using Confluence, GitLab wikis, and Markdown templates for system deployment procedures, troubleshooting runbooks, and incident response workflows, reducing new team member onboarding time from 3 weeks to 1 week
• Maintained 99.99% uptime for Cisco Customer Support Platform and Communications (CSPC) application serving 5,000+ enterprise customers, implementing proactive monitoring using Nagios, Splunk log analysis, and automated failover mechanisms to minimize service disruptions and ensure SLA compliance
• Reduced customer escalations by 25% through proactive incident management using ServiceNow ITSM, Python automation scripts, and real-time application performance monitoring, maintaining a 95% customer satisfaction rating and improving first-call resolution rates
• Managed Customer Access Platform (CAP) support calls and technical incidents, resolving 90% of Severity 1-3 issues within defined SLA timeframes using ITIL best practices, root cause analysis, and cross-functional collaboration with Cisco engineering teams
Red Hat Certified System Administrator (RHCSA)
Red Hat Certified Engineer (RHCE)