Neha Kundra

Summary

Senior Site Reliability Engineer and Infrastructure Architect with 12+ years of experience designing and implementing enterprise-scale solutions across Fortune 500 companies, including Oracle Cloud, Paytm, and Cisco. Proven expertise in leading cloud transformation initiatives, managing hybrid data centre environments, and building high-performance engineering teams. Successfully architected and maintained 99.99% uptime for mission-critical systems across 50+ data centres serving 50M+ global users while reducing operational costs by 40% through strategic automation and data centre consolidation. Deep expertise in AWS/OCI/Azure cloud platforms, on-premises data centre operations, colocation management, and AI-driven infrastructure optimisation. Recently pioneered AI-powered incident response automation using machine learning, transforming traditional on-call operations across distributed data centre environments and establishing new industry standards for autonomous system reliability.

Overview

12 years of professional experience
1 Certification

Work History

Computer Scientist II

Adobe
02.2022 - Current

• Developed AI-powered incident response automation using Python, LangGraph, OpenAI GPT-4, Flask, and Slack SDK, integrating Prometheus monitoring and Splunk SIEM for real-time alert investigation, reducing MTTR by 75% and eliminating manual on-call escalations

• Implemented infrastructure-as-code (IaC) automation using SaltStack configuration management, Jinja2 templating, and YAML state files, reducing deployment cycles by 70% and achieving zero configuration drift across 200+ production servers

• Orchestrated enterprise AWS cloud migration using EC2, RDS, ELB, Auto Scaling Groups, and CloudFormation, implementing immutable infrastructure patterns that improved system reliability by 20% and achieved 99.99% uptime SLA

• Engineered CI/CD pipeline automation using Jenkins and GitOps workflows, enabling zero-downtime blue-green deployments across on-premises data centers and cloud environments, reducing release cycles from 72 hours to 7 hours

• Led cloud-native transformation initiative migrating monolithic applications to microservices architecture using Docker, Kubernetes, AWS EKS, and service mesh technologies, reducing operational costs by 25% and improving horizontal scalability

Site Reliability Developer

Oracle
12.2020 - 02.2022
  • Implemented proactive monitoring and alerting systems using Prometheus and Grafana, reducing Mean Time to Recovery (MTTR) by 40%
  • Developed a Python automation framework using REST APIs and Jinja2 templating that streamlined incident response workflows, configuration management tasks, and deployment processes, improving operational efficiency by 35%
  • Automated the Change Management Request workflow using Python scripting and YAML configuration templates, reducing change request authoring time by 30% and eliminating manual data entry errors
  • Collaborated with cross-functional service teams during major incident war rooms, providing networking domain expertise for multi-service outages, facilitating technical discussions, and ensuring timely customer communication during service disruptions

Sr DevOps Engineer

Paytm
07.2018 - 12.2020

• Implemented CI/CD automation using Jenkins and GitLab for order management microservices, enabling 20+ daily releases and supporting 50,000+ concurrent users during peak shopping events

• Designed scalable cart architecture using AWS ECS, Redis clusters, and MySQL databases, processing 100,000+ orders per hour with 99.95% success rate and sub-500ms response times

• Built a monitoring platform using Prometheus, Grafana, and ELK stack for e-commerce workflows, achieving 85% proactive issue detection and reducing cart abandonment latency by 40%

• Established multi-region disaster recovery using automated failover and daily backups, maintaining 99.9% availability during high-traffic sale events

• Deployed centralized logging infrastructure using Elasticsearch cluster, Logstash data processing pipelines, Kibana visualization dashboards, and Filebeat agents across 150+ microservices, achieving near real-time log ingestion with ≤2 seconds latency

DevOps Engineer

Acquia
01.2017 - 05.2018

• Administered 200+ Linux servers (RHEL, Ubuntu, CentOS) and enterprise applications using Ansible configuration management, Puppet automation, and Nagios monitoring, maintaining 99.8% system availability across distributed data center environments

• Developed Python and Bash automation scripts for server provisioning, configuration management, and system monitoring using Terraform, Salt, and Jenkins pipelines, reducing manual deployment tasks by 75% and operational overhead by 50%

• Created comprehensive technical documentation using Confluence, GitLab wikis, and Markdown templates for system deployment procedures, troubleshooting runbooks, and incident response workflows, reducing new team member onboarding time from 3 weeks to 1 week


Linux System Administrator

Aricent
07.2015 - 12.2016

• Maintained 99.99% uptime for Cisco Customer Support Platform and Communications (CSPC) application serving 5,000+ enterprise customers, implementing proactive monitoring using Nagios, Splunk log analysis, and automated failover mechanisms to minimize service disruptions and ensure SLA compliance

• Reduced customer escalations by 25% through proactive incident management using ServiceNow ITSM, Python automation scripts, and real-time application performance monitoring, maintaining 95% customer satisfaction rating and improving first-call resolution rates

• Managed Customer Access Platform (CAP) support calls and technical incidents, resolving 90% of Severity 1-3 issues within defined SLA timeframes using ITIL best practices, root cause analysis, and cross-functional collaboration with Cisco engineering teams

System Engineer

Tech Mahindra
01.2014 - 02.2016
  • Developed and maintained custom scripts that improved system efficiency by 15% and reduced performance time by 10%
  • Provided day-to-day support to system users, training employees on troubleshooting and problem-solving, which reduced support tickets by 20%
  • Managed and monitored all installed systems, consistently achieving a 99% level of availability and meeting SLA requirements

Education

Bachelor of Technology (B.Tech.) -

Kurukshetra University

Skills

  • Cloud Platforms: AWS (EC2, S3, RDS, Lambda, Glue, EventBridge, CloudWatch)
  • Infrastructure as Code: Terraform, SaltStack, Ansible
  • Monitoring & Observability: Prometheus, Grafana, ELK Stack, Splunk, SLOs, Error Budgets
  • CI/CD & Automation: Jenkins, GitHub Actions
  • Operating Systems: Linux (Red Hat, CentOS, Ubuntu)
  • Networking: TCP/IP, DNS, Load Balancing, SSL/TLS, HTTP/HTTPS
  • Version Control: Git, GitHub, Bitbucket
  • Coding and Automation: Python

Accomplishments

  • Reduced troubleshooting time by 65% through automated heap dump analysis with GenAI
  • Achieved 70% reduction in deployment time through infrastructure-as-code implementation
  • Created custom legacy cluster deployment that reduced release timeline by 50%
  • Decreased operational costs by 25% through cloud-native architecture migration
  • Implemented a central ELK stack for near real-time log analysis across 150 microservices with ≤2 seconds lag

Certification

  • Red Hat Certified System Administrator (RHCSA)

  • Red Hat Certified Engineer (RHCE)
