Naveen Alakunta

Hyderabad

Summary

Performance-focused Site Reliability and DevOps Engineer with over 3 years of experience automating, optimizing, and ensuring the reliability of large-scale, real-time payment systems at the National Payments Corporation of India (NPCI), and driving application resiliency through Chaos Engineering. Adept at managing critical infrastructure such as NETC, RuPay, NFS, and WCBDC, which handle millions of transactions daily. Skilled in building GitLab CI/CD pipelines and in scripting for log analysis and system integration, with an automation-first approach. Strong hands-on knowledge of Git, GitLab, Jenkins, Docker, Kubernetes, and Terraform. Hands-on expertise in AWS services, Harness Chaos Engineering, SLI/SLO/SLA measurement, and cloud-native monitoring tools. Proven ability to design and deploy high-availability, observable systems in AWS and hybrid cloud environments. Strong foundation in performance engineering, with a growth mindset, strong debugging skills, and a passion for delivering reliable, automated, and scalable DevOps solutions.

Overview

3 years of professional experience
1 Certification

Work History

Senior Associate - Site Reliability Engineer (SRE) | DevOps

NPCI
04.2024 - Current

National Payments Corporation of India (NPCI) - www.npci.org.in

  • Provisioned and managed highly available Kubernetes clusters using HAProxy to support fault-tolerant internal applications.
  • Installed and maintained the Harness application on Kubernetes using Helm, ensuring application resilience and rollback capabilities.
  • Designed and implemented backup and disaster recovery (DR) solutions using Velero and MinIO, validating restore workflows.
  • Automated infrastructure provisioning using Terraform, managing AWS resources including VPCs, subnets, EKS, EC2, IAM, RDS, EFS, and S3.
  • Managed core AWS services across compute (EC2, Lambda, EKS, ECS), storage (S3, EBS, EFS), networking (VPC, Route 53, Elastic Load Balancers, DNS), databases (RDS, DynamoDB), identity and access management (IAM policies and roles), and elasticity (Auto Scaling Groups, load balancing).
  • Worked extensively with Kafka, Redis, KeyDB, and Cassandra to understand distributed data flows and ensure fault-tolerant infrastructure.
  • Collaborated with cross-functional teams to enhance observability, automation, and reliability across UPI services, contributing to a 20% improvement in system stability metrics.

CI/CD, Automation & Tooling

NPCI
06.2022 - 03.2024
  • Maintained CI/CD pipelines with GitLab CI, Jenkins, GitHub Actions, and ArgoCD for efficient Docker builds and Kubernetes deployment.
  • Automated repetitive tasks through Ansible playbooks, ensuring streamlined password rotation and environment setup.
  • Implemented infrastructure updates and monitoring across various platforms to improve operational consistency.
  • Conducted resilience assessments for key financial systems such as RuPay, focusing on fault tolerance.
  • Utilized chaos experimentation techniques to identify weaknesses and enhance system reliability.
  • Simulated real-world failure scenarios on Kubernetes clusters to test system robustness.
  • Managed controlled disruptions to track performance metrics and logs during stress testing.
  • Achieved a ~35% improvement in fault detection efficiency, leading to reduced mean time to recovery.

Cloud and Virtualization

NPCI
06.2022 - 03.2024
  • Provisioned and managed infrastructure as code (IaC) with Terraform to deploy AWS infrastructure across multiple environments, reducing manual errors by approximately 50%.
  • Configured auto-scaling groups and load balancers to achieve 99.9% application availability and ensure fault-tolerant delivery.
  • Designed secure VPC architectures utilizing public/private subnets, NAT gateways, and custom route tables, enhancing network isolation and security posture by around 40%.

Docker, Kubernetes, and Cluster Management

NPCI
06.2022 - 03.2024
  • Designed and deployed highly available Kubernetes clusters on AWS EKS and on-prem VMware.
  • Integrated Prometheus and Grafana to enhance observability, reducing troubleshooting time by approximately 35%.
  • Implemented RBAC, resource quotas, and network policies for effective Kubernetes cluster management.
  • Customized Helm chart structures to improve configuration reusability and decrease deployment time by approximately 30%.
  • Wrote and optimized Dockerfiles for efficient application containerization, enhancing scalability.
  • Configured HAProxy and Ingress-NGINX for high availability and intelligent traffic routing in Kubernetes applications.
  • Developed a two-site backup and migration strategy using MinIO and Velero for disaster recovery.
  • Ensured minimal downtime during site migrations, increasing backup reliability for critical workloads.

Cloud Automation & Scripting

NPCI
06.2022 - 03.2024
  • Managed AWS cloud resources using Terraform, ensuring consistent, scalable infrastructure deployments.
  • Designed modular Terraform code structures for environment-specific configurations.
  • Secured Terraform state management with AWS S3 and DynamoDB to prevent race conditions.
  • Developed Ansible playbooks for automating system configurations and application deployments.
  • Created custom shell scripts to automate server connectivity setup, enhancing onboarding efficiency.
  • Automated log simplification workflows, integrating scripts into CI/CD pipelines with GitLab and Jenkins.
  • Streamlined deployment processes, resulting in faster feedback cycles and improved accuracy.

Education

Bachelor of Technology - Electrical and Electronics Engineering

Samskruthi College of Engineering
Hyderabad
01.2021

Skills

  • Linux and Windows
  • Cloud services (AWS)
  • Containerization (Docker, Kubernetes)
  • Infrastructure as code (Terraform, Ansible)
  • Monitoring tools (Prometheus, Grafana)
  • Continuous integration (Jenkins, GitLab, GitHub Actions)
  • Version control (Git, GitHub)
  • Database management (MariaDB, Redis, KeyDB, Cassandra)

Certification

  • Harness certified Chaos Engineering Developer, Harness, 2024
  • Wizard Winner - NPCI playground Challenge (Advanced Open Source Configuration), 2024

Projects

Harness chaos engineering

  • Executed CPU hog, memory hog, and pod delete scenarios on critical workloads (Kafka, MariaDB, KeyDB) to test pod-level fault resilience and failover.
  • Simulated I/O stress on application mount paths; identified disk pressure issues and improved disk health monitoring strategies.
  • Performed pod- and node-level resource saturation (CPU/memory) to test throttling and rescheduling.
  • Injected network latency and packet corruption to test inter-service reliability, increasing node fault tolerance awareness by 35%.
  • Improved probe recovery time and automated restart validation, boosting overall application resilience by 40% through proactive chaos experimentation and enhanced failover testing.

Kubernetes administration

  • Deployed production-grade Kubernetes clusters on AWS (EKS) and on-prem (RKE), implementing best practices in networking, security, and high availability to support a microservices architecture, CI/CD workflows, and the Prometheus and Grafana setup.
  • Configured Prometheus for metrics scraping and storage, ensuring efficient data collection from microservices; designed custom Grafana dashboards and monitoring alerts with tailored queries for real-time system and container monitoring.
