Aditya Kumar

Hyderabad

Summary

Senior Site Reliability Engineer with 10+ years of experience designing, automating, and optimizing mission-critical infrastructure across multi-cloud environments (AWS, GCP). Proven expertise in Infrastructure as Code (Terraform), Kubernetes (GKE, EKS), and robust CI/CD pipelines, enabling highly available, scalable, and fault-tolerant systems. Skilled in implementing monitoring, observability, and incident response using tools like Prometheus, Grafana, Loki, Splunk, and ELK, significantly reducing downtime and improving reliability. Adept at driving DevOps culture, enhancing developer productivity, and delivering operational excellence through automation and continuous improvement.

Work History

Senior Site Reliability Engineer

Quotient Technology India Private Limited

Bangalore

06.2021 - Current

Experience with Unix/Linux systems, including scripting in Shell, Perl, or Python.
Provide technical leadership on large/complex systems and platform projects.
Build tooling to support the automation, management, and reliability of applicable systems.
Build and/or support release pipelines for applicable systems.
Work as part of a team to continuously evaluate, troubleshoot, and improve existing systems.
Manage the system lifecycle from design and implementation to turn-down and decommissioning.
Write documentation for peers and stakeholders supporting applicable systems.
Work with business partners to define SLOs and SLIs and build robust monitoring solutions supporting agreed-upon metrics.
Lead stakeholder communications on system issues and activities, including blameless post-mortems.
Capable of technical deep dives into code, networking, systems, and storage with very bright, experienced engineers.
Expertise in problem solving and analyzing global scale distributed systems.
Design, deploy, and run logging and monitoring systems such as Splunk, ELK, New Relic, and other APM solutions.
Design, implement, and support high-performance, highly available services and infrastructure.
Improve the efficiency and flexibility of our datacenters.
Build and maintain models for growth and capacity planning.
Deployment, support and monitoring of new platforms and application stacks.
Participate in 24x7 on-call rotation to achieve desired SLAs.

Senior Site Reliability Engineer

McAfee Software India Pvt Ltd

Bangalore

11.2016 - 06.2021

Gathered requirements from clients.
Experience in leading a team of engineers with demonstrated coaching and mentoring skills.
Create, edit, and maintain ad-hoc scripts to resolve issues quickly with minimal user impact.
An understanding of application performance tuning and resource usage.
Identifying, gathering, analyzing, and automating responses to key performance metrics, logs, and alerts.
Engineering long-term solutions that make everyone's work easier.
Familiarity with a Linux or UNIX runtime environment.
Work with the product operations team to resolve trouble tickets, develop and run scripts, and troubleshoot services in a hosted environment.
Working knowledge of virtualized environments; VM management and provisioning.
Provide technical insight on development projects.
Assist with testing and validating production applications.
Working on a team employing standardized project delivery.
Assist in the Development Priority List process, working with the Product Management group to address issues identified as part of Problem Management.
Provide solutions for performance management, disaster recovery, monitoring and access management.
Work with and support business users to understand issues, perform root cause analysis, and work with the team to develop enhancements and fixes.
Work with the team to develop, maintain, and communicate current development schedules, timelines, and development status.
Provide engineering design across different workloads including incident & problem management, change management, security, and compliance.
Improve security and performance of infrastructure by working with other teams.
Environment: EC2, Auto Scaling, Route 53, S3, IAM, RDS, ELK, CloudFormation, CloudWatch, Jenkins, Ansible, Docker, AppDynamics, Moogsoft, SiteScope, shell scripts, VMware.

Cloud Infrastructure Engineer

Tata Consultancy Services

Hyderabad

11.2013 - 10.2017

Configured monitoring tools, including Nagios and Splunk.
Managed AWS EC2 instances utilizing Auto Scaling, Elastic Load Balancing, and Glacier for our QA and UAT environments, as well as infrastructure servers for Git.
Launched virtual machines from AMIs on Elastic Compute Cloud (EC2).
Regularly updated all software and operating systems on the instances running in EC2, to eliminate security loopholes.
Build and release management: Git and Jenkins administration.
Install, configure, and administer web and application servers.
Deploying web applications in non-production and production environments.
Creating and updating the JDBC data source at the application server end.
Handling daily issues related to connections, space, connection pools, Apache, the JVM, and the OS.
Performance tuning of application server services, like JDBC connection pools, execute threads, and JVM parameters.
Configuring cron jobs for executing scheduled tasks and enabling or disabling them as per the requirement.
Managing and monitoring Oracle vendor-based applications using Oracle Enterprise Manager (Fusion Middleware Control).
Integrating application servers with other middleware products, like Apache Webserver.
Configure monitors on SiteScope, respond to alerts, and perform corrective action as necessary.
Website redirection configurations (either temporary or permanent redirection).
Generate and install an SSL certificate to secure the website.
Collaborated with the development team to troubleshoot and resolve issues.
Worked on configuring the Apache Tomcat for Java applications deployment.
Maintained user accounts and groups, setting up user environments in Linux.
Environment: RHEL, RPM, Nagios, Splunk, Git, kernel, Apache, Apache Web Server, Tomcat, RDBMS (MySQL), NoSQL (Cassandra), Apache httpd, JBoss app server 4.x, AWS CodeDeploy, AWS RDS.

Education

Bachelor’s Degree - Electronics and Communications Engineering

JNTUH

Hyderabad

06.2013

Skills

Cloud platforms: AWS, GCP
Containerization and orchestration: Kubernetes, Docker, Helm
Infrastructure as code: Terraform, Ansible
CI/CD and version control: Jenkins, GitHub, Git
Monitoring and observability: Grafana, Prometheus, Datadog
Scripting and automation: Python, Bash
Security and quality assurance: HashiCorp Vault, SonarQube
Networking and systems management: Network administration, virtualization technologies
Core competencies: Root cause analysis, problem solving

Certification

AWS Certified Solutions Architect
