Sahil Goswami

Kurukshetra

Summary

Experienced SRE and DevOps Engineer with 8+ years in IT, specializing in AWS, GCP, and Azure cloud, CI/CD, version control, containerization, automation, cloud migration, application security, and monitoring. Proficient in designing and managing DevOps operations in Linux and AWS environments. Skilled in infrastructure development and service deployment using AWS services such as VPC, ECS, S3, EBS, Route 53, DNS, IAM, EC2, CloudWatch, RDS, Security Groups, and ASG.

Hands-on experience integrating critical monitoring tools such as Zabbix, ManageEngine, Grafana, CloudWatch, New Relic, ELK, Splunk, Sumo Logic, and Scalyr. Strong expertise in GitLab Runner pipelines, Docker containerization, and automation, including creating Dockerfiles, managing networks, and deploying applications across multiple environments. Well-versed in SRE best practices, including application onboarding and implementing SLIs, SLOs, and SLAs to ensure system reliability and performance.

Overview

9
years of professional experience

Work History

Lead Observability Engineer

CloudEQ Pvt Ltd
Chandigarh
12.2024 - 01.2025
  • Oversaw the development and delivery of client-focused observability tools by a team of 26 engineers.
  • Conducted pre-sale activities to onboard clients for security and compliance monitoring.
  • Maintained hours tracking records to ensure accurate client documentation.
  • Spearheaded the establishment of the ACC (Application Control Center) to monitor diverse application metrics efficiently.

Site Reliability Engineer-2

Chegg.com
New Delhi
02.2024 - 11.2024
  • Directed multi-region Site Reliability Engineering (SRE) initiatives, overseeing 5,000+ virtual servers and various AWS services such as RDS, ElastiCache, OpenSearch, and Redshift.
  • Managed databases by enforcing MySQL best practices and upgrading RDS MySQL from 5.7 to 8.0.
  • Migrated databases within VPCs by setting up DMS jobs.
  • Developed and maintained reusable Terraform modules.
  • Worked hands-on with GitLab Runner pipelines.
  • Administered error budgets to enhance team performance.
  • Streamlined debug processes across multiple technical stacks to address production challenges.
  • Reduced toil (repetitive manual operational work) by developing a strategy for common issues and automating it with Bash scripts.
  • Adopted and managed chaos engineering using the Gremlin tool.
  • Led incident response efforts to ensure immediate resolution, conducted root cause analysis (RCA), and implemented preventive measures to avoid future occurrences.
  • Maintained Security Group rules and IAM role access.
  • Fixed live-site infrastructure issues, responded to incidents, routed them to the responsible teams, and followed up until resolution.
  • Built a monitoring framework to monitor logs, performance, and health for applications hosted on the Beacon platform using DevOps tools and custom scripts.
  • Managed and integrated various monitoring tools, such as Zabbix, Grafana, New Relic, and AWS CloudWatch.
  • Led scrum meetings and sprint planning sessions to allocate tasks effectively across the SRE team.

Senior Site Reliability Engineer-2

MakeMyTrip
Gurugram
04.2016 - 02.2024
  • Created, managed, and improved cloud infrastructure and services, including VPC, ELB, NAT Gateways, EC2, EBS, EKS, S3, CloudWatch, Route 53, IAM, and Auto Scaling.
  • Worked in Linux environments, including service installation and user management; experienced with configuration management, monitoring and troubleshooting tools, and maintenance of cloud and DevOps environments with continuous integration and continuous delivery (CI/CD) processes, following Agile methodology and the Software Development Life Cycle (SDLC).
  • Defined SLAs, SLIs, and SLOs with the Dev team and maintained metrics to measure each of them from an SRE perspective.
  • Expertise in Docker, Kubernetes, Shell Scripting, and monitoring tools (Grafana, Zabbix, CloudWatch).
  • Provisioned hundreds of microservices in the production environment (ECS, EKS).
  • Optimized Docker images, reducing build size by almost 80%.
  • Created Bash scripts to automate various daily reports.
  • Fixed live-site infrastructure issues, responded to incidents, routed them to the responsible teams, and followed up until resolution.
  • Built a monitoring framework to monitor logs, performance, and health for applications hosted on the Beacon platform using DevOps tools and custom scripts.
  • Managed and integrated various monitoring tools, such as Zabbix, Grafana, New Relic, and AWS CloudWatch.
  • Managed Kubernetes clusters and maintained pods.
  • Implemented cost optimization strategies to reduce and keep overall infrastructure costs at an optimal level.
  • Managed escalation processes and the team roster.
  • Coordinated with cross-functional teams to understand requirements and help build a robust monitoring setup around each project.
  • Provided 24x7 support for live-site issues as part of a rotational on-call team.
  • Tracked live-site incidents through the incident management tool and handled monthly incident management review reports and daily cost reports.

Education

Bachelor of Technology - Computer Science and Engineering

01.2015

12th - PCM (Non-Medical)

Tagore Public School
Pehowa

Skills

  • Cloud infrastructure (AWS, GCP, Azure)
  • Scripting (Bash, Python)
  • Container orchestration (Docker, ECS, EKS)
  • Incident Management
  • Operations Management
  • Cost Management
  • Jenkins, GitLab
  • New Relic, Datadog
  • Zabbix
  • Grafana
  • Sumo Logic, Scalyr
  • CloudWatch
  • Splunk
  • ELK
  • JIRA, Confluence
  • Ansible
  • Terraform
  • Scrum facilitation
  • Cross-functional collaboration
  • Monitoring tools integration

Timeline

Lead Observability Engineer

CloudEQ Pvt Ltd
12.2024 - 01.2025

Site Reliability Engineer-2

Chegg.com
02.2024 - 11.2024

Senior Site Reliability Engineer-2

MakeMyTrip
04.2016 - 02.2024

Bachelor of Technology - Computer Science and Engineering

12th - PCM (Non-Medical)

Tagore Public School