Gagan Goswami

DevOps and Site Reliability Engineer
Chandigarh, Punjab

Summary

Experienced DevOps Engineer with hands-on experience architecting, automating, and optimizing complex deployments across a variety of large-scale infrastructure. Serious about high availability and reliability, with strong SRE and observability skills.

Overview

7 years of professional experience
4 years of post-secondary education
6 Certifications

Work History

DevOps & Site Reliability Engineer

nClouds, Inc.
05.2017 - Current

Progressed from DevOps Support Engineer to SRE Manager through roles including Sr. DevOps Support Engineer, Team Lead DevOps Support, Monitoring Product Owner, and SRE, taking on responsibilities across the organization that include, but are not limited to:
Designing and implementing end-to-end CI/CD pipelines using Jenkins, AWS CodeBuild, CodeDeploy, Argo CD, and GitHub Actions to deploy to staging and production environments built on ECS, EKS, Rancher, or standalone EC2.
Participating in planning, building, and performing disaster recovery for highly available and reliable systems.
Creating, maintaining, patching, and securing AWS infrastructure.
Ensuring production system availability to maintain uptime and SLAs, using a variety of tools to monitor, observe, and secure systems.
Working with the production support team to resolve issues and performing RCAs.
Automating repetitive tasks using Lambda and SSM.
Creating and maintaining highly available and reliable infrastructure on AWS using IaC tools like Terraform and CloudFormation.
Managing configuration with Ansible and Chef, including pipelines built with Jenkins, Ansible, Docker Hub, and Kubernetes.
Containerizing applications and migrating them to ECS, Kubernetes, and EKS.
Setting up monitoring and observability using AWS CloudWatch, X-Ray, AWS FireLens, Datadog, New Relic, Prometheus, Grafana, and Splunk.
Establishing SRE culture and implementing alert and incident management across SRE and support teams.
Implementing SLOs and identifying appropriate SLIs based on system design and customer involvement.
Managing teams and tasks, and establishing a continuous learning and improvement environment.

System Administrator

Redduk Studio
03.2015 - 12.2016

Building and maintaining web servers such as LAMP stacks and nginx.

Setting up, maintaining and patching Linux servers.

Maintaining hosting platforms for customers.

Setting up and maintaining DNS records.

Education

Bachelor of Technology - Computer Science

Rayat And Bahra Institute of Engineering
Mohali
08.2013 - 04.2017

Skills

    AWS

Accomplishments

AWS Certifications-

  • 5x AWS Certified

Webinars-

  • How DevOps Teams Use SRE to Innovate Faster with Reliability. (In collaboration with Datadog)
  • Kubernetes on AWS: Observability. (In collaboration with AWS)

Blogs-

  • Getting started with Site Reliability Engineering
  • How to set SLO
  • Best Practices for SRE team
  • How to reduce alert fatigue?
  • Reduce MTTR using integrated runbooks.
  • Onboard 24x7 support to reduce MTTR.
  • Fast-track Incident Management using dashboards.
  • How to reduce alert fatigue by reducing re-occurring incidents?
  • Accelerate incident response using Service Maps.
  • How to get started with SRE?

Certification

AWS Certified Solutions Architect - Associate

Interests

Motorcycling

Trekking

Timeline

Datadog Certified Technical Specialist

12-2020

AWS Certified DevOps Engineer - Professional

12-2019

AWS Certified Developer - Associate

09-2019

AWS Certified SysOps Administrator - Associate

02-2019

AWS Certified Solutions Architect - Professional

01-2018

DevOps & Site Reliability Engineer

nClouds, Inc.
05.2017 - Current

AWS Certified Solutions Architect - Associate

03-2017

System Administrator

Redduk Studio
03.2015 - 12.2016

Bachelor of Technology - Computer Science

Rayat And Bahra Institute of Engineering
08.2013 - 04.2017