HEMKUMAR CHHEDA

Ambernath

Summary

Results-driven Airflow Reliability Engineer with over 9 years of IT experience specializing in cloud integration services. Proven expertise in managing data orchestration and integration tools, including Apache Airflow and various AWS services, complemented by certifications as a Kubernetes Administrator and in Apache Airflow fundamentals. Demonstrated ability to lead projects effectively while meeting deadlines, managing multiple tasks, and maintaining strong communication with stakeholders. Proactive team player with a solid understanding of ITIL practices, committed to continuous skill enhancement and adapting to evolving technological landscapes.

Overview

9 years of professional experience
1 certification

Work History

Manager, Airflow Reliability Engineering

Astronomer, Inc.
12.2021 - Current

Team Size: 10

Project description:

  • Astronomer provides Airflow as a managed service to various customers distributed globally.
  • There are two types of product offerings, Astro Cloud and Astronomer Private Cloud.
  • Astro Cloud is a DataOps platform built on top of Apache Airflow, supported on all major cloud providers: AWS, GCP, and Azure.
  • Astronomer Private Cloud helps customers run Airflow as a service on their own internal Kubernetes-managed infrastructure.


Responsibilities:

  • Monitoring over 500 Kubernetes clusters across all three major cloud providers: AWS, GCP, and Azure.
  • Designing and scheduling DAGs to oversee the Astro Cloud infrastructure, automating proactive alerts and informing customers if a DAG fails in production due to underlying infrastructure issues.
  • Assisting customers in solving complex use cases within their data pipelines.
  • Troubleshooting performance and DAG latency issues in Airflow.
  • Supporting customers in migrating their data pipelines to Airflow.
  • Scaling out Airflow instances by tuning configuration environment variables, such as parallelism and worker concurrency, to appropriate values.
  • Recommending the most suitable Airflow executor based on DAG duration and complexity.
  • Installing essential providers and Python packages on Airflow and resolving dependency conflicts.
  • Assisting customers with installing or upgrading their Airflow instances and resolving issues during these activities.
  • Optimizing compute resources at the task level if any tasks fail due to resource constraints.
  • Communicating with customers and Airflow developers regarding the status of reported bugs or issues.
  • Creating KB articles for the CRE team and public-facing knowledge bases related to new Airflow issues or custom implementations.
  • Developing DAGs to monitor and report issues with Airflow deployments, customer data plane Kubernetes clusters, and databases by integrating Airflow with Chronosphere, PagerDuty, Zendesk, Splunk, etc.
  • Leading a team of 10 members in the IST timezone to provide seamless support to Astronomer customers.
  • Creating SMART goals for each team member based on their skills, expertise, and interests.
  • Managing escalations and high-priority incidents raised by customers.
  • Coordinating on-call rotations among team members.
  • Conducting regular interviews to expand the team in response to increased global support demand.
  • Mentoring teams by organizing learning sessions and developing presentations.
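
The monitoring DAGs described above ran health checks across the fleet and opened incidents (e.g., via PagerDuty) when a deployment looked unhealthy. The core triage decision can be sketched in plain Python; all field names and thresholds here are hypothetical, and the actual implementation ran as Airflow tasks:

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds for this sketch, not Astronomer's actual values.
HEARTBEAT_MAX_AGE = timedelta(minutes=5)
FAILED_RUN_LIMIT = 3

def triage(deployments, now):
    """Return names of deployments that should page on-call.

    Each deployment dict carries a last scheduler heartbeat time and a
    count of consecutive failed DAG runs (both hypothetical fields).
    In production, each alert would open a PagerDuty incident.
    """
    alerts = []
    for dep in deployments:
        stale = now - dep["last_heartbeat"] > HEARTBEAT_MAX_AGE
        failing = dep["consecutive_failed_runs"] >= FAILED_RUN_LIMIT
        if stale or failing:
            alerts.append(dep["name"])
    return alerts

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"name": "prod-etl", "last_heartbeat": now - timedelta(minutes=1),
     "consecutive_failed_runs": 0},
    {"name": "prod-ml", "last_heartbeat": now - timedelta(minutes=10),
     "consecutive_failed_runs": 0},
]
print(triage(fleet, now))  # ['prod-ml']
```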

Cloud Application Administrator

ZS Associates, Pune
12.2019 - 12.2021

Team Size: 3

Project description:

  • This project involves installing, upgrading, managing, and providing administrative support to project teams on integration services tools, such as Informatica and Boomi; data orchestration tools, including Apache Airflow by Astronomer and MWAA; and container orchestration platforms, including Kubernetes, EKS, and ECS.
  • Upgrading the Informatica domain and applications from older versions (9.1, 9.5, 9.5.1, 9.6.1) to the latest version (10.2).
  • Upgrading Apache Airflow, the Astronomer platform, and MWAA from earlier versions (1.10.5, 1.10.10) to the latest supported Airflow versions (v2.0, v2.1).
  • Upgrading the Kubernetes cluster from earlier versions (1.16, 1.17) to the latest supported versions (1.18, 1.19, 1.20).
  • Identifying the technologies and environments impacted by the upgrades and communicating with stakeholders and teams.
  • Determining contacts and pre- and post-upgrade steps and checks.
  • Managing the Informatica upgrade and coordinating with support teams for other technologies.
  • Keeping stakeholders informed about the upgrade status.
  • Supporting the client by collaborating with various teams, such as application, DBA, UNIX, and OS teams, and delivering solutions within SLA.


Responsibilities:


Astronomer Airflow + Kubernetes

  • Planning installation and upgrades of the Astronomer platform, MWAA, Apache Airflow, Kubernetes (EKS) clusters, and other dependent services.
  • Performing various upgrades and configuration changes during planned weekend maintenance activities.
  • Creating migration plans and assisting project teams in successfully migrating their DAGs from older to newer versions of Airflow.
  • Onboarding project teams on Astronomer by provisioning an Airflow instance on shared EKS clusters and providing access to requested users.
  • Integrating Bitbucket with Astronomer-managed Apache Airflow to enable versioning of code and CI/CD pipelines for continuous, smooth deployments.
  • Creating replica sets of Airflow components, like the web server, to facilitate Blue-Green deployments.
  • Creating cross-account IAM roles and VPC peering between AWS accounts to integrate various AWS services such as S3, RDS, EMR, Redshift, AWS Secrets Manager, and others with Apache Airflow instances.
  • Onboarding project teams on MWAA by provisioning new MWAA instances for each team using AWS CloudFormation templates, providing user access, and integrating AWS services with MWAA.
  • Integrating MWAA with ECS Fargate and CloudWatch to execute custom OS scripts on ECS Fargate using Airflow.
  • Creating MWAA disaster recovery (DR) plans and performing DR activities using backup and restore mechanisms for DAGs, variables, connections, and secrets in the DR region.
  • Configuring SMTP servers to enable DAG success or failure alerts on Airflow for each project team.
  • Configuring, enabling, and monitoring alerts for each Airflow instance on a cluster using tools like Prometheus, Grafana, AWS CloudWatch, and Splunk.
  • Managing Astronomer platform deployments and Airflow deployments on Kubernetes clusters such as EKS.
  • Creating daily backup scripts to export Helm deployments, variables, and connections of each Airflow instance on a cluster, and uploading them to S3.
  • Developing custom roles to enable DAG-level authorization in Airflow for different operations teams within a project.
  • Managing monthly chargebacks for Airflow instances for each project team.
  • Decommissioning Airflow instances and other AWS services integrated with Airflow.
  • Troubleshooting various performance issues in DAGs created by project teams on Airflow using the Kubernetes executor.
  • Understanding all use cases and assisting project teams in creating data pipelines on Airflow, guiding them to follow best practices for authoring and scheduling DAGs.
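
The daily backup job mentioned above exported each Airflow deployment's Helm values, variables, and connections, then uploaded the dumps to S3. A minimal sketch of the command assembly is below; the deployment name, namespace, and bucket are assumptions, though the `helm`, `airflow`, and `aws` CLI invocations are standard:

```python
# Sketch of the per-deployment backup commands. In production these
# were executed by a scheduled script against every Airflow instance
# on the cluster; only the command construction is shown here.

def backup_commands(deployment, namespace, bucket, date):
    prefix = f"{deployment}-{date}"
    return [
        f"helm get values {deployment} -n {namespace} > {prefix}-values.yaml",
        f"airflow variables export {prefix}-variables.json",
        f"airflow connections export {prefix}-connections.json",
        f"aws s3 cp . s3://{bucket}/{deployment}/{date}/ "
        f"--recursive --exclude '*' --include '{prefix}-*'",
    ]

cmds = backup_commands("prod-etl", "airflow", "airflow-backups", "2021-06-01")
for cmd in cmds:
    print(cmd)
```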


Informatica + Autosys

  • Planning and implementing the installation and upgrade of the Informatica PowerCenter tool from v9.6 to v10.1.
  • Performing upgrades and configuration changes during planned maintenance activities over weekends.
  • Creating migration plans and assisting project teams in successfully migrating their workflows from lower to higher versions of Informatica.
  • Onboarding project teams to Informatica by provisioning an Informatica repository and its associated services on a shared on-premises Linux cluster, and providing access to requested users.
  • Executing Informatica disaster recovery (DR) activities and end-to-end testing by switching all domains, repositories, and services to point to the DR region.
  • Managing monthly chargebacks for Informatica and Autosys for each project team.
  • Creating quarterly reports such as the Royalty report for Informatica and periodic access reviews.
  • Decommissioning Informatica repositories and their associated services.
  • Troubleshooting performance issues faced by project teams on Informatica by analyzing session logs and database sessions for each query.
  • Developing daily backup check scripts to verify the backups of Informatica repositories and domain backups in a cluster, and sending a consolidated report to the support team via email.
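
The backup check described above verified that each expected repository and domain backup file had been refreshed recently before emailing a consolidated report. A minimal sketch of the freshness check (file paths and the 24-hour threshold are assumptions):

```python
import os
from datetime import datetime, timedelta

def backup_report(paths, now, max_age=timedelta(hours=24)):
    """Build the consolidated status report for the support-team email.

    A backup file is OK if it exists and was modified within max_age,
    STALE if older, and MISSING if absent.
    """
    lines = []
    for path in paths:
        if not os.path.exists(path):
            lines.append(f"{'MISSING':<8} {path}")
            continue
        age = now - datetime.fromtimestamp(os.path.getmtime(path))
        status = "OK" if age <= max_age else "STALE"
        lines.append(f"{status:<8} {path}")
    return "\n".join(lines)

# Hypothetical backup locations for illustration.
print(backup_report(["/backups/infa_repo.rep", "/backups/infa_domain.mrep"],
                    datetime.now()))
```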

Senior Systems Engineer

Infosys Limited
12.2016 - 12.2019

Team Size: 10

Project description:

  • This project involves upgrading the Informatica domain and applications from older versions (9.1, 9.5, 9.5.1, 9.6.1) to the latest version (10.2).
  • Identifying the affected technologies and environments during the upgrade process, and communicating with stakeholders and teams.
  • Determining contacts and pre- and post-upgrade steps and checks involved in the process.
  • Managing the Informatica component during the upgrade and coordinating with support teams for other technologies.
  • Keeping stakeholders informed about the upgrade status.
  • Supporting the client by collaborating with various teams, such as application, DBA, UNIX, and OS teams, to deliver correct solutions within SLA.


Responsibilities:

  • Created Domains, Nodes, Repositories, and Integration Services, along with activities like Versioning, Decommissioning, and Restoration.
  • Upgraded Informatica repositories, installed Hotfixes, Utilities, and Patches released by Informatica Corporation.
  • Installed drivers and established connections in Workflow Manager, Developer, and Admin Console.
  • Created and maintained Informatica User profiles and privileges; set up folders, groups, users, and permissions; performed repository administration using Repository Manager and Admin Console.
  • Migrated and created Informatica Repositories, Folders, and Objects across DEV, QA/TEST, Staging, and Production environments.
  • Stopped and started services during outages.
  • Troubleshot various issues, including long-running sessions, repository/domain/Integration Service failures or hangs, and user access problems.
  • Created Informatica Mappings, Sessions, and Workflows to load data from different sources to targets.
  • Uploaded server files using FTP and SFTP.
  • Managed Object Versioning tasks like Check-in, Check-out, Undo-Checkout, viewing history, and setting object status; monitored batch job schedules.
  • Scheduled sessions to perform ETL operations based on business requirements.
  • Utilized Informatica parameter files effectively for defining mapping variables, workflow variables, FTP connections, and relational database connections.
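
A PowerCenter parameter file of the kind mentioned above scopes values to a folder, workflow, and session; the folder, workflow, and variable names below are hypothetical:

```
[ProjectFolder.WF:wf_daily_load.ST:s_m_load_sales]
$$LoadDate=2019-12-01
$DBConnection_Source=Oracle_SRC
$DBConnection_Target=Oracle_TGT
$InputFile1=/data/in/sales_20191201.csv
```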

Education

Bachelor of Engineering Technology - Electronics & Communication

KJSCE, Mumbai
01.2016

Skills

  • Proficient in Apache Airflow management
  • Data transformation expertise with DBT, Cosmos, and Spark
  • Proficient in Kubernetes and Docker
  • Experience with alerting, logging and monitoring systems
  • Experience with Autosys for job management
  • Experience with Informatica and Boomi platforms
  • Database management: RDS/PostgreSQL
  • Data warehouse management: Snowflake, BigQuery, Databricks
  • Experienced in Python and SQL programming

Summary

  • Working at Astronomer as Manager, Airflow Reliability Engineering, since December 2021.
  • Worked at ZS Associates as Cloud Application Administrator in Pune, India, from December 2019 to December 2021.
  • Previously worked at Infosys Limited as Senior Systems Engineer in Pune, India, from December 2016 to December 2019.
  • 9 years of IT experience in the Cloud Integration services domain.
  • Certified Kubernetes Administrator (CKA).
  • Achieved certification in Apache Airflow Fundamentals by Astronomer.
  • Achieved certification in Apache Airflow DAG Authoring by Astronomer.
  • Provided administrative support for data orchestration and integration tools such as Astronomer-managed Apache Airflow (1.10.x, 2.x, 3.x), MWAA (1.10.x, 2.x), Informatica, Boomi platform, and scheduling tools like Autosys, among others.
  • Good exposure to container orchestration tools such as Kubernetes, EKS, and ECS, and to integrating Airflow with various AWS services like S3, RDS (PostgreSQL, MySQL), EMR clusters, and Secrets Manager; also experienced with logging and alerting tools like Prometheus and Splunk.
  • Able to meet deadlines and manage multiple tasks; decisive, with strong leadership qualities; flexible in work schedules; a clear communicator, keen analyst, and team player.
  • Strong understanding of ITIL practices.
  • Ability to work effectively in an independent environment, managing multiple priorities and keeping stakeholders informed about progress and issues.
  • Ability to continuously upgrade skills and maintain flexibility to successfully accomplish targets.

Certification

  • Certified Kubernetes Administrator (CKA)
  • Achieved certification in Apache Airflow Fundamentals by Astronomer.
  • Achieved certification in Apache Airflow DAG Authoring by Astronomer.
  • Acquired training in Container Orchestration tools like Kubernetes, EKS, ECS at ZS Associates, Pune.
  • Acquired training in Informatica Tools like PowerCenter, Developer tool at Infosys Ltd, Pune.

Timeline

Manager, Airflow Reliability Engineering

Astronomer, Inc.
12.2021 - Current

Cloud Application Administrator

ZS Associates, Pune
12.2019 - 12.2021

Senior Systems Engineer

Infosys Limited
12.2016 - 12.2019

Bachelor of Engineering Technology - Electronics & Communication

KJSCE, Mumbai