Harsh Yadav

Summary

A results-driven Data Engineer with 3.6 years of experience in designing, developing, and optimizing end-to-end data pipelines and workflows. Proficient in leveraging Google Cloud Platform (GCP), including Cloud Storage, BigQuery, Dataproc, and Google Cloud Composer (Airflow), to orchestrate and automate data processing tasks. Skilled in data transformation, migration, and integration, with expertise in PySpark, Python, and ETL processes. Proven track record in optimizing data validation frameworks, reducing manual effort by 80%, and developing data profiling tools for enhanced pipeline efficiency.

Overview

4 years of professional experience

Work History

Data Engineer

Cognizant Technology Solutions
07.2021 - Current

Project 1:

  • Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.
  • Refactored on-premises HQL scripts into BigQuery SQL format, applying necessary transformations for compatibility and performance.
  • Utilized Google Cloud Composer (Airflow) to orchestrate and schedule SQL and PySpark jobs, enabling efficient workflow automation.
  • Designed and implemented a Data Validation Framework using Google Cloud Composer (Airflow) to ensure data consistency in target tables, achieving an 80% reduction in manual validation efforts.
  • Developed a Data Profiler Framework using Google Cloud Composer (Airflow) to automate the generation of comprehensive dataset statistics. This framework enabled other teams to assess data quality, identify trends, and efficiently build new pipelines.
  • Contributed to Continuous Integration/Deployment (CI/CD) by creating pull requests and participating in code reviews, deploying pipelines using Jenkins.
  • Followed Agile methodologies for iterative development and tracked efforts using Rally for efficient project management.

Project 2:

  • Led the migration of multiple workflows by designing and implementing end-to-end data pipelines. Utilized Google Cloud Composer for orchestration and integrated Databricks to trigger Spark jobs via the Databricks operator in Airflow DAGs. Delivered final outputs to Google Cloud Storage (GCS) buckets.
  • Streamlined the conversion of HQL scripts to PySpark notebooks during the migration from Hive to Databricks. Ensured seamless integration, functionality, and accuracy through comprehensive unit testing. Enhanced Airflow DAGs with email notifications and file delivery tasks for efficient downstream delivery.
  • Contributed to Continuous Integration/Deployment (CI/CD) pipelines by creating pull requests, conducting code reviews, and deploying workflow notebooks, properties files, Airflow DAGs, and shell scripts into the QEA environment.
  • Demonstrated project ownership by maintaining direct client interactions to provide regular progress updates. Followed Agile methodologies for iterative development and ensured effective effort tracking using Rally.

Education

Bachelor of Technology

University of Petroleum and Energy Studies
Dehradun, India
05.2021

Skills

  • Cloud Platforms: Google Cloud Platform (GCP) – Storage, Composer, BigQuery, Dataproc
  • Data Engineering & Warehousing: Data Warehousing, Data Pipeline Design, Data Integration
  • Orchestration & Workflow Tools: Google Cloud Composer (Airflow)
  • Programming Languages: SQL, Python, PySpark
  • Big Data Technologies: BigQuery, Hive, Spark
  • ETL & Data Processing: PySpark, Dataproc, Airflow
  • Development Methodologies: Agile (Scrum), CI/CD (Jenkins, Git)
