Anshuman Gupta

Gurgaon

Summary

Results-driven Data Engineer with more than 6.5 years of industry experience, specializing in building scalable data pipelines on GCP. Proficient in data ingestion, validation, and transformation for retail analytics using BigQuery, Spark (Scala/Python), and Airflow. Hands-on expertise in Docker, Kubernetes (GKE), and CI/CD tools such as TeamCity and Octopus.

Overview

6.5+ years of professional experience

Work History

Senior Engineer

Dunnhumby

10.2021 - Current

For Dunnhumby's science products, Datahub provides a unified data ingestion approach.
Developed a scalable data platform for delivering Dunnhumby’s science products by integrating data from multiple retail sources into Google BigQuery.
Datahub provides inbound file validations, data quality assurance and data aggregation/transformation required to deliver science products.
Developed Spark jobs in Scala and Python to validate incoming files and convert them into the required format.
Built BigQuery-based retail data warehouse for domains including sales, promotions, customer behavior, and inventory analytics.
Built interactive dashboards in Looker Studio to track plan performance (unit sales, profit, margins) at week, zone, and store level, and to analyze historical price trends that guide pricing and promotion strategies. The dashboards retain up to two years of data.
Experience with Docker containerization and with deploying Spark jobs on a Kubernetes cluster through Airflow.
Experience in scheduling and managing complex data flows through Airflow.
Experience with GCP services such as Cloud Storage buckets, GKE clusters, and Compute Engine instances.
Experience in continuous integration and deployment with TeamCity and Octopus.
Used GCP services including Cloud Storage (GCS), GKE, Compute Engine, and BigQuery for end-to-end pipeline deployment and hosting.
Performed BigQuery cost optimization by tuning SQL, leveraging partitioning/clustering, and monitoring slot usage.

Big Data Developer

Infosys Ltd.

01.2019 - 10.2021

The project predicted risk for patients classified as high-risk so that the insurance company could avoid losses.
Developed and executed incremental Sqoop jobs to import data from MySQL into HDFS.
Developed Spark code in Scala using Spark SQL and the DataFrame API, applying performance tuning and optimization techniques.
Contributed to the continuous improvement of the big data infrastructure by monitoring system performance, identifying bottlenecks and making necessary optimizations.
Created a performance-tuned HBase layer to enable fast reporting.

Education

B.Tech - Mechanical Engineering

Pranveer Singh Institute Of Technology

Kanpur, India

06-2017

Skills

Programming Languages: Scala, Python
Databases: Cloud SQL, PostgreSQL
Hadoop Ecosystem: Apache Spark, HDFS, YARN

Containerization and Orchestration Technologies: Docker, Kubernetes, Airflow
CI/CD Tools: TeamCity, Octopus
Cloud Services (GCP): GKE, BigQuery, Cloud Storage, Compute Engine

Accomplishments

Enhanced Data Processing - Improved data processing efficiency by 35% using Apache Spark.
Scalable Deployments - Deployed 50+ Spark jobs using Docker and GKE.
Data Ingestion Optimization - Developed 30% faster data ingestion workflows.
Data Quality Assurance - Led a team that achieved a 25% reduction in data errors.
