

Senior Data Engineer with nearly eight years of experience designing and delivering scalable
data platforms, ETL pipelines, and analytics systems across cloud environments. Strong
expertise in Python, SQL, PySpark, AWS (S3, Redshift, EMR, EC2), data modeling, and
distributed processing. Proven ability to build high-throughput ingestion pipelines, optimize
warehouse performance, and deliver reliable data products that power APIs and business
analytics. Experienced in end-to-end ownership, production reliability, stakeholder
communication with senior leadership, and mentoring engineers.
Project: Sony Charts – Global Music Charts Data Platform.
• Designed ingestion pipelines that consume data from 10+ third-party partners (including
  Spotify, Apple Music, Amazon, YouTube, Deezer, and KKBOX) via APIs and web sources,
  processing millions of records monthly.
• Built automated ingestion workflows using Python and Pandas, standardizing raw feeds into
  Parquet-based S3 data lake zones.
• Implemented distributed transformations using PySpark on EMR Serverless for schema
  validation, normalization, enrichment, and quality enforcement.
• Enabled analytics consumption through external tables and automated metadata
  crawling, loading curated datasets into Redshift fact and dimension models.
• Developed a multi-stage entity resolution and product matching pipeline using PySpark,
  matching chart tracks against Sony's master catalog using hierarchical matching strategies.
• Improved match accuracy and reduced manual reconciliation by 30–40% through optimized
  matching logic and validation rules.
Programming languages: Python, SQL, Bash
Big data processing: PySpark, Apache Spark, EMR Serverless
Cloud services: AWS (S3, Redshift, EC2, RDS, Load Balancer, Glue)
Data engineering: ETL/ELT, Data modeling (Star/Snowflake), Data lakes
Orchestration and reliability: Scheduling, retries, monitoring
APIs and systems: REST APIs, Authentication, Load balancing
Databases: PostgreSQL, Redshift, MySQL
DevOps tools: Docker, Git/GitHub, CI/CD basics
Analytics tools: Pandas, NumPy, SQL optimization