Pentaho Deduction File Framework (FDRE)
- Spearheaded the design and development of the FDRE framework, automating deduction file generation across 10+ client-specific formats.
- Reduced file creation time by 40–50% and manual effort by 60%, significantly improving delivery speed and scalability of client services.
Scala to PySpark Job Migration
- Led the migration of a legacy Scala-based batch job to a more efficient and maintainable PySpark implementation.
- Improved job performance by 35%, reduced runtime failures by 30%, and increased overall data processing capacity.
- Enhanced debuggability and execution control, resulting in faster issue resolution and improved operational stability.
Tools: PySpark, Elasticsearch, DSL Queries, Greenplum.
Enterprise Data Pipelines
- Designed, developed, and maintained end-to-end data pipelines ingesting and transforming multi-terabyte datasets daily from multiple sources into a centralized data warehouse.
- Maintained 99.9% pipeline reliability while ensuring scalability and high data quality to support downstream analytics and enterprise reporting.
Tools: Kafka, PySpark, Elasticsearch, Greenplum.
Data Dictionary Automation
- Developed an automated data dictionary framework using Sphinx, sourcing metadata from Greenplum for 100+ tables.
- Integrated the solution with EMR and orchestrated deployments using Airflow, delivering a dynamic HTML documentation portal.
- Reduced data consumers' dependency on analysts and cut data discovery time by 25–30%.
Adobe Data Delta Lake Pipeline
- Built a scalable ingestion pipeline to load raw Adobe datasets into a Delta Lake architecture.
- Enabled actionable insights into client visits, login behavior, and order activity, supporting analytics use cases across multiple business teams.
Campaign Data Load Optimization
- Optimized ingestion of a 3.2-billion-record Responsys campaign dataset spread across multiple source tables.
- Leveraged PySpark parallel processing to reduce pipeline load time by 50%, ensuring timely availability of business-critical insights.
Airflow Platform Migration
- Led the upgrade and migration of Apache Airflow to the latest version.
- Migrated 100% of production DAGs, resolving dependency and compatibility issues with zero downtime for business workflows.
SurveyMonkey API Data Integration
- Developed a PySpark-based ingestion framework to extract and process data from SurveyMonkey APIs.
- Applied complex transformations to convert raw API responses into analytics-ready datasets, enabling scalable reporting and analysis.
- Automated ingestion pipelines to improve data freshness and reduce manual intervention.