Data Governance (Client, Citibank) -
- Maintained and enhanced ETL pipelines integrating data from diverse sources, including flat files, relational databases, Kafka topics, REST APIs, and other external systems.
- Optimized Spark code for large-scale data processing, achieving a 4x improvement in system resilience and a 70% reduction in processing time.
- Enhanced the Scala Load Utility script to support additional source types (SQL Server, Oracle, Postgres) with robust read/write capabilities for seamless data transfer.
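The multi-source support above amounts to dispatching on source type to the right JDBC driver and URL format. A minimal sketch of that dispatch follows; the real utility is in Scala, so this Python helper, its name, and its option map are illustrative assumptions, though the driver classes and URL formats are standard JDBC conventions:

```python
# Hypothetical sketch of multi-source JDBC dispatch; the production
# utility is Scala, and this helper name is an illustration only.
JDBC_SOURCES = {
    "sqlserver": ("com.microsoft.sqlserver.jdbc.SQLServerDriver",
                  "jdbc:sqlserver://{host}:{port};databaseName={db}"),
    "oracle":    ("oracle.jdbc.OracleDriver",
                  "jdbc:oracle:thin:@//{host}:{port}/{db}"),
    "postgres":  ("org.postgresql.Driver",
                  "jdbc:postgresql://{host}:{port}/{db}"),
}

def jdbc_options(source_type: str, host: str, port: int, db: str) -> dict:
    """Build the driver/url option map a JDBC read or write expects."""
    try:
        driver, url_fmt = JDBC_SOURCES[source_type]
    except KeyError:
        raise ValueError(f"unsupported source type: {source_type}")
    return {"driver": driver,
            "url": url_fmt.format(host=host, port=port, db=db)}
```

For example, `jdbc_options("postgres", "dbhost", 5432, "sales")` yields the Postgres driver class and `jdbc:postgresql://dbhost:5432/sales`, which can be passed straight to a Spark JDBC reader or writer.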
- Automated ad-hoc data transfer requests, reducing turnaround time by 80%.
- Created a shell script to periodically archive older Hive partitions, improving data management and optimizing storage.
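The core of that archiving job is computing a retention cutoff and emitting the Hive DDL for partitions past it. A minimal sketch, assuming date-valued partitions and an illustrative partition column `load_date` (the production version is a shell script, and in practice partition data is copied to archive storage before any drop):

```python
from datetime import date, timedelta

def archive_statements(table: str, partitions: list[str],
                       retention_days: int, today: date) -> list[str]:
    """Emit Hive DDL for partitions older than the retention window.

    `partitions` holds yyyy-mm-dd partition values; the table and the
    `load_date` partition column are illustrative assumptions. In the real
    flow, each partition directory is copied to archive storage first.
    """
    cutoff = today - timedelta(days=retention_days)
    stale = [p for p in partitions if date.fromisoformat(p) < cutoff]
    return [
        f'ALTER TABLE {table} DROP IF EXISTS PARTITION (load_date="{p}");'
        for p in sorted(stale)
    ]
```

With a 90-day window and today = 2024-06-30, a 2024-01-01 partition is selected for archiving while a 2024-06-01 partition is retained.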
- Recently began building real-time streaming pipelines using Spark Structured Streaming to process data from Kafka sources.
Data Profiling (Client, Citibank) -
- Developed a Data Profiling tool to automate application onboarding on the ETL pipeline, reducing setup time from 4 days to minutes.
- Automated config generation, validation, and storage, with schema support for RDBMS, Kafka, and file sources.
- Reduced manual effort by 70% and improved onboarding speed by 90%, enabling 100+ applications to be onboarded efficiently.
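The config generation and validation step above can be sketched as producing a per-source JSON config and checking it against the keys each source type requires. The field names below are illustrative assumptions, not the tool's actual schema:

```python
import json

# Required keys per source type; names are illustrative assumptions,
# not the profiling tool's actual schema.
REQUIRED = {
    "rdbms": {"url", "table", "user"},
    "kafka": {"bootstrap_servers", "topic"},
    "file":  {"path", "format"},
}

def generate_config(app: str, source_type: str, **params) -> str:
    """Validate params for the source type; return the onboarding config as JSON."""
    if source_type not in REQUIRED:
        raise ValueError(f"unknown source type: {source_type}")
    missing = REQUIRED[source_type] - params.keys()
    if missing:
        raise ValueError(f"missing keys for {source_type}: {sorted(missing)}")
    return json.dumps({"app": app, "source_type": source_type, **params},
                      sort_keys=True)
```

Failing fast on a malformed config at onboarding time, rather than at pipeline runtime, is what removes the multi-day back-and-forth from setup.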
Proof of Concepts (Xoriant) -
- Crafted a schedu-lo-bot that delivers job-related information to end users, saving significant time otherwise spent on redundant communications.
- Implemented a Change Data Capture (CDC) stream to efficiently transfer data from multiple sources to Kafka, and subsequently to AWS S3 for long-term storage and analysis.
- Incorporated workflow scheduling and management using Airflow. Created and maintained various Directed Acyclic Graphs (DAGs) to automate tasks, resulting in improved efficiency and reduced manual errors.