Experienced Data Engineer skilled in designing and implementing data pipelines using PySpark and Python, with a focus on ETL processes, data transformation, and optimization within the Hadoop ecosystem. Strong foundation in distributed computing and cluster management, with expertise in orchestrating data pipelines in Apache Airflow for efficient scheduling, monitoring, and automation across diverse workflows. Proficient in writing complex SQL queries, optimizing database performance, and ensuring data accuracy and integrity. Experienced with Python packages such as Pandas and PySpark for efficient data manipulation and analysis.
Overview
5 years of professional experience
Work History
Data Engineer
ANZ
03.2024 - Current
Designed and implemented end-to-end data pipelines on Apache Airflow, leveraging dynamically generated DAGs and tasks in Python to enhance automation and workflow efficiency (see the illustrative DAG sketch below).
Generated tax reports for the ANZ Tax Department by developing Spark jobs and Python scripts (Pandas, file operations, etc.) to validate, transform, and aggregate customer data, ensuring compliance with regulatory requirements.
Orchestrated data loading into the Business Landing Zone by extracting data from SQL Server, Oracle databases, and flat files (CSV, XML), then cleansing and standardizing it with Python libraries (Ibis, DuckDB, PySpark); see the DuckDB cleansing sketch below.
Created Term Deposit reports by ingesting raw customer data, applying business logic (e.g., interest calculations), and transforming results into XML formats for downstream business reporting.
Developed and maintained ETL processes using PySpark for large-scale data processing, optimizing performance and scalability.
Conducted a Proof of Concept (POC) using Databricks to replace SAS for data processing and analytics, leveraging Databricks' scalability, collaborative environment, and advanced data processing capabilities to enhance performance, streamline workflows, and reduce dependency on legacy systems.
Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.
Designed and managed Hadoop clusters for distributed data processing and storage, ensuring data availability and reliability.
Created and optimized complex SQL queries for data extraction, transformation, and loading (ETL) operations.
Managed and optimized ETL and CI/CD pipelines, ensuring seamless data processing and automated deployment workflows.
Created a framework to generate mock data with Faker for source systems, reducing dependency on real test data (see the generator sketch below).
Optimized SQL queries and ETL jobs, reducing execution time by 25% through query plan analysis and indexing.
Implemented CI/CD pipelines in Azure DevOps for automated deployment of data workflows, reducing manual errors by 90%.
Integrated DLT Hub with Airflow to automate data ingestion and transformation pipelines, ensuring data consistency across 10+ sources.
Standardized data validation frameworks using PySpark and unit testing libraries, improving data quality verification across critical financial datasets for regulatory and business reporting (see the validation sketch below).
Proven ability to learn quickly and adapt to new situations.
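A minimal sketch of the dynamic DAG generation pattern referenced above: one DAG per configured source, registered in the module namespace so Airflow's scheduler discovers it. The source names, schedule, and extract/load callables are illustrative assumptions, not the actual ANZ configuration.

```python
# Illustrative dynamic DAG generation (assumes Airflow 2.x); source names,
# schedule, and task logic are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SOURCES = ["customers", "accounts", "transactions"]  # hypothetical source tables


def extract(source, **context):
    # Placeholder extract step; a real pipeline would pull from the source system.
    print(f"extracting {source} for {context['ds']}")


def load(source, **context):
    # Placeholder load step into the landing zone.
    print(f"loading {source}")


for source in SOURCES:
    with DAG(
        dag_id=f"ingest_{source}",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(
            task_id="extract", python_callable=extract, op_kwargs={"source": source}
        )
        load_task = PythonOperator(
            task_id="load", python_callable=load, op_kwargs={"source": source}
        )
        extract_task >> load_task

    # Register each generated DAG under a unique module-level name so Airflow finds it.
    globals()[f"ingest_{source}"] = dag
```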
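A sketch of the lightweight cleansing and standardization step mentioned above, using DuckDB over an extracted flat file before loading to the landing zone; the file path and column names are assumptions for illustration only.

```python
# Illustrative DuckDB cleansing of an extracted CSV; the path and columns are hypothetical.
import duckdb

cleansed = duckdb.sql(
    """
    SELECT
        CAST(customer_id AS BIGINT)   AS customer_id,
        TRIM(UPPER(customer_name))    AS customer_name,
        TRY_CAST(opened_date AS DATE) AS opened_date
    FROM read_csv_auto('extracts/customers.csv')
    WHERE customer_id IS NOT NULL
    """
).df()  # hand off as a pandas DataFrame for downstream steps
```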
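A minimal sketch of the Faker-based mock data generator described above; the schema, locale, and row count are illustrative assumptions rather than the real source-system layout.

```python
# Hypothetical mock-data generator using Faker; schema and volumes are illustrative.
import csv

from faker import Faker

fake = Faker("en_AU")
Faker.seed(42)  # reproducible test runs

FIELDS = ["customer_id", "name", "email", "dob", "account_balance"]


def generate_mock_customers(path, rows=1000):
    # Write a CSV of synthetic customer records for pipeline testing.
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for i in range(rows):
            writer.writerow(
                {
                    "customer_id": 100000 + i,
                    "name": fake.name(),
                    "email": fake.email(),
                    "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
                    "account_balance": round(fake.pyfloat(min_value=0, max_value=50000), 2),
                }
            )


if __name__ == "__main__":
    generate_mock_customers("mock_customers.csv")
```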
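A sketch of the kind of reusable PySpark validation rule and accompanying pytest unit test referred to above; the column names and the rule itself are assumed for illustration.

```python
# Illustrative PySpark validation check plus a pytest unit test; columns and rule are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def null_or_negative_balances(df: DataFrame) -> DataFrame:
    # Return rows violating the "balance present and non-negative" rule.
    return df.filter(F.col("balance").isNull() | (F.col("balance") < 0))


def test_null_or_negative_balances():
    spark = (
        SparkSession.builder.master("local[1]").appName("validation-test").getOrCreate()
    )
    df = spark.createDataFrame(
        [("A1", 100.0), ("A2", -5.0), ("A3", None)],
        ["account_id", "balance"],
    )
    bad = null_or_negative_balances(df).collect()
    assert {row.account_id for row in bad} == {"A2", "A3"}
    spark.stop()
```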
Data Engineer
Tata Consultancy Services (TCS)
09.2020 - 03.2024
Designed and implemented data processing workflows using Apache Spark, increasing processing speed and reducing resource utilization.
Developed and maintained documentation for data pipelines, data models, and ETL/ELT processes, ensuring process traceability and reducing documentation errors.
Used Autosys to schedule and monitor batch jobs, achieving consistent on-time job completion and data accuracy.
Utilized Hive to manage and analyze large datasets, developing and optimizing queries to improve query performance and reduce execution time (see the partition-pruning sketch below).
Optimized Spark jobs and PySpark scripts for improved performance, addressing bottlenecks and enhancing overall system efficiency.
Collaborated with stakeholders to define data requirements and design data models, increasing business process efficiency and reducing data-related issues.
Created and optimized complex SQL queries for data extraction, transformation, and loading (ETL) operations.
Proficient and experienced in utilizing Ab Initio tools for data integration and transformation, with a focus on Express>It, Control Center, and Business Rules Environment (BRE) mapping.
Experienced in working with tools such as PuTTY, TeamCity, Zeppelin, XL Release, IntelliJ, Git, Jira, and Confluence.
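A small sketch of the Hive query tuning approach noted above: filtering on the partition column so only the relevant partitions are scanned. The table, partition column, and date are assumptions for the example, not the actual warehouse layout.

```python
# Illustrative Hive query with partition pruning from PySpark; table and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-query-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Filtering on the partition column (business_date) lets Hive/Spark scan only the
# matching partitions instead of the full table.
daily_txns = spark.sql(
    """
    SELECT account_id, SUM(amount) AS total_amount
    FROM warehouse.transactions
    WHERE business_date = '2023-06-30'
    GROUP BY account_id
    """
)
daily_txns.show(10)
```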