I am a Software Engineer with 5 years of experience in Big Data, specializing in the Hadoop ecosystem, including PySpark, Python, SQL, HDFS, Hive, Impala, Sqoop, and Oozie. My expertise spans key verticals, including BFSI and pharmaceuticals.
Developed a comprehensive data pipeline using PySpark in Azure Databricks to streamline loading, transforming, and managing data. The pipeline loads CSV file data into Delta tables, performs member matching to identify active records, applies Change Data Capture (CDC) techniques, and loads the refined data into SQL Server for downstream analytics and use cases.
Roles and Responsibilities:
Pipeline Development: Designed and implemented a robust data pipeline using PySpark in Azure Databricks to handle large-scale data processing tasks.
Data Ingestion: Loaded CSV file data into Delta tables, ensuring data consistency, reliability, and optimized storage.
Data Transformation: Conducted member matching processes to filter and extract active member records, ensuring data relevance and accuracy.
Change Data Capture (CDC): Applied CDC techniques to Delta tables to efficiently track and manage incremental data changes, maintaining an up-to-date dataset.
Data Integration: Loaded the processed and refined data into SQL Server, enabling seamless integration for downstream analytics and business intelligence use cases.
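The SQL Server load typically goes through Spark's JDBC writer. A minimal sketch, with hypothetical host, database, table, and credential values (in practice credentials would come from a Databricks secret scope); the write call is shown commented out because it needs a reachable SQL Server and the Microsoft JDBC driver jar:

```python
# Hypothetical connection details for the SQL Server sink.
host, port, db = "sqlprod.example.com", 1433, "analytics"
jdbc_url = f"jdbc:sqlserver://{host}:{port};databaseName={db}"

connection_properties = {
    "user": "etl_user",                                        # hypothetical
    "password": "***",                                         # from a secret scope in practice
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",  # requires the MS JDBC driver jar
}

# curated_df.write.jdbc(
#     url=jdbc_url,
#     table="dbo.members_curated",   # hypothetical target table
#     mode="overwrite",
#     properties=connection_properties,
# )
print(jdbc_url)
```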
Optimization and Performance Tuning: Tuned the pipeline to process large datasets efficiently and minimize end-to-end processing time.
Collaboration: Worked closely with data analysts, engineers, and stakeholders to understand requirements and deliver solutions that meet business needs.