USAA Project: CDP Hadoop to Snowflake Migration & Incremental Data Processing
- Migrated historical data from CDP Hadoop Hive tables to Snowflake by converting Hive data to Parquet format, staging the files on DM servers, and loading them with SnowSQL COPY INTO commands (SnowSQL sketch below).
- Developed PySpark scripts to perform checksum-based validation, ensuring data consistency between the Hive source and Snowflake targets (validation sketch below).
- Designed and implemented incremental ETL workflows in DBT for daily data loads from DataStage, MFP, and DL2 layers into Snowflake (incremental model sketch below).
- Built Type 1 and Type 2 Slowly Changing Dimensions (SCD) in DBT to capture history and maintain accurate dimension tables (snapshot sketch below).
- Developed reusable DBT macros for SCD logic and applied them across multiple models to keep transformations consistent and avoid duplicated code (macro sketch below).
- Processed semi-structured JSON data from the DL2 layer and transformed it into the DL3 layer using DBT models for business reporting (flattening sketch below).
- Created and optimized data models in DBT aligned with business requirements, improving transformation accuracy and maintainability.
- Designed and scheduled DBT jobs via Talend batch, integrated with Control-M for automated orchestration and efficient execution.
- Automated Hadoop-to-Box file transfers using Python and shell scripts, ensuring secure and reliable movement of large datasets (transfer sketch below).
Tech Stack: Hadoop, Hive, Unix (Shell Scripting), Python, PySpark, Snowflake, DBT, Talend, Control-M, SnowSQL.
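Illustrative sketch for the historical load: a minimal SnowSQL sequence, assuming an internal stage named @dm_stage and a target table CLAIMS_HIST (both names hypothetical), that stages the exported Parquet files and loads them with COPY INTO.

```sql
-- Hypothetical stage, table, and paths; shown only to illustrate the load pattern.

-- 1) Stage the exported Parquet files from the DM server (run via SnowSQL):
PUT file:///data/exports/claims_hist/*.parquet @dm_stage/claims_hist/ AUTO_COMPRESS=FALSE;

-- 2) Load the staged files into the target table, matching columns by name:
COPY INTO CLAIMS_HIST
  FROM @dm_stage/claims_hist/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = 'ABORT_STATEMENT';
```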
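Sketch for the PySpark validation: row counts plus an order-independent CRC checksum computed on both sides; the table names and Snowflake connection options are placeholders, and the Snowflake Spark connector is assumed to be on the classpath.

```python
# Minimal sketch of the checksum comparison; connection options and table names are illustrative.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive_vs_snowflake_checksum").enableHiveSupport().getOrCreate()

def table_fingerprint(df: DataFrame) -> dict:
    """Row count plus an order-independent checksum over all columns."""
    hashed = df.select(
        F.crc32(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns])).alias("row_crc")
    )
    row = hashed.agg(
        F.count(F.lit(1)).alias("row_count"),
        F.sum(F.col("row_crc").cast("decimal(38,0)")).alias("crc_sum"),
    ).collect()[0]
    return {"row_count": row["row_count"], "crc_sum": row["crc_sum"]}

# Source: Hive table read through the metastore (hypothetical table name).
hive_df = spark.table("dl2.claims_hist")

# Target: same table in Snowflake, read via the Spark connector (placeholder options).
sf_df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .option("sfURL", "<account>.snowflakecomputing.com")
    .option("sfUser", "<user>").option("sfPassword", "<password>")
    .option("sfDatabase", "EDW").option("sfSchema", "DL2").option("sfWarehouse", "LOAD_WH")
    .option("dbtable", "CLAIMS_HIST")
    .load()
)

src, tgt = table_fingerprint(hive_df), table_fingerprint(sf_df)
print(f"source={src} target={tgt} match={src == tgt}")
```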
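Sketch for the daily incremental loads: a minimal DBT incremental model using a merge strategy on Snowflake; the source ('dl2', 'policy_daily'), the unique key, and the columns are illustrative and assume the source is declared in a sources.yml.

```sql
-- Minimal incremental DBT model; source, key, and column names are illustrative.
{{
    config(
        materialized='incremental',
        unique_key='policy_id',
        incremental_strategy='merge'
    )
}}

select
    policy_id,
    policy_status,
    premium_amount,
    load_ts
from {{ source('dl2', 'policy_daily') }}

{% if is_incremental() %}
  -- Only pick up rows that arrived since the last successful run of this model.
  where load_ts > (select max(load_ts) from {{ this }})
{% endif %}
```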
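Sketch for the Type 2 dimensions: a DBT snapshot using the timestamp strategy (Type 1 overwrites follow the same merge pattern shown in the incremental model above); the schema, key, and columns are illustrative.

```sql
-- Minimal Type 2 SCD via a DBT snapshot; names are illustrative.
{% snapshot dim_customer_scd2 %}

{{
    config(
      target_schema='dl3',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='last_modified_ts'
    )
}}

select
    customer_id,
    customer_name,
    address,
    last_modified_ts
from {{ source('dl2', 'customer') }}

{% endsnapshot %}
```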
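Sketch for the reusable SCD macro: a hash-diff helper for change detection across dimension models; the macro name and column list are illustrative, not the exact macro from the project.

```sql
-- Hypothetical reusable macro: builds an MD5 hash over a list of columns for change detection.
{% macro scd_hash_diff(columns) %}
    md5(
        {%- for col in columns -%}
            coalesce(cast({{ col }} as varchar), '')
            {%- if not loop.last %} || '|' || {% endif -%}
        {%- endfor -%}
    )
{% endmacro %}
```

A model would then call it as, for example, {{ scd_hash_diff(['customer_name', 'address']) }} as hash_diff, so every SCD model derives its change indicator the same way.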
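Sketch for the DL2-to-DL3 JSON handling: a DBT model that flattens a Snowflake VARIANT column with LATERAL FLATTEN; the source table, the raw column, and the JSON field names are illustrative.

```sql
-- Minimal DL2-to-DL3 model flattening JSON stored in a VARIANT column; names are illustrative.
{{ config(materialized='table') }}

select
    raw:claimId::string                  as claim_id,
    raw:claimDate::date                  as claim_date,
    item.value:lineAmount::number(18,2)  as line_amount,
    item.value:coverageCode::string      as coverage_code
from {{ source('dl2', 'claims_json') }},
     lateral flatten(input => raw:lineItems) item
```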
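Sketch for the Hadoop-to-Box automation, assuming the boxsdk Python package with a JWT app config file; the HDFS path, staging directory, config path, and folder ID are all placeholders.

```python
# Minimal sketch: pull files off HDFS onto the edge node, then upload them to a Box folder.
import subprocess
from pathlib import Path

from boxsdk import Client, JWTAuth

HDFS_DIR = "/data/dl2/exports/claims"      # hypothetical HDFS source directory
LOCAL_DIR = Path("/tmp/box_staging")       # local staging area on the edge node
BOX_FOLDER_ID = "123456789"                # hypothetical target Box folder

def pull_from_hdfs() -> list:
    """Copy files out of HDFS onto the local edge node."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["hdfs", "dfs", "-get", "-f", f"{HDFS_DIR}/*", str(LOCAL_DIR)], check=True)
    return sorted(LOCAL_DIR.glob("*"))

def upload_to_box(files) -> None:
    """Upload each staged file into the target Box folder."""
    client = Client(JWTAuth.from_settings_file("/etc/box/box_config.json"))  # assumed config path
    folder = client.folder(folder_id=BOX_FOLDER_ID)
    for f in files:
        folder.upload(str(f))
        print(f"uploaded {f.name}")

if __name__ == "__main__":
    upload_to_box(pull_from_hdfs())
```

For very large files, boxsdk's chunked uploader would be the more robust choice than a single upload call.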