Dedicated Data Engineer with 3+ years of experience designing and implementing robust data solutions using AWS cloud services and big data technologies. Proven track record of developing scalable data pipelines, performing complex data migrations, and optimizing data processing workflows. Demonstrated ability to turn complex data challenges into efficient, automated solutions using Python, PySpark, and AWS services.
1. Data retention and destruction
Implemented a data deletion strategy for securely managing data and case files more than 10 years old. This initiative supports annual data purges, ensuring compliance with data retention policies and optimizing storage.
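A minimal PySpark sketch of this kind of age-based purge; the S3 paths, table, and column names are illustrative, not the project's actual ones.

```python
# Illustrative age-based purge: keep only records newer than the retention cutoff.
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("annual_data_purge").getOrCreate()

cutoff = date.today() - timedelta(days=10 * 365)  # records older than ~10 years are purged

case_files = spark.read.parquet("s3://example-bucket/case_files/")       # hypothetical path
retained = case_files.filter(F.col("created_date") >= F.lit(cutoff))

# Write the retained rows to a staging location; the old location is then
# removed or swapped in by the surrounding purge job.
retained.write.mode("overwrite").parquet("s3://example-bucket/case_files_retained/")
```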
2. L1 migration
The project involved migrating data from a Netezza database to Amazon Redshift using AWS Glue. Our main responsibility was to ensure the data was compatible with Redshift's structure. To do this, we created a configuration file that defined how data should be transferred, validated it for accuracy, and provided it as an input to the execution team. This configuration helped automate the migration process, ensuring smooth and accurate data movement. Once the configuration was validated, we used Glue to load the data into Redshift, completing the migration efficiently and securely.
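A sketch of what one table's entry in such a migration configuration might look like, together with a basic validation pass; the schema, table, and column names are hypothetical and the exact configuration format used on the project is not shown here.

```python
# Hypothetical shape of one Netezza-to-Redshift table mapping.
MIGRATION_CONFIG = [
    {
        "source_schema": "netezza_prod",
        "source_table": "claims",
        "target_schema": "redshift_l1",
        "target_table": "claims",
        "column_map": {"CLM_ID": "claim_id", "CLM_DT": "claim_date"},
        "target_types": {"claim_id": "BIGINT", "claim_date": "DATE"},
    },
]

REQUIRED_KEYS = {"source_schema", "source_table", "target_schema",
                 "target_table", "column_map", "target_types"}

def validate(config):
    """Check each entry has the required keys and a target type for every mapped column."""
    for entry in config:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('source_table')}: missing keys {missing}")
        untyped = set(entry["column_map"].values()) - entry["target_types"].keys()
        if untyped:
            raise ValueError(f"{entry['source_table']}: no target type for {untyped}")

validate(MIGRATION_CONFIG)
```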
3. Data Services Framework
Our project involved building a data-loading framework to move data between different layers (L1, L2, and L3) within our data pipeline. The goal was to create a flexible, reusable framework that could handle any type of incoming data. This required designing processes that automate data validation, transformation, and loading across all layers, ensuring compatibility and integrity regardless of data type. By developing this robust framework, we enabled consistent, automated data handling, which saves time and reduces errors in the data pipeline.
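A simplified sketch of a generic layer-to-layer load step of this kind, assuming hypothetical S3 layer paths and a per-dataset transform supplied by the caller; it is not the framework's actual API.

```python
# Generic load step: read from one layer, validate, transform, and write to the next.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_services_framework").getOrCreate()

LAYER_PATHS = {  # hypothetical layer locations
    "L1": "s3://example-bucket/l1/",
    "L2": "s3://example-bucket/l2/",
    "L3": "s3://example-bucket/l3/",
}

def load_between_layers(dataset, source_layer, target_layer,
                        transform=None, required_columns=()):
    """Move one dataset between layers with basic validation and an optional transform."""
    df = spark.read.parquet(LAYER_PATHS[source_layer] + dataset)

    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"{dataset}: missing required columns {missing}")

    if transform is not None:        # transform is a DataFrame -> DataFrame function
        df = transform(df)

    df.write.mode("overwrite").parquet(LAYER_PATHS[target_layer] + dataset)

# Example: promote a hypothetical `claims` dataset from L1 to L2.
load_between_layers("claims", "L1", "L2", required_columns=["claim_id"])
```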
4. AI/ML model monitoring and explainability framework
Developed an AI/ML model monitoring framework to track data drift, feature stability, and model performance in production. Implemented Population Stability Index (PSI) checks, feature drift checks, and SHAP explainability to identify key drivers influencing predictions and ensure transparency. Automated monitoring and reporting through Databricks pipelines, enabling continuous tracking, improved governance, and early detection of model drift.
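A minimal PSI sketch for a single numeric feature; the sample data is synthetic and the 0.2 alert threshold is a common convention rather than a project-specific value.

```python
# Population Stability Index: compare a feature's binned distribution at
# training time against its distribution in production.
import numpy as np

def psi(expected, actual, bins=10):
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking the log.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

baseline = np.random.normal(0.0, 1.0, 10_000)    # synthetic training-time values
production = np.random.normal(0.3, 1.0, 10_000)  # synthetic drifted production values
print("PSI:", round(psi(baseline, production), 3))  # values above ~0.2 suggest drift
```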
I hereby declare that the above-written particulars are true and correct to the best of my knowledge and belief.