At Gojoko Technologies India Private Limited, I spearheaded data engineering initiatives, optimizing ETL workflows and enhancing data warehouse designs for scalability and performance. Combining Python expertise with close collaboration across cross-functional teams, I significantly improved data quality and governance, kept processing aligned with GDPR requirements, and supported data-driven decision-making.
CASE STUDY: Data Quality and Processing Solution – Python, MySQL, Airflow and AWS (S3, CloudWatch, Secrets Manager, Parameter Store, SNS)
Project: New Blacklist Rules Implementation – Python and MySQL
Project: New Greylist Rules Implementation – Python and MySQL
Project: GDPR Compliance Implementation – MySQL and MS Excel
Project: AWS Glue Pipelines – PySpark, MySQL, Airflow and AWS (S3, Glue, CloudWatch, Secrets Manager, Parameter Store, IAM, SNS)
Project: Airflow DAGs Migration and Custom ETL Framework Development – Python, Docker, AWS (EC2, Parameter Store, Secrets Manager, SNS, IAM)
Project: Query Runner Framework Development – Python, MySQL, Data Modelling, ORMs, Airflow, AWS (EC2, Secrets Manager, Parameter Store, IAM, SNS, CloudWatch)
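To illustrate the query-runner pattern named above, here is a minimal sketch of an Airflow task that pulls MySQL credentials from AWS Secrets Manager and runs a parameterised query. It assumes Airflow 2.x with boto3 and pymysql available; the DAG id, secret name, table and query are hypothetical placeholders, and the sketch only shows the credential lookup and query execution path rather than the framework's actual internals.

```python
# Minimal query-runner sketch, assuming Airflow 2.x, boto3 and pymysql.
# The secret name, table and schedule below are illustrative placeholders.
import json
from datetime import datetime

import boto3
import pymysql
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_query(**_):
    # Fetch MySQL credentials from AWS Secrets Manager (hypothetical secret name).
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="mysql/reporting")
    creds = json.loads(secret["SecretString"])

    # Run a parameterised query and print the row count.
    conn = pymysql.connect(
        host=creds["host"], user=creds["username"],
        password=creds["password"], database=creds["dbname"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM loans WHERE status = %s", ("ACTIVE",))
            print("active loans:", cur.fetchone()[0])
    finally:
        conn.close()


with DAG(
    dag_id="query_runner_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_query", python_callable=run_query)
```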
Project: Data Warehouse (DWH) – PySpark, Databricks (SQL, Workflows, Dashboard), Data Modelling, CI/CD.
· Delivered ongoing requirements and enhancements to the Stage layer, preprocessing and ingesting raw data from diverse sources to keep Extract, Transform, Load (ETL) processes efficient.
· Developed and managed the Vault layer for storing, integrating, and maintaining historical and transactional data using Data Vault 2.0 methodology, enabling scalable and auditable data storage (a minimal Stage-to-Vault load is sketched after this list).
· Optimized the Mart layer to deliver business-specific data models, facilitating OLAP (Online Analytical Processing) for advanced reporting and analytics.
· Collaborated with cross-functional teams including data architects, developers, and business analysts to ensure data consistency, accuracy, and conformance with data governance policies.
· Employed continuous integration and continuous deployment (CI/CD) pipelines for automated testing, deployment, and monitoring, ensuring high-quality deliverables and rapid iterations.
· Utilized PySpark and Databricks for developing and orchestrating data workflows, creating interactive dashboards for real-time data visualization and insights.
· Documented ETL processes, data models, and architectural changes to maintain up-to-date records, supporting project continuity and knowledge transfer.
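A minimal PySpark sketch of the Stage-to-Vault flow described above, in the Data Vault 2.0 style: it assumes a Databricks/PySpark environment, uses hypothetical table and column names (stage.customers, vault.hub_customer, vault.sat_customer), and omits the anti-joins against already-loaded keys that a production load would perform.

```python
# Sketch of a Data Vault 2.0 style Stage -> Vault load on PySpark/Databricks.
# Table and column names are illustrative assumptions, not the project's real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

stage = spark.table("stage.customers")

# Hub: one row per business key, keyed by a deterministic hash.
hub = (
    stage.select("customer_id")
    .dropDuplicates(["customer_id"])
    .withColumn("hub_customer_hk", F.sha2(F.col("customer_id").cast("string"), 256))
    .withColumn("load_dts", F.current_timestamp())
    .withColumn("record_source", F.lit("stage.customers"))
)

# Satellite: descriptive attributes plus a hash diff for change detection.
sat = (
    stage.withColumn("hub_customer_hk", F.sha2(F.col("customer_id").cast("string"), 256))
    .withColumn("hash_diff", F.sha2(F.concat_ws("||", "email", "segment"), 256))
    .withColumn("load_dts", F.current_timestamp())
    .select("hub_customer_hk", "load_dts", "hash_diff", "email", "segment")
)

hub.write.mode("append").saveAsTable("vault.hub_customer")
sat.write.mode("append").saveAsTable("vault.sat_customer")
```

The deterministic hash keys are what keep Vault loads repeatable and auditable across source systems.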
Project: Data Lakehouse [DWH 2.0 Enhancement, Hourly Mart Refresh, Data Quality Improvement] – PySpark, Databricks (SQL, Workflows, Dashboard), Data Modelling, CI/CD.
· Upgraded the Data Warehouse (DWH) to a Data Lakehouse architecture, enabling seamless integration of structured and unstructured data.
· Implemented enhancements to ensure the Mart layer refreshes on an hourly basis, improving the timeliness and accuracy of business intelligence reports.
· Identified and resolved data gaps within the platform, ensuring data integrity and consistency.
· Collaborated with data architects and engineers to design and implement scalable data pipelines for efficient data ingestion and processing.
· Utilized PySpark, Databricks, and other big data technologies to streamline data workflows and optimize performance.
· Monitored and maintained data quality, applying data governance practices to uphold data accuracy and reliability (example checks are sketched after this list).
· Documented data processes, architectural changes, and troubleshooting guides to support ongoing project development and maintenance.
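As an illustration of the data-quality gating applied before the hourly Mart refresh, the following sketch runs two example checks in PySpark; the table names (vault.sat_loan, mart.fct_loans) and the thresholds are assumptions made for the example, not the actual production rules.

```python
# Sketch of a data-quality gate run before an hourly mart refresh on Databricks/PySpark.
# Table names and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

source = spark.table("vault.sat_loan")
target = spark.table("mart.fct_loans")

# Check 1: the mart must not silently lose rows relative to the vault.
src_count, tgt_count = source.count(), target.count()
if tgt_count < src_count * 0.99:
    raise ValueError(f"Row-count drift: vault={src_count}, mart={tgt_count}")

# Check 2: key business columns must not contain nulls.
null_keys = target.filter(F.col("loan_id").isNull()).count()
if null_keys:
    raise ValueError(f"{null_keys} mart rows have a null loan_id")

print("Data-quality checks passed; proceeding with the hourly mart refresh.")
```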