I am a Senior Data Engineer with 5.5 years of experience in ETL/ELT and in building Big Data systems that power unified analytics platforms for both batch and streaming workloads. My expertise includes Python, PySpark, Spark, AWS, SQL, Kubernetes, and Docker, with hands-on experience in Databricks, Azure, HDFS, Hadoop, and Hive. I specialize in developing efficient data pipelines and processing large-scale datasets.
I have played key roles in projects such as Gupshup, where I contributed to analytics development, and Purple Finance, where I managed ETL processes. My work centers on automation, optimization, and delivering data-driven insights.
Description: Automated ETL pipeline that extracts data from MySQL (LMS, LOS), transforms it, and loads it into Amazon S3 and PostgreSQL. Implemented data validation and materialized views for BI tools, and integrated Kafka for real-time data processing (see the sketch after the tech stack below).
Tech Stack: AWS Glue, Amazon S3, PostgreSQL, Kafka, MySQL.
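For illustration, a minimal PySpark sketch of the extract-and-load flow described above. The hosts, database and table names, and the S3 bucket are hypothetical placeholders, and the MySQL/PostgreSQL JDBC drivers are assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("purple-finance-etl-sketch")
    .getOrCreate()
)

# Pull a source table from the LMS MySQL instance over JDBC.
loans = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://lms-host:3306/lms")   # placeholder host/database
    .option("dbtable", "loan_applications")            # placeholder table
    .option("user", "etl_user")
    .option("password", "****")
    .load()
)

# Basic validation before anything is loaded downstream.
clean = (
    loans.dropDuplicates(["application_id"])
    .filter(F.col("application_id").isNotNull())
    .withColumn("load_date", F.current_date())
)

# Land a partitioned copy in S3 for the data lake ...
clean.write.mode("append").partitionBy("load_date").parquet(
    "s3a://purple-finance-raw/loan_applications/"
)

# ... and push the curated rows into PostgreSQL for BI consumption.
(
    clean.write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")  # placeholder
    .option("dbtable", "curated.loan_applications")
    .option("user", "etl_user")
    .option("password", "****")
    .mode("append")
    .save()
)
```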
Description:
Led the development and optimization of data pipelines and materialized views within Gupshup’s analytics team, providing an analytics platform for batch and streaming data. Focused on SQL performance tuning in PostgreSQL and created custom scripts for system monitoring. Contributed to establishing a scalable data warehouse using Amazon Redshift and Redshift Spectrum, facilitating seamless querying of S3-stored data. Utilized Kubernetes for managing containerized applications, ensuring efficient deployment and scaling. Integrated multiple AWS services for real-time data streaming and processing.
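As a concrete illustration of the materialized-view work mentioned above, the sketch below creates and refreshes a pre-aggregated view in PostgreSQL; the connection details, the raw_events table, and the column names are hypothetical placeholders.

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="analytics-db.example.com",
    dbname="analytics",
    user="etl_user",
    password="****",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Pre-aggregate raw events once, instead of scanning them per dashboard query.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_channel_stats AS
        SELECT date_trunc('day', event_time) AS event_day,
               channel,
               count(*) AS events
        FROM raw_events
        GROUP BY 1, 2
        WITH DATA;
    """)
    # A unique index lets us use REFRESH ... CONCURRENTLY, so readers are not blocked.
    cur.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS mv_daily_channel_stats_uq
        ON mv_daily_channel_stats (event_day, channel);
    """)
    # Typically run on a schedule after each batch load.
    cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_channel_stats;")

conn.close()
```

The unique index is what allows REFRESH ... CONCURRENTLY, so BI dashboards can keep reading the old contents while the view is rebuilt.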
Responsibilities:
• Architected and optimized ETL pipelines using PostgreSQL, Flink, PySpark, and Python to improve data processing efficiency for batch and streaming workloads.
• Developed a robust data warehouse infrastructure on Amazon Redshift, leveraging Redshift Spectrum for handling large datasets stored in S3, supporting analytics needs.
• Implemented custom monitoring solutions for pipeline reliability, ensuring high availability and performance.
• Leveraged Kubernetes for the automation of deployment, scaling, and management of containerized applications, enhancing operational workflows.
• Managed real-time data ingestion and processing with AWS Kinesis, ensuring timely and efficient data availability for analytics (a producer sketch follows the tech stack below).
• Collaborated with the DevOps team to streamline automation tasks using AWS Glue, EKS, and other AWS services, enhancing the overall efficiency of the data platform.
• Worked on integrating both batch and streaming data to support various analytics use cases, providing actionable insights to business stakeholders.
• Contributed to designing and developing data models, materialized views, and dashboards to present data insights clearly.
Tech Stack: PostgreSQL, Flink, PySpark, Python, AWS S3, AWS Redshift, AWS ECR, AWS EKS, AWS MSK, AWS Kinesis, AWS IAM, AWS Glue.
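A small sketch of real-time ingestion into Kinesis along the lines described above; the stream name, region, and event shape are illustrative assumptions rather than the production configuration.

```python
import json

import boto3

# Client for the Kinesis data stream; the region is an assumption.
kinesis = boto3.client("kinesis", region_name="ap-south-1")


def publish_event(event: dict) -> None:
    """Push one analytics event onto the stream, keyed by account for ordering."""
    kinesis.put_record(
        StreamName="analytics-events",            # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["account_id"]),    # events for one account stay in order
    )


if __name__ == "__main__":
    publish_event({"account_id": 42, "type": "message_sent", "channel": "sms"})
```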
Description:
The Harappa application stores its data in MongoDB Atlas; this data needs cleaning and filtering. The client's requirement was to transform the data and store it in RDS using a star schema.
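A rough sketch of what such a star schema can look like, assuming a PostgreSQL-backed RDS instance; the table and column names are hypothetical.

```python
import psycopg2

# Dimension tables hold descriptive attributes; the fact table records events
# and references the dimensions via surrogate keys.
DDL = """
CREATE TABLE IF NOT EXISTS dim_user (
    user_key   SERIAL PRIMARY KEY,
    user_id    TEXT UNIQUE NOT NULL,   -- natural key carried over from MongoDB
    name       TEXT,
    cohort     TEXT
);

CREATE TABLE IF NOT EXISTS dim_course (
    course_key SERIAL PRIMARY KEY,
    course_id  TEXT UNIQUE NOT NULL,
    title      TEXT
);

CREATE TABLE IF NOT EXISTS fact_course_activity (
    activity_id   BIGSERIAL PRIMARY KEY,
    user_key      INT REFERENCES dim_user (user_key),
    course_key    INT REFERENCES dim_course (course_key),
    event_time    TIMESTAMP NOT NULL,
    time_spent_s  INT
);
"""

with psycopg2.connect(
    host="harappa-rds.example.com",  # placeholder endpoint
    dbname="warehouse", user="etl_user", password="****",
) as conn, conn.cursor() as cur:
    cur.execute(DDL)
```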
Responsibilities:
• Developed a data pipeline to extract data from MongoDB Atlas and load it into RDS in a star schema format.
• Implemented a trigger in MongoDB Atlas to incrementally load data to S3.
• Created a Python script to load historical data from MongoDB Atlas to S3.
• Used AWS Glue to transfer data from S3 to RDS, configuring crawlers, jobs, and workflows.
◦ Developed a PySpark job to clean and load data from S3 to RDS (sketched at the end of this list).
◦ Set up a daily workflow for automated data processing.
• Designed RDS tables using a star schema (fact and dimension tables).
• Created SQL triggers in RDS to automate data updates in the main table.
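For illustration, a sketch of a PySpark job along the lines of the one above (runnable standalone or as an AWS Glue job). The S3 path, field names, and staging table are assumptions, and the MongoDB _id field is assumed to be exported as a plain string.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("harappa-s3-to-rds-sketch").getOrCreate()

# Incremental exports written to S3 by the MongoDB Atlas trigger.
raw = spark.read.json("s3a://harappa-exports/activity/")   # placeholder path

# Deduplicate, drop incomplete records, and normalise the fields we keep.
clean = (
    raw.dropDuplicates(["_id"])                  # assumes _id is exported as a string
    .filter(F.col("userId").isNotNull())
    .select(
        F.col("_id").alias("activity_id"),
        F.col("userId").alias("user_id"),
        F.col("courseId").alias("course_id"),
        F.to_timestamp("eventTime").alias("event_time"),
    )
)

# Load into a staging table in RDS; SQL triggers in the database then move
# the rows into the fact and dimension tables.
(
    clean.write.format("jdbc")
    .option("url", "jdbc:postgresql://harappa-rds.example.com:5432/warehouse")  # placeholder
    .option("dbtable", "staging.course_activity")
    .option("user", "etl_user")
    .option("password", "****")
    .mode("append")
    .save()
)
```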
Skills: Python programming