Results-driven Data Engineer with 2.5 years of overall experience designing and maintaining scalable data pipelines and ETL workflows using Python, SQL, MongoDB, PySpark, AWS, and Snowflake. Expert in processing large-scale structured and unstructured data, ensuring high data quality, consistency, and availability. Proven track record of optimizing data workflows and collaborating with cross-functional teams to deliver business-focused data solutions. Strong communicator with a focus on performance, scalability, and continuous improvement.
Responsibilities:
· Designed and implemented data pipelines in Apache Spark for ingesting and transforming large-scale datasets (reader logs, sales data, reviews).
· Applied the Medallion architecture (Bronze, Silver, Gold) to structure data into raw, cleaned, and analytics-ready layers (see the sketch after this list).
· Extracted text and metadata from PDFs using Apache Tika, integrating unstructured author submissions into the data pipeline.
· Ingested multi-source data: reader logs via APIs, sales transactions from SQL databases, and reviews from MongoDB.
· Optimized data storage and lifecycle management in AWS S3 with policies for cost-efficient archiving.
· Consolidated processed data into Snowflake, enabling cross-source analysis of sales, reviews, and reader engagement.
· Developed SQL queries to generate KPIs such as top-read books, drop-off points, average reading time, and sales–review correlations.
· Built interactive dashboards in Amazon QuickSight to deliver insights on sales performance, reader behavior, and author impact.
· Prepared curated datasets for data scientists in SageMaker to support recommendation systems and predictive analytics.
· Collaborated with the BI team to expose Snowflake data via JDBC connectors, ensuring real-time dashboard updates.
· Improved business decision-making by providing a 360° view of books, readers, and authors, leading to insights on marketing effectiveness and user satisfaction.
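The following is a minimal PySpark sketch of the Bronze/Silver/Gold flow referenced in the bullets above. All bucket paths, column names, and the specific KPI shown are illustrative assumptions, not the actual project configuration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reader-analytics-medallion").getOrCreate()

# Bronze: land raw reader logs exactly as ingested (assumed S3 locations).
raw_logs = spark.read.json("s3://example-bucket/bronze/reader_logs/")
raw_logs.write.mode("append").parquet("s3://example-bucket/bronze/reader_logs_parquet/")

# Silver: clean and standardise - deduplicate, enforce types, drop bad rows.
silver_logs = (
    raw_logs
    .dropDuplicates(["event_id"])                        # assumed unique event key
    .withColumn("event_ts", F.to_timestamp("event_ts"))  # assumed timestamp column
    .filter(F.col("book_id").isNotNull())
)
silver_logs.write.mode("overwrite").parquet("s3://example-bucket/silver/reader_logs/")

# Gold: analytics-ready aggregate, e.g. average reading time per book,
# one of the KPIs mentioned above (assumed metric column).
gold_reading_time = (
    silver_logs
    .groupBy("book_id")
    .agg(F.avg("reading_seconds").alias("avg_reading_time_sec"))
)
gold_reading_time.write.mode("overwrite").parquet("s3://example-bucket/gold/avg_reading_time/")

In this layout each layer is persisted separately, so downstream consumers (Snowflake loads, QuickSight dashboards, SageMaker feature prep) can read from the layer that matches their needs.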
Responsibilities:
• Developed scalable PySpark pipelines to ingest and process large volumes of user activity data from CSV sources.
• Designed and implemented statistical aggregations (mean, min, max, stddev) across time dimensions: day, week, hour, and daytime.
• Generated JSON-formatted metric maps using 'create_map' and 'to_json' for downstream analytics and dashboarding (illustrated in the sketch after this list).
• Joined multiple aggregated DataFrames using composite keys (date, dayname, daytime) to create unified behavioral views.
• Enriched datasets with calendar dimensions (year, month, day) and wrote final outputs in optimized Parquet format.
• Leveraged Spark SQL functions and best practices to ensure efficient joins, aggregations, and performance tuning.
• Contributed to a behavior analysis use case enabling trend insights across messaging patterns at different time granularities.
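Below is a minimal PySpark sketch of the aggregation pattern described in this list: statistical metrics packed into a JSON map with 'create_map' and 'to_json', then joined on the composite key (date, dayname, daytime). The input file name and column names such as msg_count are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user-activity-metrics").getOrCreate()

# Assumed CSV layout with date, dayname, daytime, and a numeric msg_count column.
activity = (
    spark.read.option("header", True).csv("user_activity.csv")
    .withColumn("msg_count", F.col("msg_count").cast("double"))
    .withColumn("date", F.to_date("date"))
)

# Statistical aggregation per composite key, serialised into a JSON metric map.
daily_stats = (
    activity
    .groupBy("date", "dayname", "daytime")
    .agg(
        F.mean("msg_count").alias("mean"),
        F.min("msg_count").alias("min"),
        F.max("msg_count").alias("max"),
        F.stddev("msg_count").alias("stddev"),
    )
    .withColumn(
        "metrics_json",
        F.to_json(F.create_map(
            F.lit("mean"), F.col("mean"),
            F.lit("min"), F.col("min"),
            F.lit("max"), F.col("max"),
            F.lit("stddev"), F.col("stddev"),
        )),
    )
)

# Second aggregate joined back on the composite key to form a unified view,
# enriched with calendar dimensions and written as Parquet.
hourly_totals = (
    activity
    .groupBy("date", "dayname", "daytime")
    .agg(F.sum("msg_count").alias("total_msgs"))
)

unified = (
    daily_stats.join(hourly_totals, on=["date", "dayname", "daytime"], how="inner")
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day", F.dayofmonth("date"))
)

unified.write.mode("overwrite").parquet("output/behavior_metrics/")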
ETL development