Google Cloud Certified Professional Data Engineer with over 5 years of hands-on experience in architecting and optimizing scalable data pipelines on the Google Cloud Platform. Proficient in Python, SQL, PySpark, and Big Data technologies such as Apache Spark and Hive. Adept at leveraging GCP services such as BigQuery, Dataproc, Cloud Composer (Airflow), Dataform, and Google Cloud Storage to build end-to-end data workflows that are robust, cost-efficient, and performance-tuned.
Skilled in transforming raw data into actionable insights through rigorous data validation, ETL optimization, and workflow automation. Proven ability to ensure data integrity, enhance pipeline reliability, and improve system performance. Experienced in collaborating with cross-functional teams to align technical solutions with business goals and drive data-driven decision-making across the organization.
Automated Data Workflows – Used Apache Airflow to schedule and manage ingestion, transformation, and aggregation jobs, ensuring timely data availability.
Performed data validation and quality checks – Implemented null checks, row counts, and validation rules to detect and handle data issues early.
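A minimal sketch of this kind of validation step in PySpark; the input path, column names, and checks are hypothetical placeholders, not the project's actual values.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data_quality_checks").getOrCreate()

# Hypothetical input path and required columns, for illustration only.
df = spark.read.parquet("gs://example-bucket/raw/orders/")
required_cols = ["order_id", "customer_id", "order_ts"]

# Row-count check: fail fast on an empty batch.
row_count = df.count()
if row_count == 0:
    raise ValueError("Validation failed: batch is empty")

# Null checks on key columns, computed in a single pass over the data.
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required_cols]
).collect()[0].asDict()

bad_cols = {c: n for c, n in null_counts.items() if n and n > 0}
if bad_cols:
    raise ValueError(f"Validation failed: nulls found in {bad_cols}")

print(f"Validation passed: {row_count} rows, no nulls in {required_cols}")
```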
Built Spark jobs for data transformation – Processed raw data in Dataproc using Spark, optimizing performance with repartitioning, broadcasting, and caching.
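A minimal sketch of such a Dataproc/Spark transformation, showing repartitioning, a broadcast join, and caching; the paths, join key, and partition count are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("transform_orders").getOrCreate()

# Hypothetical source paths; "stores" stands in for a small dimension table.
orders = spark.read.parquet("gs://example-bucket/raw/orders/")
stores = spark.read.parquet("gs://example-bucket/dim/stores/")

# Repartition the large fact table on the join key to balance the shuffle.
orders = orders.repartition(200, "store_id")

# Broadcast the small dimension table to avoid a shuffle join, and cache the
# enriched result because it feeds more than one downstream aggregation.
enriched = orders.join(broadcast(stores), "store_id").cache()

daily_sales = (
    enriched
    .groupBy("store_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("order_count"))
)

daily_sales.write.mode("overwrite").parquet("gs://example-bucket/curated/daily_sales/")
```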
Executed large-scale data processing – Performed joins, aggregations, and transformations at scale while fine-tuning Spark configurations for efficiency.
Data Storage & Querying – Leveraged BigQuery partitioning, clustering, and query optimization for cost-effective and faster analytics.
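A minimal sketch of the partitioning and clustering approach using the google-cloud-bigquery client; the project, dataset, table, and columns are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table, partitioned by date and clustered on a common filter column
# so that queries filtering on order_date and store_id scan less data.
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.analytics.daily_sales`
(
  order_date   DATE,
  store_id     STRING,
  total_amount NUMERIC,
  order_count  INT64
)
PARTITION BY order_date
CLUSTER BY store_id
"""
client.query(ddl).result()

# Filtering on the partition column prunes partitions and lowers the bytes scanned.
query = """
SELECT store_id, SUM(total_amount) AS revenue
FROM `my-project.analytics.daily_sales`
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY store_id
"""
for row in client.query(query).result():
    print(row.store_id, row.revenue)
```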
Airflow DAGs for orchestration – Managed dependencies across ingestion, transformation, and aggregation stages for seamless execution.
Monitoring & Alerts – Set up GCP Cloud Monitoring to track pipeline performance, detect failures, and trigger alerts.
Enhanced Pipeline Reliability – Integrated retry mechanisms, error handling, and logging in Airflow for robust execution and troubleshooting.
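A minimal sketch of an Airflow DAG combining the orchestration and reliability practices above (scheduled stages, explicit dependencies, retries, and failure alerts); the DAG ID, schedule, callables, and alert address are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**context):
    """Placeholder for the real ingestion logic."""


def transform(**context):
    """Placeholder for the real transformation logic."""


def aggregate(**context):
    """Placeholder for the real aggregation logic."""


default_args = {
    "owner": "data-eng",
    "retries": 3,                           # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,               # alert on task failure
    "email": ["data-alerts@example.com"],   # hypothetical alert address
}

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",          # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    aggregate_task = PythonOperator(task_id="aggregate", python_callable=aggregate)

    # Explicit dependencies: ingestion -> transformation -> aggregation.
    ingest_task >> transform_task >> aggregate_task
```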
Optimized resource allocation – Leveraged cost-effective compute options and fine-tuned configurations to improve performance while keeping resource usage efficient.
Ensured Data Integrity & Consistency – Validated data at each stage using checksums, row counts, and sample comparisons while reconciling source and target data.
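A minimal sketch of a source-to-target reconciliation using row counts and an order-independent checksum computed in BigQuery; the table names and key column are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical staging (source) and final (target) tables and business key.
SOURCE = "`my-project.staging.orders`"
TARGET = "`my-project.analytics.orders`"


def table_stats(table: str) -> dict:
    """Row count plus an order-independent checksum over the business key."""
    sql = f"""
        SELECT COUNT(*) AS row_count,
               BIT_XOR(FARM_FINGERPRINT(CAST(order_id AS STRING))) AS key_checksum
        FROM {table}
    """
    row = next(iter(client.query(sql).result()))
    return dict(row.items())


source_stats = table_stats(SOURCE)
target_stats = table_stats(TARGET)

if source_stats != target_stats:
    raise ValueError(f"Reconciliation failed: source={source_stats}, target={target_stats}")
print("Source and target match:", target_stats)
```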
Cost-Effective Strategies – Compressed data in GCS, optimized BigQuery queries, and monitored resource usage to reduce costs.
Collaborated with Stakeholders – Defined data requirements, shared pipeline insights, and documented designs for easier maintenance and onboarding.
Investigated and resolved pipeline issues – Analyzed logs, used performance monitoring tools to identify bottlenecks, troubleshot failures, and implemented optimizations.
Project: Noom
Built end-to-end data pipelines – Developed STS jobs to extract batch data into GCS and load it onward into BigQuery via scheduled Dataform load jobs.
Automated Orchestration – Created individual Dataform DAGs per table and implemented unit testing to validate pipeline integrity.
Python-Based Utilities – Designed Python scripts to dynamically extract source metadata and automate batch data ingestion into BigQuery.
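A minimal sketch of such a Python-driven batch load into BigQuery; in practice the table-to-URI mapping would be derived from source metadata rather than hard-coded, and every name here is a hypothetical placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table-to-URI mapping; in practice this would be built dynamically
# from source metadata instead of being hard-coded.
SOURCE_TABLES = {
    "orders": "gs://example-landing/orders/*.parquet",
    "customers": "gs://example-landing/customers/*.parquet",
}

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

for table_name, gcs_uri in SOURCE_TABLES.items():
    destination = f"my-project.staging.{table_name}"
    load_job = client.load_table_from_uri(gcs_uri, destination, job_config=job_config)
    load_job.result()  # block until the load finishes and surface any errors
    print(f"Loaded {gcs_uri} into {destination}")
```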
Streamlined Batch Processing – Ensured reliable scheduling of batch data flows, improving pipeline maintainability and performance.
Project: Linfox
Efficient Data Querying – Crafted complex Oracle PL/SQL queries to fetch, update, and manage logistics data with optimal performance.
Unit Testing for Reliability – Designed and executed test cases to validate SQL logic, ensuring accuracy before deployment.
Pattern-Based Extraction – Developed regular expressions for precise validation and extraction of structured data.
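A minimal Python sketch of this kind of pattern-based validation and extraction; the reference format and field names are hypothetical examples, not the project's actual data.

```python
import re

# Hypothetical format for a logistics reference: carrier code, origin, destination,
# and a six-digit sequence number, e.g. "LFX-MEL-SYD-000123".
PATTERN = re.compile(
    r"^(?P<carrier>[A-Z]{3})-(?P<origin>[A-Z]{3})-(?P<dest>[A-Z]{3})-(?P<seq>\d{6})$"
)


def parse_reference(ref: str) -> dict:
    """Validate a reference string and extract its components, or raise if malformed."""
    match = PATTERN.match(ref.strip())
    if not match:
        raise ValueError(f"Invalid reference: {ref!r}")
    return match.groupdict()


print(parse_reference("LFX-MEL-SYD-000123"))
# {'carrier': 'LFX', 'origin': 'MEL', 'dest': 'SYD', 'seq': '000123'}
```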
Process Documentation – Documented code logic and workflows for smoother team collaboration and easier maintenance.
Database Object Management – Maintained tables, views, and indexes to enhance query performance and data consistency.
Version Control with Git – Tracked code changes, collaborated with team members, and maintained project history effectively.
Maintained Data Integrity – Enforced validation rules, constraints, and consistency checks within SQL queries to ensure data quality.
Project: HCA
Cross-System Connectivity – Built Python utilities to connect source and target systems, enabling seamless data processing using PySpark.
Generic PySpark Framework – Enhanced and modularized existing PySpark code for reuse across multiple workflows and platforms.
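A minimal sketch of a config-driven, reusable PySpark read/write pattern of the kind described above; the JDBC connection details, table, and target path are hypothetical, and the job assumes the appropriate JDBC driver is available on the Spark classpath.

```python
from pyspark.sql import DataFrame, SparkSession

# Hypothetical config; real jobs would load this from a config file or metadata store.
JOB_CONFIG = {
    "source": {
        "format": "jdbc",
        "options": {
            "url": "jdbc:oracle:thin:@//db-host:1521/ORCL",
            "dbtable": "SALES.ORDERS",
            "user": "etl_user",
            "password": "***",
        },
    },
    "target": {
        "format": "parquet",
        "path": "gs://example-bucket/staging/orders/",
        "mode": "overwrite",
    },
}


def read_source(spark: SparkSession, cfg: dict) -> DataFrame:
    """Read from any Spark-supported source based on configuration."""
    return spark.read.format(cfg["format"]).options(**cfg["options"]).load()


def write_target(df: DataFrame, cfg: dict) -> None:
    """Write to the configured target format, path, and mode."""
    df.write.format(cfg["format"]).mode(cfg["mode"]).save(cfg["path"])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("generic_ingest").getOrCreate()
    write_target(read_source(spark, JOB_CONFIG["source"]), JOB_CONFIG["target"])
```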
Data Retrieval via SQL – Developed efficient SQL queries to extract relevant datasets from databases for downstream processing.
Version Control & Collaboration – Managed code through GitHub, including pull, push, and commit operations for seamless teamwork.
Data Debugging and Validation – Investigated and fixed data issues by sampling, sorting, and analyzing extracted records in Oracle.
Troubleshooting and Optimization – Identified pipeline failures and performance bottlenecks, applying fixes to improve reliability.