Client: Sight Machine (Jan 2020 – Present)
Role: Core Architect
Framework Development:
- SMWorkspace CLI Tool: Designed and developed a CLI tool with 30+ features (pipeline listing, dashboard updating, Git-style diffing/reporting, data summarizing, etc.), integrated into the product and used by all customers.
- Operator Testing Framework: Validates changes instantly without pipeline execution, reducing test iteration time by ~80%.
- Data Quality Check Framework: Built for General Mills – replacing manual notebooks with dynamic, no-code analytics for factory teams.
- Spike and Sudden Drop Detection System: Built a Z-score + IQR-based anomaly detection engine to flag spikes in Kafka topic streams, with ingestion and processing optimized using concurrent Python threads (see the detection sketch after this list).
- SEEQ Integration Tool: Built a fully automated Python tool that extracts and parses analytical sheet data from SEEQ APIs, resolves dependencies using NetworkX graphs and topological sorting, and translates the results into Sight Machine-compatible data dictionaries using ANTLR4 parsing; eliminated manual dictionary creation and streamlined ingestion into analytics pipelines (a dependency-resolution sketch also follows this list).
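A minimal sketch of the spike-detection logic described above, assuming a simple sliding-window implementation; the window size, thresholds, and toy values are placeholders, and the production engine consumes Kafka topics with concurrent threads rather than a local list.

```python
# Minimal sketch, not the production engine: a combined Z-score + IQR test over a
# sliding window of metric values. The toy list stands in for messages consumed
# from a Kafka topic; window size and thresholds are illustrative.
from collections import deque
from statistics import mean, stdev

def is_spike(window, value, z_thresh=3.0, iqr_k=1.5):
    """Flag `value` when it fails both the Z-score and the IQR fence test."""
    if len(window) < 30:                       # wait for enough history
        return False
    mu, sigma = mean(window), stdev(window)
    z_flag = sigma > 0 and abs(value - mu) / sigma > z_thresh
    ordered = sorted(window)
    q1, q3 = ordered[len(ordered) // 4], ordered[(3 * len(ordered)) // 4]
    iqr = q3 - q1
    iqr_flag = value < q1 - iqr_k * iqr or value > q3 + iqr_k * iqr
    return z_flag and iqr_flag

window = deque(maxlen=200)
for value in [10.0, 10.3, 9.8, 10.1] * 15 + [57.0]:   # stand-in for a Kafka stream
    if is_spike(window, value):
        print(f"spike detected: {value}")
    window.append(value)
```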
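And a small sketch of the dependency-resolution step in the SEEQ tool, assuming field-level dependencies are modeled as a directed graph; the field names are hypothetical.

```python
# Hypothetical sketch: ordering SEEQ-derived formula fields with NetworkX so each
# data-dictionary entry is emitted after the fields it depends on.
import networkx as nx

# dependent field -> fields it references (illustrative names only)
dependencies = {
    "yield_pct": ["good_count", "total_count"],
    "scrap_rate": ["yield_pct"],
}

graph = nx.DiGraph()
for field, deps in dependencies.items():
    for dep in deps:
        graph.add_edge(dep, field)            # edge: dependency -> dependent

# A topological order guarantees dependencies come first; a cycle raises an error.
build_order = list(nx.topological_sort(graph))
print(build_order)   # e.g. ['good_count', 'total_count', 'yield_pct', 'scrap_rate']
```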
Role: Lead Data Engineer
Client: Orora via Sight Machine | Stack: Python, Kafka, Spark, GCP, Azure
- Owned and executed end-to-end migration of all 7 Orora facilities from ETL2 to ETL3 pipelines.
- Built the migration roadmap, scoped all data sources, and handled transformation mapping, Kafka topic integration, and visualization alignment.
- Developed and validated efficient pipelines tailored to each site, optimizing data processing and minimizing latency.
- Supported the full GCP-to-Azure migration, collaborating with the infrastructure team on coordination, validation, and UAT for production cutovers.
Role: Data Engineer
Client: Chamberlain Group | Stack: Azure Data Factory, Databricks, PySpark, DBT, Delta Live Tables (DLT)
- Developed end-to-end DLT pipelines on Azure Databricks, transforming source data into modeled tables based on customer-defined star schemas (see the DLT sketch after this list).
- Worked across the pipeline lifecycle: identifying source tables, designing inter-table dependencies, and building robust transformations aligned with business requirements.
- Applied PySpark and DBT for modular, testable transformation logic and semantic layer creation.
- Participated in alpha and unit testing of data pipelines, validating complex business logic and ensuring correctness across all stages before deployment.
- Gained hands-on experience with Azure Data Factory, orchestrating workflows and data movement at scale.
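A minimal sketch of a DLT stage of the kind described above, assuming a bronze source table; table and column names are placeholders, not the customer's star-schema models.

```python
# Hypothetical DLT sketch: one cleaned silver table and one fact table of a star
# schema. Table and column names are placeholders; `spark` is supplied by the DLT
# runtime when the pipeline runs on Databricks.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned device events feeding the fact table")
@dlt.expect_or_drop("valid_event_ts", "event_ts IS NOT NULL")
def silver_device_events():
    return (
        spark.read.table("bronze.device_events")
        .withColumn("event_date", F.to_date("event_ts"))
    )

@dlt.table(comment="Fact table keyed by device and date dimensions")
def fact_device_activity():
    return (
        dlt.read("silver_device_events")
        .groupBy("device_id", "event_date")
        .agg(F.count("*").alias("event_count"))
    )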
Role: Data Engineer / Solution Architect
Client: LeoLabs | Stack: AWS Lambda, S3, SNS, SQS, Redshift, EMR, PySpark, Deequ, CloudWatch
Near Real-Time CDM Pipeline:
- Architected and implemented a near real-time ETL workflow that ingests, transforms, and loads CDM JSON files from zip bundles via event-driven AWS services.
- New zip files are uploaded to S3, triggering SNS → SQS messages.
- A Lambda function processes messages in batches, extracts and consolidates the CDM JSON files, and loads them into Redshift using COPY from an intermediate S3 bucket (a handler sketch follows this section).
- Designed for low-latency, auto-triggered ingestion without manual intervention.
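A sketch of the event-driven Lambda stage under stated assumptions: SNS raw message delivery is disabled, the Redshift Data API stands in for the actual load mechanism, and all bucket, table, role, and cluster names are placeholders.

```python
# Hypothetical sketch: SQS delivers SNS-wrapped S3 notifications for new zip
# bundles; the handler extracts the CDM JSON files, stages them in an
# intermediate bucket, and issues a Redshift COPY. All names are placeholders.
import io, json, zipfile
import boto3

s3 = boto3.client("s3")
redshift = boto3.client("redshift-data")

STAGING_BUCKET = "cdm-staging-example"          # placeholder
COPY_SQL = ("COPY cdm_events FROM 's3://cdm-staging-example/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
            "FORMAT AS JSON 'auto'")

def handler(event, context):
    for record in event["Records"]:                       # one SQS record per message
        # SNS notification body wraps the original S3 event (raw delivery off).
        s3_event = json.loads(json.loads(record["body"])["Message"])
        for rec in s3_event.get("Records", []):
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

            # Consolidate every CDM JSON inside the zip into one newline-delimited file.
            lines = []
            with zipfile.ZipFile(io.BytesIO(body)) as zf:
                for name in zf.namelist():
                    if name.endswith(".json"):
                        lines.append(zf.read(name).decode("utf-8"))
            staged_key = f"staged/{key}.ndjson"
            s3.put_object(Bucket=STAGING_BUCKET, Key=staged_key,
                          Body="\n".join(lines).encode("utf-8"))

            # Load the staged file via COPY (Redshift Data API used here for brevity).
            redshift.execute_statement(
                ClusterIdentifier="cdm-cluster", Database="analytics",
                DbUser="etl_user", Sql=COPY_SQL.format(key=staged_key))
```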
External CDM ETL using Spark:
- Developed a scheduled Spark job on AWS EMR, triggered via Lambda + CloudWatch, that reads hourly data from RDS, performs transformations, and loads partitioned output to S3.
- Data is validated using Amazon Deequ, enforcing strict quality checks (data types, permissible values, null columns); see the validation sketch after this section.
- Only clean, validated data proceeds to Redshift, while failure cases are logged and reported through CloudWatch metrics.
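A minimal sketch of the Deequ gate, assuming the PyDeequ bindings; column names, allowed values, and the failure-handling path are illustrative, and the SparkSession must be started with the Deequ jar on its classpath.

```python
# Hypothetical sketch: a batch only proceeds to the Redshift load if every
# constraint passes. Column names and allowed values are placeholders.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

def validate_cdm_batch(spark, df):
    check = (Check(spark, CheckLevel.Error, "cdm quality checks")
             .isComplete("object_id")                                # no null identifiers
             .isNonNegative("miss_distance_km")                      # permissible values
             .isContainedIn("message_type", ["CDM", "EPHEMERIS"]))   # allowed categories
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    results_df = VerificationResult.checkResultsAsDataFrame(spark, result)
    failures = results_df.filter("constraint_status != 'Success'")
    if failures.count() > 0:
        failures.show(truncate=False)    # in production, surfaced via CloudWatch metrics
        raise ValueError("CDM batch failed Deequ validation; skipping Redshift load")
    return df
```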
Backfill Support & Scaling:
- Built robust backfill jobs for both CDM and external CDM data, supporting date-based and ID-based ranges.
- Designed EMR steps to run parallel, non-overlapping backfills with full metadata logging and progress tracking (sketched below).
- Ensured reliable Redshift loads with concurrency control and no data duplication.
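A sketch of the backfill driver under stated assumptions: the cluster ID, job script path, and chunk size are placeholders, and parallel execution assumes the cluster's step concurrency is raised above the default of one.

```python
# Hypothetical sketch: a requested date range is split into non-overlapping chunks,
# each submitted as its own EMR step. All names and paths are placeholders.
from datetime import date, timedelta
import boto3

emr = boto3.client("emr")

def backfill_chunks(start, end, chunk_days=7):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) with no overlap."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=chunk_days), end)
        yield cur, nxt
        cur = nxt

def submit_backfill(cluster_id, start, end):
    steps = [{
        "Name": f"cdm-backfill-{s.isoformat()}-{e.isoformat()}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/cdm_backfill.py",
                     "--start", s.isoformat(), "--end", e.isoformat()],
        },
    } for s, e in backfill_chunks(start, end)]
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)

# Example: backfill one month in week-sized, non-overlapping chunks.
# submit_backfill("j-EXAMPLE", date(2023, 1, 1), date(2023, 2, 1))
```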