Organized and self-motivated Data Engineer with six years of success designing and implementing database solutions. Goal-oriented, with the ability to understand business problems and build systems that improve functionality. Works effectively both independently and in collaborative settings.
- Built streaming pipelines on Amazon Managed Service for Apache Flink (Python), integrating Kinesis Data Streams, Redis, Kinesis Firehose, and S3 with CloudWatch/SNS alerting.
- TypeAhead Search: Designed an end-to-end Flink solution that ingests from Kinesis, applies entity identification and business rules, writes low-latency features to Redis, and ships enriched data via Firehose to S3 for downstream analytics in Athena (see the sketch after this list).
- Context Engine (POCs): Implemented multiple Flink POCs and a Neo4j-backed graph layer; consumed Kinesis streams and persisted entities/relationships for graph-driven queries.
- Packaging/Deployment: Standardized Flink packaging by building the dependency JAR (Maven/pom.xml), bundling Python code, requirements, connectors, and extra libraries into code.zip, publishing the artifact to S3, and configuring Managed Flink to consume it.
- Reliability/Observability: Added CloudWatch metrics and alarms for Redis and Flink crash detection; integrated SNS email notifications for proactive incident response (see the alarm sketch after this list).
- Data Quality on Glue: Authored a reusable Python package for batch data quality checks, enabling teams to import and run standardized validations across datasets (see the sketch after this list).
- Stack: Lambda, Managed Flink (Python), Kinesis Streams, Firehose, Redis, S3, Neo4j, Glue, EMR, EC2, IAM, SNS, Athena, SageMaker, EMR Studio, CloudWatch.
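A minimal PyFlink Table API sketch of the TypeAhead flow described above, assuming hypothetical stream names, JSON records, and the Kinesis/Firehose SQL connectors bundled in the application JAR (option names vary by connector version); the low-latency Redis write and the real business rules are omitted.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment (Managed Flink runs the job continuously).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: hypothetical Kinesis stream of raw search events in JSON.
t_env.execute_sql("""
    CREATE TABLE search_events (
        user_id     STRING,
        search_text STRING,
        event_time  TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'typeahead-events',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Sink: hypothetical Firehose delivery stream that lands enriched JSON on S3.
t_env.execute_sql("""
    CREATE TABLE enriched_events (
        user_id     STRING,
        search_text STRING,
        entity_type STRING
    ) WITH (
        'connector' = 'firehose',
        'delivery-stream' = 'typeahead-enriched',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Entity identification reduced to a single illustrative CASE rule.
t_env.execute_sql("""
    INSERT INTO enriched_events
    SELECT user_id,
           search_text,
           CASE WHEN search_text LIKE 'dr %' THEN 'provider' ELSE 'generic' END
    FROM search_events
""")
```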
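A hedged boto3 sketch of the crash-detection alarm pattern above: it alarms on the Managed Flink (Kinesis Data Analytics) fullRestarts metric and routes to an SNS topic that emails the team. The application name, topic ARN, and threshold are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the Flink application reports job restarts; the SNS topic
# (placeholder ARN) has email subscriptions for the on-call group.
cloudwatch.put_metric_alarm(
    AlarmName="typeahead-flink-restarts",
    Namespace="AWS/KinesisAnalytics",
    MetricName="fullRestarts",
    Dimensions=[{"Name": "Application", "Value": "typeahead-search"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:data-eng-alerts"],
)
```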
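One possible shape for the reusable data-quality package: plain PySpark helpers that Glue jobs import and call; function, column, and check names here are illustrative, not the package's actual API.

```python
from typing import Dict, List

from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def check_not_null(df: DataFrame, columns: List[str]) -> Dict[str, int]:
    """Return null counts per column; an empty dict means the check passed."""
    row = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in columns]
    ).collect()[0]
    return {c: n for c, n in row.asDict().items() if n}


def check_unique_key(df: DataFrame, key_columns: List[str]) -> int:
    """Return the number of duplicated key values; 0 means the check passed."""
    return df.groupBy(*key_columns).count().filter(F.col("count") > 1).count()
```

A Glue job would import these helpers after loading a dataset and fail the run if any check reports violations.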
- Led the migration of legacy ETL to PySpark on Hadoop/EMR, increasing throughput and system load capacity; optimized Spark jobs (~65% faster) via partitioning, join-strategy tuning, and Parquet optimizations (illustrated after this list).
- Built scalable ETL/ELT pipelines in Python/PySpark to cleanse, transform, and load data into an S3-based data lake (Parquet); enabled downstream consumption in analytics/BI.
- Orchestrated batch pipelines with Airflow DAGs on EC2, managing dependencies, scheduling, retries, and automation for end-to-end workflows and backfills (example DAG after this list).
- Created automation with Bash/Shell to export from relational stores (MySQL/PostgreSQL) to HDFS/S3 as Parquet; reduced manual ops and improved reliability.
- Implemented features to extract, transform, and load weekly PostgreSQL updates into target stores, ensuring schema compatibility and data freshness SLAs.
- Established comprehensive test and data validation coverage (unittest/pytest) for transformed datasets; enforced business rules and anomaly checks prior to release (example tests after this list).
- Set up CI/CD pipelines in Azure DevOps (YAML) for packaging, testing, and deploying data jobs; improved release cadence and consistency.
- Performed data analysis to design optimal aggregations and reporting flows, improving processing efficiency by ~30%; documented logic and data lineage.
- Drove code reviews, version control best practices (Git), and production readiness checks; collaborated with cross-functional teams for requirements and delivery.
- Supported platform hygiene and stability (OS/Hadoop updates, patches, version upgrades); built diagnostic monitors to detect output changes and notify via email.
- Contributed to UAT, defect triage, and production issue resolution within Agile/Scrum ceremonies using Jira.
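An illustrative PySpark fragment showing the optimization levers mentioned above: shuffle-partition tuning, broadcasting a small dimension table to avoid a shuffle-heavy join, and partitioned Parquet output for partition pruning. Paths, table names, and the partition count are placeholders, not the actual job.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("claims-etl")
    # Shuffle parallelism sized to the EMR cluster (value is illustrative).
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# Hypothetical inputs in the S3 data lake.
claims = spark.read.parquet("s3://datalake/raw/claims/")
providers = spark.read.parquet("s3://datalake/reference/providers/")

# Broadcast the small dimension table instead of a sort-merge join.
enriched = claims.join(broadcast(providers), "provider_id", "left")

# Partition the output by load date so downstream queries prune files.
(
    enriched
    .withColumn("load_date", F.current_date())
    .repartition("load_date")
    .write.mode("overwrite")
    .partitionBy("load_date")
    .parquet("s3://datalake/curated/claims/")
)
```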
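A minimal Airflow 2.x DAG sketch of the orchestration pattern described above; the DAG id, schedule, and script paths are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

# Daily batch pipeline: extract from Postgres, transform with Spark, validate.
with DAG(
    dag_id="daily_claims_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 4 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = BashOperator(
        task_id="extract_postgres",
        bash_command="python /opt/etl/extract_postgres.py --date {{ ds }}",
    )
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/etl/transform_claims.py --date {{ ds }}",
    )
    validate = BashOperator(
        task_id="validate_output",
        bash_command="python /opt/etl/validate_output.py --date {{ ds }}",
    )

    extract >> transform >> validate
```

Retries and the templated {{ ds }} run date keep reruns and backfills repeatable without manual intervention.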
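A small pytest sketch of the kind of data-validation tests referenced above, run against a local Spark session; fixture paths and column names are assumptions.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Lightweight local session for CI runs against small fixture datasets.
    return SparkSession.builder.master("local[2]").appName("dq-tests").getOrCreate()


def test_claim_amount_is_non_negative(spark):
    df = spark.read.parquet("tests/fixtures/curated_claims")
    assert df.filter(df.claim_amount < 0).count() == 0


def test_claim_id_is_unique(spark):
    df = spark.read.parquet("tests/fixtures/curated_claims")
    assert df.count() == df.select("claim_id").distinct().count()
```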
Blue Vector - Data Team, 01/2019 - 04/2019: Built an end-to-end ETL pipeline for customer data using pandas, SQL, and Google Notebook. Developed Python/pandas modules to read tables from a MySQL database, apply business logic, and load the results to target locations (sketch below).
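A simplified pandas/SQLAlchemy sketch of that extract-transform-load flow; the connection string, table, columns, and business rule are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection (credentials redacted).
engine = create_engine("mysql+pymysql://etl_user:***@db-host:3306/customers")

# Extract: read the source table into a DataFrame.
orders = pd.read_sql("SELECT * FROM orders WHERE status = 'COMPLETE'", engine)

# Transform: apply a simple, illustrative business rule.
orders["order_month"] = pd.to_datetime(orders["order_date"]).dt.to_period("M")
monthly = (
    orders.groupby(["customer_id", "order_month"], as_index=False)["amount"].sum()
)

# Load: write the result to the target location.
monthly.to_csv("output/monthly_customer_spend.csv", index=False)
```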