Accomplished Senior Principal Software Engineer at Dell Technologies, specializing in multi-cloud big data solutions. An expert in Apache Spark and data governance, I drive impactful data pipeline management and improve data quality. My collaborative approach ensures high availability and scalability in large-scale data lake architectures, delivering measurable improvements in project outcomes.
Product Data Fabric: From DW/DL to a Centralized Lakehouse for Next-Gen Product Insights, Dell Technologies

This project consolidates the company's diverse data assets from existing data warehouses and data lakes into a single, unified Lakehouse architecture. The goal is to enhance analytics, enable advanced AI/ML capabilities, and streamline data management for greater efficiency and insight.

- Spearheaded the company's 'Product Data Fabric' initiative, building a cloud-agnostic Lakehouse to unify product-centric data (telemetry, attributes, design) from diverse sources (DW, DL, databases, CSV, JSON, ORC, Parquet). Managed 5TB+ daily ingestion and delivery to target systems for next-gen insights.
- Engineered Lakehouse pipelines with Spark (PySpark) on AWS EMR and Azure Databricks, optimizing data flow across the product lifecycle from manufacturing to sales. Reduced processing time by 30% for critical reports and ensured efficient data delivery to target systems.
- Managed unification of Dell's 5TB enterprise data lake into the Lakehouse (AWS S3 / Azure Data Lake Storage), implementing Delta Lake and AWS Lake Formation for governance across all product data attributes, preparing it for target systems.
- Developed ETL/ELT processes with AWS Glue, Azure Data Factory, and Spark, orchestrated with Apache Airflow, to deliver timely, accurate product insights and attribute data from various sources to target analytics and ML platforms.
- Implemented robust Lakehouse governance and security frameworks for sensitive product design and performance data, ensuring compliance and secure consumption for both internal Lakehouse use and external target-system delivery.
- Mentored teams on Lakehouse adoption, fostering excellence in data engineering practices for the company's global product data operations, including source ingestion and target delivery.
- Drove 15% cloud cost optimization for the new Lakehouse, enhancing efficiency for product insight generation and data distribution to target systems.
- Integrated ML model-serving pipelines within the Lakehouse, enabling real-time inference on diverse product telemetry data for applications such as predictive maintenance, with results delivered to operational target systems.
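The multi-format ingestion described in this project (CSV, JSON, ORC, Parquet arriving from diverse sources) can be illustrated with a minimal routing sketch. This is not the project's actual code; the names `route_source` and `SUPPORTED_FORMATS` are hypothetical, and in practice the returned string would be passed to a Spark reader (e.g. `spark.read.format(...)`).

```python
from pathlib import Path

# Hypothetical mapping from source-file extension to the Spark reader
# format string an ingestion job would use. Illustrative only.
SUPPORTED_FORMATS = {".csv": "csv", ".json": "json", ".orc": "orc", ".parquet": "parquet"}

def route_source(path: str) -> str:
    """Return the reader format to use for a given source file path."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported source format: {suffix!r}")

# Example: an incoming telemetry drop lands as Parquet in the raw zone
print(route_source("s3://lake/raw/telemetry/2024-05-01/events.parquet"))  # parquet
```

Keeping format routing in one table makes it easy to add a new source format without touching each pipeline.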
Unified Product Data Hub: Phase 1 Data Lake Development, Dell Technologies

- Contributed to the design and implementation of a fault-tolerant data streaming platform using Apache Kafka and Apache Flink for real-time ingestion and analytics on product telemetry and usage data within the developing Data Lake.
- Developed and optimized Hive and Spark SQL queries on the Data Lake's foundational Hadoop clusters, improving query performance by up to 25% for business intelligence users analyzing product performance and attribute data.
- Built automated data quality checks and monitoring systems to ensure the integrity and reliability of raw and curated product data ingested into the Data Lake.
- Participated in data modeling for new analytical requirements, translating business needs into efficient schemas for organizing product master data and related attributes within the Data Lake.
- Migrated key on-premises Hadoop workloads containing product data to cloud-native services (e.g., AWS EMR, Google Cloud Dataproc), streamlining operations and reducing overhead during initial Data Lake development.
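The automated data quality checks mentioned in this project can be sketched as a small batch validator. This is a minimal stdlib sketch, not the production system: the function name `quality_report`, the field names, and the 5% null-rate threshold are all illustrative assumptions.

```python
from collections import Counter

def quality_report(records, key_field, required_fields, max_null_rate=0.05):
    """Check key uniqueness and null rates on a batch of ingested records.

    records: list of dicts (one per ingested row).
    Returns a dict with duplicate keys, per-field null rates, and a pass flag.
    """
    keys = [r.get(key_field) for r in records]
    duplicates = sorted(k for k, n in Counter(keys).items() if n > 1)
    null_rates = {
        f: sum(1 for r in records if r.get(f) is None) / len(records)
        for f in required_fields
    }
    failed_fields = [f for f, rate in null_rates.items() if rate > max_null_rate]
    return {
        "duplicate_keys": duplicates,
        "null_rates": null_rates,
        "passed": not duplicates and not failed_fields,
    }

# Hypothetical telemetry batch: duplicate key P2 and a null telemetry value
batch = [
    {"product_id": "P1", "telemetry": 0.9},
    {"product_id": "P2", "telemetry": None},
    {"product_id": "P2", "telemetry": 0.7},
]
report = quality_report(batch, "product_id", ["telemetry"])
print(report["passed"])  # False
```

In a Spark pipeline the same checks would typically run as aggregations over a DataFrame before promoting a batch from the raw to the curated zone.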