Project: PVH.
Role: Data Engineer.
Environment: Azure Data Factory, Azure Databricks, PySpark, Spark SQL, Azure Synapse, Log Analytics Workspace, and EventHub.
Duration: Sep 2022 – Present.
Roles and Responsibilities:
- Extracted, transformed, and loaded data from source systems to Azure storage services using Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.
- Ingested data into Azure Data Lake, Azure Storage, Azure SQL, and Azure Synapse (SQL DW); processed the data in Azure Databricks.
- Developed Spark applications using PySpark and Spark SQL to transform data from multiple file formats into analytics-ready datasets (see the multi-format sketch after this list).
- Wrote automated SQL scripts for pipeline orchestration and data validation (a validation sketch follows this list).
- Managed streaming data ingestion in Databricks using Event Hub connection strings (see the streaming sketch after this list).
- Estimated cluster sizing, and monitored and troubleshot Spark workloads on Databricks clusters.
- Applied the Spark DataFrame API for in-session data manipulation.
- Demonstrated deep knowledge of Spark architecture, including Spark Core, Spark SQL, Spark Streaming, executors, tasks, and deployment modes.
- Implemented security and data governance policies using Databricks Unity Catalog (see the grants sketch after this list).
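A minimal PySpark sketch of the multi-format transformation pattern described above; the ADLS paths, column names, and join keys are hypothetical placeholders, not the actual project schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-format-transform").getOrCreate()

# Read the same business domain from three source formats (paths are placeholders).
orders = spark.read.option("header", True).csv("abfss://raw@acct.dfs.core.windows.net/orders/")
events = spark.read.json("abfss://raw@acct.dfs.core.windows.net/events/")
items = spark.read.parquet("abfss://raw@acct.dfs.core.windows.net/items/")

# Normalize types, join, and derive a partition column with the DataFrame API.
enriched = (orders
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .join(items, "item_id", "left")
    .join(events.select("order_id", "channel"), "order_id", "left")
    .withColumn("order_date", F.to_date("order_ts")))

# Land the curated output partitioned by date for downstream analytics.
(enriched.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("abfss://curated@acct.dfs.core.windows.net/orders_enriched/"))
```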
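A hedged example of the automated Spark SQL validation scripts mentioned above; `spark` is the ambient Databricks session, and the table and column names are illustrative.

```python
# Post-load checks: the table must be non-empty and the key must be non-null.
row_count = spark.sql(
    "SELECT COUNT(*) AS n FROM curated.orders_enriched").first()["n"]
null_keys = spark.sql(
    "SELECT COUNT(*) AS n FROM curated.orders_enriched WHERE order_id IS NULL").first()["n"]

failures = []
if row_count == 0:
    failures.append("curated.orders_enriched is empty")
if null_keys > 0:
    failures.append(f"{null_keys} rows with NULL order_id")

# Raising makes the Databricks job (and the ADF pipeline calling it) fail loudly.
if failures:
    raise ValueError("Validation failed: " + "; ".join(failures))
```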
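A sketch of Event Hub ingestion with Spark Structured Streaming, assuming the azure-event-hubs-spark connector is installed on the cluster; the secret scope, key names, and paths are hypothetical.

```python
# Pull the connection string from a Databricks secret scope (names are placeholders).
conn = dbutils.secrets.get(scope="etl", key="eventhub-connection-string")

# The connector expects the connection string to be encrypted via its helper.
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

stream = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Event Hub payloads arrive as binary; cast to string before landing them.
decoded = stream.selectExpr("CAST(body AS STRING) AS payload", "enqueuedTime")

(decoded.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/eh_ingest")
    .start("/mnt/bronze/eh_events"))
```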
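A minimal Unity Catalog grants sketch; the catalog, schema, table, and group names are placeholders for the actual policy set.

```python
# Grant read-only access to an analyst group, scoped catalog -> schema -> table.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.curated TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.curated.orders_enriched TO `analysts`")
```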
Project: Optum.
Role: Associate.
Environment: Azure Data Factory, Azure Databricks, PySpark, Spark SQL, Azure Data Lake, and Azure Blob Storage.
Duration: Sep 2020 – Aug 2022.
Roles and Responsibilities:
- Provisioned Hadoop and Spark clusters to support an on-demand data warehouse and enable data access for data scientists.
- Built data pipelines and processed data in Azure Databricks using PySpark and Spark SQL.
- Imported data from MySQL and other source systems into Azure Data Lake and Azure Blob Storage (see the JDBC import sketch after this list).
- Created tables and performed data validation using Spark SQL in Azure Databricks (a table-creation sketch follows this list).
- Loaded and transformed structured, semi-structured, and unstructured data for advanced analytics.
- Cleaned and parsed data for ingestion into Azure Databricks environments.
- Monitored system health, handled warning/failure logs, and optimized job execution.
- Reviewed application logs within Databricks and managed storage-level logging.
- Managed ingestion pipelines from cloud storage (Azure Blob Storage, ADLS) into Databricks (see the storage-access sketch after this list).
- Enabled downstream consumption of refined data by data scientists and analysts.
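A sketch of the MySQL-to-lake import via Spark's JDBC reader, assuming a MySQL driver is available on the cluster; the host, credentials, table, and partition bounds are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-import").getOrCreate()

# Partitioned JDBC read so the pull is parallelized across executors.
claims = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/claims")
    .option("dbtable", "member_claims")
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get(scope="etl", key="mysql-pwd"))
    .option("partitionColumn", "claim_id")  # numeric key to split ranges on
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load())

# Land the raw pull in the lake for downstream processing.
claims.write.mode("overwrite").parquet(
    "abfss://landing@acct.dfs.core.windows.net/member_claims/")
```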
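A hedged table-creation and validation example in Spark SQL; the schema and paths carry over from the import sketch above and are illustrative only.

```python
# Register the landed files as a managed Delta table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.member_claims
    USING DELTA
    AS SELECT * FROM parquet.`abfss://landing@acct.dfs.core.windows.net/member_claims/`
""")

# Validate: the business key must be unique.
dupes = spark.sql("""
    SELECT claim_id, COUNT(*) AS n
    FROM curated.member_claims
    GROUP BY claim_id
    HAVING COUNT(*) > 1
""")
if dupes.count() > 0:
    raise ValueError("duplicate claim_id values found in curated.member_claims")
```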
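A storage-access sketch showing how Blob/ADLS-to-Databricks ingestion can be wired up, using account-key auth for brevity; the account, container, and secret names are placeholders (a service principal is the more common production choice).

```python
# Authenticate the session to the storage account with a key from a secret scope.
spark.conf.set(
    "fs.azure.account.key.acct.dfs.core.windows.net",
    dbutils.secrets.get(scope="etl", key="storage-account-key"))

# Read the daily extracts and expose them to Spark SQL consumers.
raw = spark.read.option("header", True).csv(
    "abfss://raw@acct.dfs.core.windows.net/daily_extracts/")
raw.createOrReplaceTempView("daily_extracts")
```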