
- 4+ years of experience in the data domain, spanning data engineering, data cleaning, data wrangling, data transformation, and data analysis.
- Hands-on experience with the Hadoop ecosystem, including Spark, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, and Kafka.
- Proficient with AWS services such as EMR, S3, EC2, Glue, Athena, and Redshift.
- Experience transferring data from relational databases to HDFS and Hive tables using Sqoop.
- Hands-on experience in exploratory data analysis, using numerical summaries and visualizations to support feature engineering.
- Strong experience developing PySpark scripts to manage data transformations and data delivery for batch and streaming workloads.
- Designed and developed streaming pipelines with Apache Kafka and PySpark that ingest from multiple sources, such as APIs and data lakes, to improve monitoring performance (see the streaming sketch below).
- Developed PySpark jobs in Databricks to implement ETL pipelines that source data from S3 and load it into Snowflake (see the batch ETL sketch below).
- Experience creating Hive tables, partitions, and buckets, and writing HiveQL queries to optimize performance (see the Hive DDL sketch below).
- Developed Linux shell scripts to migrate data from DB2 into Hive tables, applying the necessary transformations with big data technologies.
- Hands-on experience across big data lifecycle phases, including data ingestion, data analytics, and data visualization.
- Expertise in SQL, including window functions, CTEs, complex joins, and advanced date, time, and conditional aggregations (see the SQL sketch below).
- Experience designing time-driven and data-driven automated workflows using Oozie.
- Created shell scripts to orchestrate pipeline architectures involving Python, SQL, Syncsort ETL, and ThoughtSpot jobs.
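
Streaming sketch: a minimal example of the Kafka-to-PySpark Structured Streaming ingestion described above. The broker address, topic name, event schema, and checkpoint path are illustrative placeholders, not values from any specific project.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema for the incoming JSON events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("metric", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
    .option("subscribe", "events")                      # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("console")  # stand-in sink; a real pipeline would write to a lake table
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .start()
)
query.awaitTermination()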
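Batch ETL sketch: a minimal example of a PySpark job that reads from S3, applies a light transformation, and writes to Snowflake through the Spark Snowflake connector. The bucket path, column names, target table, and connection options are placeholders; in Databricks the credentials would normally come from secrets rather than literals.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("s3-to-snowflake-sketch").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # assumed S3 path

# Example transformation: type the order date and drop obviously bad rows.
clean = (
    orders
    .withColumn("order_date", to_date(col("order_ts")))  # order_ts is a hypothetical column
    .filter(col("order_id").isNotNull())
)

# Snowflake Spark connector options (placeholder values).
sf_options = {
    "sfURL": "account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "***",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

(
    clean.write
    .format("snowflake")          # short name available when the connector is installed
    .options(**sf_options)
    .option("dbtable", "ORDERS_CLEAN")
    .mode("overwrite")
    .save()
)
```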
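Hive DDL sketch: a minimal example of creating a partitioned, bucketed Hive table and querying it through spark.sql. Table, column, and partition names are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-table-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Partition by load date and bucket by customer_id to speed up joins and sampling.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Typical HiveQL query that prunes partitions before aggregating.
daily_totals = spark.sql("""
    SELECT load_date, customer_id, SUM(amount) AS total_amount
    FROM sales
    WHERE load_date >= '2023-01-01'
    GROUP BY load_date, customer_id
""")
daily_totals.show()
```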
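SQL sketch: a minimal example of the window-function and CTE style of query mentioned above, run through spark.sql. It assumes an orders table or view is already registered; all names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-window-sketch").getOrCreate()

latest_orders = spark.sql("""
    WITH ranked AS (
        SELECT
            customer_id,
            order_id,
            amount,
            order_date,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY order_date DESC
            ) AS rn,
            SUM(amount) OVER (PARTITION BY customer_id) AS lifetime_amount
        FROM orders
    )
    SELECT customer_id, order_id, amount, order_date, lifetime_amount
    FROM ranked
    WHERE rn = 1  -- most recent order per customer
""")
latest_orders.show()
```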