Big Data Engineer with 5 years of experience in Apache Spark (RDD, DataFrame, SQL), Hive, Sqoop, and HDFS. Expert in PySpark development, Hive performance tuning, AWS EMR deployment, and schema evolution with Avro. Skilled in integrating data from relational systems using Sqoop, automating ETL pipelines, and debugging distributed workflows. Experienced with file formats like Parquet, ORC, Avro, and orchestrating workflows using AWS Step Functions.
Experienced data engineer specializing in Apache Spark, with strong proficiency in the RDD, DataFrame, and Spark SQL APIs in Python (see the API sketch after this summary).
Almost 5 years of experience in the design and development of big data ecosystems.
Skilled at optimizing Spark performance through memory tuning, efficient partitioning, and advanced serialization techniques (see the tuning sketch after this summary).
Proficient in integrating Spark with big data ecosystems such as Hadoop, Hive, and Kafka to build end-to-end scalable data solutions.
Adept at processing a variety of data formats, including Avro, Parquet, ORC, and JSON, in batch workflows over both structured and semi-structured data (see the format sketch after this summary).
Hands-on experience deploying data pipelines on AWS EMR and Amazon S3, with a focus on fault tolerance, cost efficiency, and large-scale processing.
Strong track record in implementing caching, persistence, and transformation strategies to support real-time analytics and machine learning pipelines.
Experienced in production troubleshooting, performance monitoring, and applying Spark best practices to maintain reliable distributed applications.
Built scalable, distributed data pipelines using PySpark on AWS EMR to process high-volume datasets stored in AWS S3.
Integrated AWS Step Functions to orchestrate multi-stage PySpark workflows, enabling automation, monitoring, and error handling across batch jobs (see the orchestration sketch after this summary).
Proficient in SQL queries and scripting for data validation and verification during ETL testing.
Knowledgeable about metadata management and testing metadata-driven ETL processes.
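A minimal API sketch covering the three Spark interfaces named above. Paths, column names, and app names here and in the following sketches are illustrative placeholders, not production values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-demo").getOrCreate()

    # RDD API: low-level transformations on raw text.
    lines = spark.sparkContext.textFile("events.txt")
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))

    # DataFrame API: the same kind of data, but with a schema.
    df = spark.read.json("events.json")
    df.groupBy("event_date").count().show()

    # SQL API: register a view and query it declaratively.
    df.createOrReplaceTempView("events")
    spark.sql(
        "SELECT event_type, COUNT(*) AS n FROM events "
        "GROUP BY event_type ORDER BY n DESC"
    ).show()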
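A tuning sketch showing the memory, partitioning, and serialization levers from the summary. The specific values are assumptions that would be sized to the actual cluster and workload:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuned-etl")
        # Memory tuning: hypothetical executor sizing.
        .config("spark.executor.memory", "8g")
        # Partitioning: raise shuffle parallelism for large joins/aggregations.
        .config("spark.sql.shuffle.partitions", "400")
        # Serialization: Kryo is faster and more compact than Java serialization.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    df = spark.read.parquet("s3://example-bucket/input/")  # placeholder bucket
    df = df.repartition(200, "customer_id")  # co-locate rows by the join key
    df.cache()  # persist for reuse across multiple downstream actions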
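A format-handling sketch for the Avro/Parquet/ORC/JSON work mentioned above; reading the "avro" format assumes the spark-avro package is available on the cluster classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-demo").getOrCreate()

    # Placeholder paths; each reader infers or applies the format's schema.
    json_df = spark.read.json("s3://example-bucket/raw/events_json/")
    avro_df = spark.read.format("avro").load("s3://example-bucket/raw/events_avro/")
    orc_df = spark.read.orc("s3://example-bucket/raw/events_orc/")

    # Normalize to partitioned Parquet for efficient downstream queries.
    json_df.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/curated/events/"
    )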
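An orchestration sketch for the Step Functions integration above. The state-machine ARN, account number, and payload are hypothetical; the retry and error-handling behavior referred to would live in the state machine's Retry and Catch blocks:

    import json
    import boto3

    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    # Kick off one execution of a multi-stage PySpark workflow.
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:nightly-etl",
        name="nightly-etl-2024-01-01",
        input=json.dumps({
            "input_path": "s3://example-bucket/raw/",
            "run_date": "2024-01-01",
        }),
    )
    print(response["executionArn"])  # usable for monitoring the run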
Overview
5 years of professional experience
Work History
Big Data Developer
Wipro
03.2023 - Current
Optimized Spark jobs for performance and resource utilization.
Implemented Spark SQL queries for data querying and aggregation (see the aggregation sketch at the end of this section).
Collaborated with data scientists to integrate machine learning models into Spark pipelines.
Developed custom Spark functions for complex data transformations.
Managed ETL processes with PySpark running on AWS EMR, utilizing AWS S3 for storage.
Configured Spark jobs on AWS EMR to efficiently read data from and write data to AWS S3.
Experienced in identifying and resolving performance bottlenecks in Hive, such as data skew, inefficient joins, and excessive shuffling.
Expertise in using Hive explain plans, query profiling, and metrics monitoring to diagnose query performance issues and optimize query execution (see the join-tuning sketch at the end of this section).
Proficient in validating data transfers with Sqoop's built-in validation option and cleansing records as part of ingestion.
Adept at scheduling and automating Sqoop jobs for incremental runs (see the Sqoop sketch at the end of this section).
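An aggregation sketch for the EMR/S3 pattern in the bullets above. Bucket and table names are placeholders; on EMR, s3:// paths are served by EMRFS:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

    # Read curated data from S3 and expose it to SQL.
    orders = spark.read.parquet("s3://example-bucket/orders/")
    orders.createOrReplaceTempView("orders")

    daily_revenue = spark.sql("""
        SELECT order_date,
               SUM(amount) AS revenue,
               COUNT(*)    AS order_count
        FROM orders
        GROUP BY order_date
    """)

    # Write the aggregate back to S3 for downstream consumers.
    daily_revenue.write.mode("overwrite").parquet(
        "s3://example-bucket/marts/daily_revenue/"
    )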
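The Hive tuning bullets above (skew, inefficient joins, excessive shuffling, explain plans) have a direct analogue in PySpark, the primary stack here. This join-tuning sketch, with placeholder paths, shows the broadcast-join fix for a shuffle-heavy join and the plan inspection that parallels reading a Hive EXPLAIN plan; in Hive itself the equivalent tools are EXPLAIN and join-optimization settings:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-tuning").getOrCreate()

    facts = spark.read.parquet("s3://example-bucket/facts/")  # large, possibly skewed
    dims = spark.read.parquet("s3://example-bucket/dims/")    # small dimension table

    # Broadcasting the small side replaces a shuffle-heavy join with a map-side join.
    joined = facts.join(broadcast(dims), "dim_id")

    # Inspect the physical plan to confirm the broadcast and spot residual shuffles.
    joined.explain()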
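A sketch of the incremental Sqoop automation described above; connection details and names are hypothetical. A saved Sqoop job persists --last-value between runs, so each scheduled execution imports only rows added since the previous run:

    import subprocess

    # One-time setup: create a saved job that does an incremental append import.
    create_job = [
        "sqoop", "job", "--create", "orders_incremental", "--",
        "import",
        "--connect", "jdbc:mysql://db.example.com/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--incremental", "append",
        "--check-column", "order_id",
        "--last-value", "0",
    ]
    subprocess.run(create_job, check=True)

    # Scheduled re-execution (e.g. via cron) resumes from the stored last value.
    subprocess.run(["sqoop", "job", "--exec", "orders_incremental"], check=True)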