Almost 7 years of experience designing and developing Big Data applications using Hadoop ecosystem technologies (HDFS, Hive, Sqoop, Apache Spark) and AWS.
● Experienced in optimizing Spark RDD performance by fine-tuning various configuration settings, including memory allocation, caching strategies, and serialization methods.
● Expertise in using Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
● Skilled in using Spark RDD persistence and caching mechanisms to reduce data processing overhead and improve query performance (a caching sketch follows this list).
● Familiar with schema and data type operations, such as adding, renaming, and dropping columns, casting data types, and handling null values (see the DataFrame schema sketch below).
● Skilled in optimizing Spark SQL performance through memory allocation, caching, and serialization.
● Proficient in processing serialized data using Avro, Parquet, ORC, and Protobuf (see the file-format sketch below).
● Experienced with binary and textual data formats, such as CSV, JSON, and XML, including serialization and deserialization using Spark DataFrames and RDDs.
● Optimized Spark jobs and data workflows for scalability, performance, and cost efficiency using partitioning, compression, and caching (see the partitioned-write sketch below).
● Implemented Spark SQL queries for data querying and aggregation (see the Spark SQL sketch below).
● Proficient in setting up and customizing Google Dataproc clusters, including cluster resizing and configuration tuning.
● Experienced in ETL testing methodologies, including data extraction from various sources, workflow testing, job scheduling, and transformation testing.
● Executed data cleansing and preprocessing tasks using Spark transformations to prepare data for analysis and reporting.
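A minimal Scala sketch of the RDD persistence pattern mentioned above. The HDFS path and the "ERROR" filter are hypothetical; the point is that persist(StorageLevel.MEMORY_AND_DISK) lets later actions reuse the filtered RDD instead of re-reading the source.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object RddCachingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("rdd-caching-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input path standing in for any large text source.
        val events = sc.textFile("hdfs:///data/events.txt")

        // Persist the filtered RDD so downstream actions reuse it rather
        // than re-reading and re-filtering the source on every action.
        val errors = events.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

        println(errors.count())            // first action materializes and caches
        errors.take(10).foreach(println)   // second action reads from the cache

        errors.unpersist()
        spark.stop()
      }
    }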
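A sketch of the schema and data type operations listed above, assuming a hypothetical CSV input whose column names (amount, legacy_flag, unused_col) are invented for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object SchemaOpsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("schema-ops-sketch").getOrCreate()

        // Hypothetical CSV input read with a header row.
        val raw = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

        val cleaned = raw
          .withColumn("amount", col("amount").cast("double"))    // cast string -> double
          .withColumnRenamed("legacy_flag", "is_active")         // rename a column
          .drop("unused_col")                                    // no-op if the column is absent
          .na.fill(Map("amount" -> 0.0, "is_active" -> "false")) // handle null values

        cleaned.printSchema()
        spark.stop()
      }
    }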
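A file-format sketch covering the Parquet, ORC, and Avro handling mentioned above. Paths are hypothetical; Parquet and ORC support is built into Spark, while the Avro write assumes the spark-avro package is on the classpath.

    import org.apache.spark.sql.SparkSession

    object FileFormatSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("file-format-sketch").getOrCreate()

        // Read Parquet, then write the same data back out in other formats.
        val df = spark.read.parquet("hdfs:///data/in.parquet")

        df.write.mode("overwrite").orc("hdfs:///data/out_orc")
        df.write.mode("overwrite").format("avro").save("hdfs:///data/out_avro")

        // Textual formats such as JSON use the same reader/writer API.
        df.write.mode("overwrite").json("hdfs:///data/out_json")

        spark.stop()
      }
    }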
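A partitioned-write sketch illustrating the partitioning and compression tuning listed above. The event_date column and output path are assumptions; repartitioning on the partition column before writing limits the number of small files per partition.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PartitionedWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()

        val events = spark.read.parquet("hdfs:///data/events.parquet")

        // Write Snappy-compressed Parquet, physically partitioned by event_date.
        events
          .repartition(col("event_date"))
          .write
          .mode("overwrite")
          .partitionBy("event_date")
          .option("compression", "snappy")
          .parquet("hdfs:///data/events_partitioned")

        spark.stop()
      }
    }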
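A Spark SQL sketch of the querying and aggregation work described above, using a hypothetical orders dataset registered as a temp view; table and column names are invented for illustration.

    import org.apache.spark.sql.SparkSession

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

        // Register a DataFrame as a temp view so it is queryable with SQL.
        val orders = spark.read.parquet("hdfs:///data/orders.parquet")
        orders.createOrReplaceTempView("orders")

        // Aggregate revenue per customer with plain Spark SQL.
        val revenue = spark.sql(
          """SELECT customer_id, SUM(amount) AS total_revenue
            |FROM orders
            |GROUP BY customer_id
            |ORDER BY total_revenue DESC""".stripMargin)

        revenue.show(10)
        spark.stop()
      }
    }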