Shilpa Jadaun

Jaipur, Rajasthan

Summary

5.2 years of IT experience across various roles and technologies, including the last 4 years as a Big Data Engineer.

Big Data professional with a strong focus on data processing, data modification, advanced analytics, workflow automation, and ETL processes. Skilled in Hadoop, Spark, and Python, with significant experience in designing and implementing scalable data solutions. Pursuing a full-time role that presents professional challenges and leverages interpersonal skills, effective time management, and problem-solving expertise.

Overview

5 years of professional experience

Work History

Big Data Engineer

Hol Infosolutions Pvt Ltd
12.2020 - Current
  • Executed large-scale data processing using PySpark and AWS Hive on AWS EMR
  • Built data pipelines with AWS EMR and PySpark, reading from and writing to AWS S3
  • Designed Spark jobs to process data stored in AWS S3 and AWS Hive
  • Used AWS Hive to perform SQL queries on datasets processed by Spark on AWS EMR
  • Debugged PySpark code to resolve errors and improve efficiency on AWS EMR
  • Created and managed RDDs (Resilient Distributed Datasets) for data transformations
  • Utilized DataFrames for structured data manipulation and analysis
  • Worked with Spark's data serialization formats (Avro, Parquet, JSON, etc.)
  • Automated Spark job submission on AWS EMR using AWS Step Functions for consistent execution
  • Integrated AWS S3 with PySpark for intermediate storage in large-scale AWS EMR workflows
  • Used AWS Step Functions to parallelize PySpark jobs for faster execution on AWS EMR clusters
  • Designed and implemented ETL processes using Spark
  • Collaborated with data architects to design data storage solutions
  • Worked with Spark DataFrames for feature engineering
  • Integrated Spark with data lakes such as AWS S3, HDFS, EMR
  • Implemented Spark partitioning and caching strategies
  • Implemented data partitioning and shuffling strategies for optimization
  • Experience in handling Hive schema evolution with the Avro file format
  • Skilled in handling semi-structured/serialized data processing using Hive (Avro, Parquet, ORC)
  • Experienced in efficiently using Hive managed and external tables according to business requirements
  • Strong understanding of Hive serialized data processing performance optimization techniques, such as using columnar storage, data partitioning, and indexing, and their trade-offs in terms of query performance and resource utilization
  • Experienced in using Sqoop to import and export data to and from cloud-based data storage services such as Amazon S3
  • Developed Sqoop scripts to perform data transformations and data cleansing during data import from external databases into Hadoop clusters
  • Deployed PySpark jobs on AWS EMR clusters provisioned with specific AWS EC2 instances for cost optimization
  • Tuned PySpark jobs on AWS EMR to handle large-scale data stored in AWS S3
  • Implemented data aggregation and transformation in PySpark jobs on AWS EMR
  • Used AWS Hive to query structured data within AWS EMR jobs
  • Managed Spark job orchestration on AWS EMR using Airflow
  • Proficient in configuring Sqoop to import and export data using custom SQL queries and stored procedures
  • Proficient in writing Sqoop commands to transfer data between Hadoop and various databases such as MySQL, and SQL Server
  • Implemented efficient joins in PySpark jobs on AWS EMR to process relational data stored in AWS S3
  • Monitored AWS EMR cluster performance and optimized resource usage for long-running PySpark jobs

Mainframe System Engineer

Infosys Limited
02.2016 - 03.2017
  • Led daily mainframe operations, ensuring high availability and smooth execution of batch jobs and COBOL application processing
  • Provided 24/7 support for critical mainframe systems, ensuring system uptime and quick recovery during failures
  • Employed JCL tuning and COBOL code review to eliminate inefficient loops and enhance database query performance

Education

Bachelor of Technology

Poornima College Of Engineering
Jaipur
05.2015

XII

Indian Public School
Jaipur
01.2011

Skills

  • Hadoop (HDFS)
  • Sqoop
  • Apache Spark
  • Cloudera
  • MySQL
  • Hive
  • Python
  • SQL
  • PySpark programming
  • Apache Kafka
  • Amazon Web Services (AWS)
  • ETL implementation and processing
