Praveen Kumar Godati

Hyderabad

Summary

  • Big Data Engineer with 5 years of experience in Apache Spark (RDD, DataFrame, SQL), Hive, Sqoop, and HDFS. Expert in PySpark development, Hive performance tuning, AWS EMR deployment, and schema evolution with Avro. Skilled in integrating data from relational systems using Sqoop, automating ETL pipelines, and debugging distributed workflows. Experienced with file formats like Parquet, ORC, Avro, and orchestrating workflows using AWS Step Functions.
  • Experienced data engineer specializing in Apache Spark, with strong proficiency in RDDs, DataFrames, and SQL APIs using Python.
  • Almost 5 years of experience designing and developing big data ecosystems.
  • Skilled at optimizing Spark performance through memory tuning, efficient partitioning, and advanced serialization techniques.
  • Proficient in integrating Spark with big data ecosystems such as Hadoop, Hive, and Kafka to build end-to-end scalable data solutions.
  • Adept at processing a variety of data formats, including Avro, Parquet, ORC, and JSON, across structured and semi-structured batch workflows.
  • Hands-on experience deploying data pipelines on AWS EMR and Amazon S3, with a focus on fault tolerance, cost efficiency, and large-scale processing.
  • Strong track record of implementing caching, persistence, and transformation strategies to support real-time analytics and machine learning pipelines.
  • Experienced in production troubleshooting, performance monitoring, and applying Spark best practices to maintain reliable distributed applications.
  • Built scalable, distributed data pipelines using PySpark on AWS EMR to process high-volume datasets stored in AWS S3 (see the sketch after this list).
  • Integrated AWS Step Functions to orchestrate multi-stage PySpark workflows, enabling automation, monitoring, and error handling across batch jobs.
  • Proficient in writing SQL queries and scripts for data validation and verification during ETL testing.
  • Knowledgeable about metadata management and testing metadata-driven ETL processes.
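
The sketch below illustrates, in outline, the kind of PySpark batch pipeline described above: read raw data from Amazon S3, apply transformations, and write partitioned Parquet back to S3. It is a minimal example; the bucket names, paths, and column names are hypothetical placeholders, not references to any actual project.

    # A minimal PySpark batch pipeline sketch. Bucket names, paths,
    # and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-batch-pipeline").getOrCreate()

    # Read raw JSON events from S3 (EMRFS resolves the s3:// scheme on EMR).
    events = spark.read.json("s3://example-raw-bucket/events/")

    # Filter, derive a date column, and aggregate daily event counts.
    daily_counts = (
        events
        .filter(F.col("status") == "ok")
        .withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_date", "event_type")
        .agg(F.count("*").alias("events"))
    )

    # Write partitioned Parquet back to S3 for downstream consumers.
    (
        daily_counts.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-curated-bucket/daily_counts/")
    )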

Overview

5 years of professional experience

Work History

Big Data Developer

Wipro
03.2023 - Current
  • Optimized Spark jobs for performance and resource utilization.
  • Implemented Spark SQL queries for data querying and aggregation (see the sketch after this list).
  • Collaborated with data scientists to integrate machine learning models into Spark pipelines.
  • Developed custom Spark functions for complex data transformations.
  • Managed ETL processes with PySpark running on AWS EMR, utilizing AWS S3 for storage.
  • Configured Spark jobs on AWS EMR to efficiently read and write data from AWS S3.
  • Experienced in identifying and resolving performance bottlenecks in Hive, such as data skew, inefficient joins, and excessive shuffling.
  • Expertise in using Hive explain plans, query profiling, and metrics monitoring to diagnose query performance issues and optimize query execution.
  • Proficient in validating and cleansing data during transfers with Sqoop, including its built-in validation options.
  • Adept in scheduling and automating Sqoop jobs for incremental runs.
  • Technologies: AWS S3, EMR, EC2, Apache Spark (RDD, DataFrame, Spark SQL), Parquet, Avro, ORC, Protobuf, JSON, CSV
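
As a sketch of the Spark SQL work listed above, the example below registers a hypothetical dataset as a temporary view, runs an aggregation, and enables Spark 3's adaptive query execution, one common mitigation for the skew and shuffle issues mentioned. The table, paths, and columns are placeholders, not the actual project's schema.

    # A sketch of a Spark SQL aggregation over a hypothetical
    # "orders" dataset; paths and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-aggregation").getOrCreate()

    # Spark 3 adaptive query execution: AQE coalesces small shuffle
    # partitions and can split skewed join partitions at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    orders = spark.read.parquet("s3://example-bucket/orders/")
    orders.createOrReplaceTempView("orders")

    revenue = spark.sql("""
        SELECT customer_id,
               SUM(amount) AS total_revenue,
               COUNT(*)    AS order_count
        FROM orders
        GROUP BY customer_id
    """)

    # Inspect the physical plan before tuning further.
    revenue.explain()

    revenue.write.mode("overwrite").parquet("s3://example-bucket/revenue/")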

Data Engineer

Novartis
11.2020 - 01.2023
  • Debugged Spark jobs running on AWS EMR, identifying performance bottlenecks.
  • Implemented distributed data processing using PySpark on AWS EMR for batch workflows.
  • Developed custom data transformations in PySpark, leveraging AWS S3 for storage.
  • Automated Spark job executions on AWS EMR clusters using AWS Step Functions (see the sketch after this list).
  • Tuned PySpark performance on AWS EMR by adjusting resource configurations.
  • Executed large-scale data processing using PySpark and Hive on AWS EMR.
  • Designed Spark jobs to process data stored in AWS S3 and Hive tables.
  • Used AWS Step Functions to manage and monitor PySpark job execution workflows.
  • Proficient in optimizing Hive query performance by tuning various configuration settings, such as memory allocation, parallelism, and compression.
  • Technologies: Hadoop, HDFS, Hive, Sqoop, Spark SQL
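
The sketch below shows the equivalent of what a Step Functions EMR addStep state performs when orchestrating these jobs: submitting a spark-submit step to a running EMR cluster through the EMR API. The cluster ID, region, and script path are hypothetical placeholders.

    # A sketch of submitting a PySpark step to a running EMR cluster,
    # the same action a Step Functions EMR addStep state performs.
    # Cluster ID, region, and script path are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-EXAMPLECLUSTER",
        Steps=[
            {
                "Name": "daily-batch-transform",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    # command-runner.jar runs spark-submit on the cluster.
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://example-bucket/jobs/transform.py",
                    ],
                },
            }
        ],
    )
    print("Submitted step:", response["StepIds"][0])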

Education

MTech

Birla Institute of Technology and Science, Pilani
12.2024

B.Sc

Adikavi Nannaya University
10.2020

Skills

  • Data Ecosystem: Hadoop, Sqoop, Hive, Spark 3
  • Cloud Skills: AWS, Azure
  • Databases: MS SQL Server, MySQL
  • Languages: Python, SQL, UNIX Shell Script
  • Operating Systems: Linux and Windows
  • ETL & Workflow Orchestration: Sqoop, Apache NiFi, AWS Step Functions, Bash scripting
