Results-driven Data Engineer with 3 years of experience designing and building scalable data pipelines and backend systems using Python.
Developed PySpark transformations on AWS EMR to process large AWS S3 datasets.
Debugged failed AWS Step Functions orchestrating PySpark jobs on AWS EMR.
Utilized AWS S3 for storing log files generated by Spark jobs on AWS EMR.
Automated PySpark job execution on AWS EMR with orchestration in AWS Step Functions.
Tuned AWS EMR cluster settings for optimized data processing in PySpark applications.
Designed fault-tolerant PySpark applications on AWS EMR with AWS S3 as the data source.
Automated the creation of AWS EMR clusters for running PySpark jobs that interact with AWS S3.
Debugged complex PySpark issues on AWS EMR by analyzing detailed error logs stored in AWS S3.
Integrated AWS Step Functions for orchestrating Spark workflows, optimizing task execution order.
Tuned resource allocation on AWS EMR to reduce costs while running large-scale PySpark jobs.
Able to troubleshoot common Spark RDD issues, such as data processing errors, performance bottlenecks, and scalability limitations.
Experienced in importing and exporting large datasets between Hadoop and relational databases using Sqoop.
Experienced in using Hive managed and external tables according to business requirements.
Familiar with Spark DataFrame APIs and SQL syntax; able to write complex SQL queries and DataFrame operations to solve business problems.
Overview
3 years of professional experience
1 Certification
Work History
Spark Developer
AECO Energy
08.2024 - Current
Collaborated with data scientists to integrate machine learning models into Spark pipelines.
Developed custom Spark functions for complex data transformations.
Debugged and fixed performance issues in PySpark jobs on AWS EMR.
Skilled in processing semi-structured and serialized data in Hive (Avro, Parquet, ORC).
Utilized AWS EC2 spot instances to run cost-effective AWS EMR clusters for PySpark jobs.
Implemented efficient joins in PySpark jobs on AWS EMR to process relational data stored in AWS S3.
Monitored AWS EMR cluster performance and optimized resource usage for long-running PySpark jobs.
Used AWS Step Functions to sequence PySpark jobs running on AWS EMR and manage dependencies.
Debugged and resolved resource contention issues in PySpark jobs running on AWS EMR.
Designed Hive tables to store and analyze data processed by Spark jobs on AWS EMR.
Automated Spark job submission on AWS EMR using AWS Step Functions for consistent execution.
Integrated AWS S3 with PySpark for intermediate storage in large-scale AWS EMR workflows.
Used AWS Step Functions to parallelize PySpark jobs for faster execution on AWS EMR clusters.
Spark Developer
Tata Consultancy Services
07.2022 - 07.2024
Expertise in using Spark DataFrame transformations and actions to process large-scale structured and semi-structured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
Strong understanding of Spark RDD integration with other big data technologies, such as Hadoop, Hive, and their impact on data processing workflows and performance.
Managed ETL processes with PySpark running on AWS EMR, utilizing AWS S3 for storage.
Version-controlled Airflow DAGs using Git and managed deployments with CI/CD pipelines.
Proficient in optimizing Sqoop imports and exports for performance and scalability.
Proficient in Hive partitioning and bucketing according to business requirements.
Client: Phoenix Group Holdings, UK
Education
B.E. - Electronics and Communication
P.A. College of Engineering and Technology
Pollachi, Tamil Nadu
04.2022
Skills
Big Data: Hadoop, HDFS, Hive, Sqoop, Spark
Languages: C, Python, Perl
ETL/Orchestration: Airflow, Glue
RDBMS: Oracle, MS SQL Server, MySQL
Warehouses: Redshift, Snowflake, PostgreSQL
Cloud: AWS (S3, EC2, Lambda, EMR, Glue, Airflow)
Front end: HTML, CSS
IDE and Build Tools: Visual Studio Code, JIRA, PyCharm, IntelliJ IDEA, Microsoft Copilot
Version Control: GitLab, Git
Certification
Data manipulation using Pandas (Udemy)
Python Programming Basics (HackerRank)
Office Automation (Bharathidasan University)
Recognition
Received a 40k rupee incentive for clearing the iON Proctored Assessment.
Publications
Published the paper "IoT enabled paddy field monitoring and disease detection system" in the Indian Journal of Natural Sciences (WoS-indexed), Vol. 13, Issue 73, August 2022, in connection with the ICADSIS 2022 conference.
Published the research paper "A Review of Advancements in Battery Technologies for Electric Vehicles" at the AICTE-sponsored online international conference "Future Electric Vehicular Mobility and Its Challenges" (ICFEVMC-2021).
Product Delivery Leader - Renault & Nissan at Visteon Technical And Services Center