A competent professional with 8 years of expertise in Hadoop, Spark, and their ecosystems – mainly Hadoop, Sqoop, Hive, Spark, PySpark, Python, Scala, AWS, and Elasticsearch
Overview
8 years of professional experience
Work History
Data Engineer
IBM
06.2024 - Current
Migrated the India jobs from SG on-prem servers to the GCP India server as per RBI guidelines
Developed new jobs in the UAT environment and performed data enrichments such as filtering and aggregation using Spark, PySpark, and Hive per business requirements in the Sparkola tool (a minimal PySpark sketch follows this role)
Enabled the jobs in Airflow and validated their regular runs in the job server
Deployed the jobs in UAT, QA, and PROD environments
Validated the jobs in the job server and their dependencies in Airflow
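A minimal PySpark sketch of the filtering and aggregation enrichments described above, assuming a hypothetical transactions table and column names; the actual jobs were authored in the Sparkola tool against project-specific schemas.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical table and column names, used only for illustration.
spark = (SparkSession.builder
         .appName("india-jobs-enrichment")
         .enableHiveSupport()
         .getOrCreate())

txns = spark.table("uat_db.transactions")

# Filtering: keep only India-booked, non-reversed transactions.
filtered = txns.filter((F.col("booking_country") == "IN") &
                       (F.col("status") != "REVERSED"))

# Aggregation: daily totals and counts per product.
daily_summary = (filtered
                 .groupBy("txn_date", "product_code")
                 .agg(F.sum("amount").alias("total_amount"),
                      F.count("*").alias("txn_count")))

# Persist the enriched output as a Hive table for downstream jobs.
daily_summary.write.mode("overwrite").saveAsTable("uat_db.daily_txn_summary")
```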
Senior Project Engineer
Wipro Limited
10.2021 - 06.2024
Created RDDs and DataFrames for the required input data and performed data transformations and actions using Spark Core and Spark DataFrames
Built data pipelines that are scalable, repeatable, and secure, and can serve multiple purposes
Constructed a state-of-the-art data lake on AWS using EMR, Spark, Step Functions, and CloudWatch Events
Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
Experienced with Spark architecture including Spark Core, RDDs, DataFrames, Datasets, Spark SQL, and Spark Streaming; imported data from HDFS into Spark RDDs for in-memory computation to generate output responses
Hands-on experience using Hive tables from Spark, performing transformations and creating DataFrames on Hive tables using Spark SQL
Experience in converting Hive/SQL queries into RDD transformations using Spark and Scala
Worked with Apache Spark components, which provide a fast, general engine for large-scale data processing
Migrated an existing on-premises application to AWS
Designed, built, and deployed multiple applications utilizing the AWS stack (EC2, S3, EMR), focusing on high availability, fault tolerance, and auto-scaling
Designed and built custom ETL processes in AWS using Lambda functions and EMR clusters, reducing cost overhead for the client
Developed and maintained an automated CI/CD pipeline for code deployment
This makes existing infrastructure easier to manage and lets complex change sets be applied with minimal human interaction, avoiding many possible human errors
This was achieved using technologies such as Terraform, Jenkins, GitHub, and AWS CI/CD services
Provided daily monitoring, management, troubleshooting, and issue resolution for systems and services hosted on cloud resources
Developed Spark code to perform data enrichments and calculations per business requirements
Worked on performance optimization of various ecosystems such as Hive, Sqoop, Spark, Elasticsearch, and Kibana
Performed data enrichments such as filtering, sorting, and aggregation using Spark and Hive
Loaded fact tables into Elasticsearch for visualization through Kibana (a minimal end-to-end sketch follows this role)
Created dashboards and visualizations in Kibana as per business requirements to monitor day-to-day changes in data
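A minimal sketch of the Hive-to-Elasticsearch flow described in this role, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath; the table, column, index, and host names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("fact-table-to-es")
         .enableHiveSupport()
         .getOrCreate())

# Create a DataFrame on a Hive table using Spark SQL (hypothetical table name).
orders = spark.sql("SELECT * FROM analytics_db.fact_orders")

# Data enrichment: filter, aggregate, and sort per business rules.
enriched = (orders
            .filter(F.col("order_status") == "COMPLETED")
            .groupBy("region", "order_date")
            .agg(F.sum("order_amount").alias("revenue"),
                 F.countDistinct("customer_id").alias("customers"))
            .orderBy("order_date"))

# Load the fact data into Elasticsearch for Kibana dashboards
# (requires the elasticsearch-hadoop connector; host, port, and index are illustrative).
(enriched.write
 .format("org.elasticsearch.spark.sql")
 .option("es.nodes", "es-host.example.com")
 .option("es.port", "9200")
 .mode("append")
 .save("orders_daily"))
```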
Software Engineer
EPAM Systems
05.2021 - 10.2021
Created RDDs and DataFrames for the required input data and performed data transformations and actions using Spark Core
Worked closely with business customers on requirements gathering
Designed a Hive repository with external tables, internal tables, buckets, partitions, and ORC compression for incremental loads of parsed data
Worked on performance optimization of various ecosystems such as Hive, Sqoop, and Spark
Performed data enrichments such as filtering, sorting, and aggregation using Spark
Built scripts for resource creation in the AWS cloud, such as Step Functions, Glue jobs, and Lambda handlers (a hedged boto3 sketch follows this role)
Hands-on experience building pipelines that implement business use-case functionality through data transformations
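A hedged boto3 sketch of the kind of AWS resource scripting listed above; the Glue job name, state machine ARN, and event fields are hypothetical, not the client's actual resources.

```python
import json
import boto3

glue = boto3.client("glue")
sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    """Hypothetical Lambda handler: starts a Glue ETL job and then a
    Step Functions execution that tracks the rest of the pipeline."""
    # Start the Glue ETL job (job name and argument are illustrative).
    run = glue.start_job_run(
        JobName="parse-and-load-job",
        Arguments={"--input_path": event.get("input_path", "s3://bucket/in/")},
    )

    # Hand the run id to a Step Functions state machine (ARN is illustrative).
    execution = sfn.start_execution(
        stateMachineArn="arn:aws:states:ap-south-1:123456789012:stateMachine:etl-pipeline",
        input=json.dumps({"glue_job_run_id": run["JobRunId"]}),
    )

    return {"glueJobRunId": run["JobRunId"],
            "executionArn": execution["executionArn"]}
```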
Software Engineer
Optum Global Solutions India PVT Ltd
10.2019 - 05.2021
Experienced in designing Hadoop and Spark applications and recommending the right solutions and technologies for them
Imported and exported RDBMS tables using Sqoop
Used Apache Hive to run MapReduce jobs on top of HDFS data
Built distributed in-memory applications using Spark Core and Spark SQL to run analytics efficiently on huge data sets
Created RDDs and DataFrames for the required input data and performed data transformations and actions using Spark Core
Worked closely with business customers on requirements gathering
Developed Sqoop jobs with incremental loads from a heterogeneous RDBMS (Oracle) using native DB connectors
Designed a Hive repository with external tables, internal tables, buckets, partitions, and ORC compression for incremental loads of parsed data
Experienced in developing Hive queries on different data formats such as text, CSV, and log files
Leveraged time-based partitioning in HiveQL to improve query performance
Created Hive external tables for data in HDFS and moved data from the archive layer to the business layer with Hive transformations
Worked on performance optimization of various ecosystems such as Hive and Sqoop
Improved tuning using Hive features such as partitioning, bucketing, indexes, and the cost-based optimizer (CBO); a minimal sketch follows this role
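A minimal sketch of the time-partitioned, ORC-backed Hive layout and incremental partition load described above, issued here through PySpark's spark.sql for consistency; the database, table, column names, dates, and paths are hypothetical, and the bucketing and index tuning applied on the Hive side is omitted.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-repository-layout")
         .enableHiveSupport()
         .getOrCreate())

# External, time-partitioned, ORC-backed table over parsed data in HDFS
# (database, table, columns, and location are illustrative).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS claims_db.claims_parsed (
        claim_id     STRING,
        member_id    STRING,
        claim_amount DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION '/data/business/claims_parsed'
""")

# Incremental load: overwrite only the partition for the current run date.
spark.sql("""
    INSERT OVERWRITE TABLE claims_db.claims_parsed
    PARTITION (load_date = '2021-01-15')
    SELECT claim_id, member_id, claim_amount
    FROM claims_db.claims_archive
    WHERE load_date = '2021-01-15'
""")

# Table statistics so the cost-based optimizer can choose better plans.
spark.sql("ANALYZE TABLE claims_db.claims_parsed COMPUTE STATISTICS")
```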
Software Engineer
HCL Technologies
05.2017 - 10.2019
Used Apache Hive to run MapReduce jobs on top of HDFS data
Built distributed in-memory applications using Spark Core and Spark SQL to run analytics efficiently on huge data sets
These applications were built using the Spark Scala API with YARN as the resource manager
Created RDDs and DataFrames for the required input data and performed data transformations and actions using Spark Core
Performed data enrichment, cleansing, and common data aggregations through RDD transformations (sketched after this role)
Performed interactive analysis of Hive tables through various DataFrame operations using Spark SQL
Involved in performance optimization of Spark jobs and designed efficient queries
Imported and exported data into HDFS using Sqoop
Handled heterogeneous data sources such as Oracle and different file formats
Created Sqoop jobs with incremental load to populate Hive External tables
Performed data enrichments such as filtering, sorting, and aggregation using Hive
Worked on performance optimization of various ecosystems such as Hive and Sqoop
Improved tuning using Hive features such as partitioning, bucketing, indexes, and the cost-based optimizer (CBO)
Experienced in developing Hive queries on different data formats such as text, CSV, and ORC files, leveraging time-based partitioning in HiveQL to improve performance
Used the Oozie scheduler to automate pipeline workflows and orchestrate the MapReduce jobs that extract data in a timely manner
Coordinated with the onshore team for code reviews and validation of final results
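A minimal sketch of the RDD-level enrichment, cleansing, and aggregation described in this role; the production code used the Spark Scala API, but the same transformations are shown here in PySpark against a hypothetical pipe-delimited record layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-enrichment").getOrCreate()
sc = spark.sparkContext

# Raw pipe-delimited records from HDFS (path and field layout are illustrative).
raw = sc.textFile("hdfs:///data/archive/claims/*.txt")

# Cleansing: split into fields, trim whitespace, drop malformed rows.
records = (raw
           .map(lambda line: [f.strip() for f in line.split("|")])
           .filter(lambda fields: len(fields) == 4 and
                   fields[3].replace(".", "", 1).isdigit()))

# Enrichment: keep approved claims only and key each record by member id.
approved = (records
            .filter(lambda f: f[2] == "APPROVED")
            .map(lambda f: (f[1], float(f[3]))))   # (member_id, claim_amount)

# Common aggregation: total approved amount per member.
totals = approved.reduceByKey(lambda a, b: a + b)

for member_id, amount in totals.take(10):
    print(member_id, amount)
```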