• Over 10 years of IT experience in Data Engineering and Application Development using Big Data technologies and cloud services.
• Working experience with the Hadoop ecosystem (Gen-1 and Gen-2) and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, ResourceManager (YARN), ApplicationMaster, and NodeManager.
• Experience with the Cloudera distribution and its components, including MapReduce, Spark, SQL, Hive, HBase, Sqoop, and PySpark.
• Good skills with NoSQL databases, particularly Cassandra.
• Proficient in developing Hive scripts for various business requirements.
• Knowledge of data warehousing concepts and OLTP/OLAP system analysis, and of designing database schemas such as star and snowflake schemas for relational and dimensional modelling.
• Good hands-on experience creating custom UDFs in Hive.
• Loaded and transformed large sets of structured, semi-structured, and unstructured data between relational database systems and HDFS using Sqoop.
• Working knowledge of Hive UDFs and the various Hive join types.
• Good understanding of Spark architecture and components; efficient in working with Spark Core, DataFrames/Datasets, the RDD API, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing (a representative batch sketch follows this summary).
• Hands-on experience with Spark, Scala, Spark SQL, and HiveContext for data processing.
• Working knowledge of GCP services such as Cloud Functions, Dataproc, and BigQuery.
• Experience with Azure cloud services, including ADF, ADLS, Blob Storage, Databricks, and Synapse.
• Extensive working experience with Agile development methodology and working knowledge of Linux.
• Expertise in working with big data distributions such as Cloudera and Hortonworks.
• Automated data pipelines using streams and tasks; involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrame API.
• Experience working with the Hive data warehouse tool: creating tables, distributing data via static and dynamic partitioning and bucketing, and applying Hive optimization techniques.
• Experience in tuning and debugging Spark applications and applying Spark optimization techniques.
• Knowledge of Spark architecture and components, with demonstrated efficiency in tuning compute and memory for performance and cost optimization.
• Expertise in developing batch data processing applications using Spark, Hive and Sqoop.
• Experience in working with CSV, JSON, XML, ORC, Avro and Parquet file formats.
• Good experience in designing and creating data ingestion pipelines using technologies such as Apache Kafka.
• Worked with popular AWS services such as S3, EC2, EMR, and Athena.
• Good knowledge of ETL methods for data extraction, transformation, and loading in enterprise-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
• Basic experience implementing the Snowflake data warehouse.
• Experience working with version control systems such as Git and GitHub, and with CI/CD pipelines.
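Representative PySpark batch sketch (illustrative only; the application name, HDFS paths, and column names are hypothetical placeholders rather than details from any client engagement):

```python
# Minimal PySpark batch sketch: read raw CSV, apply a simple transformation,
# and write partitioned Parquet for downstream Hive/Spark SQL queries.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("batch-transform-example")
    .getOrCreate()
)

# Read structured source data (schema inferred here for brevity).
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/orders/")          # hypothetical path
)

# Basic cleansing and derivation of a partition column.
cleaned = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
)

# Write as Parquet, partitioned by date.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("hdfs:///data/curated/orders/")  # hypothetical path
)

spark.stop()
```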
Client: Pepperstone (A FinTech Startup)
As part of this role, I am responsible for building scalable and efficient data pipelines with AWS services, Kafka Streams, Spark Structured Streaming, Kubernetes, and Docker.
Data Analytics and Integration Services (DAAIS)
• Responsible for building scalable distributed data solutions using Spark.
• Ingested log files from source servers into HDFS data lakes using Sqoop.
• Developed Sqoop Jobs to ingest customer and product data into HDFS data lakes.
• Developed Spark Streaming applications to ingest transactional data from Kafka topics into Cassandra tables in near real time (see the streaming sketch at the end of this section).
• Developed a Spark application to flatten the incoming transactional data using various dimension tables and persist it to Cassandra tables.
• Involved in developing a framework for metadata management on HDFS data lakes.
• Worked on various Hive optimizations such as partitioning, bucketing, vectorization, and indexing, and selected the right type of Hive join, such as bucket map join and sort-merge-bucket (SMB) join.
• Worked with various file formats such as CSV, JSON, ORC, Avro, and Parquet.
• Developed HQL scripts to create external tables and analyze incoming and intermediate data for analytics applications in Hive.
• Optimized Spark jobs using techniques such as broadcast joins, executor tuning, and persisting/caching.
• Responsible for developing custom UDFs, UDAFs and UDTFs in Hive.
• Analyzed tweet JSON data using the Hive SerDe API to deserialize it and convert it into a readable format.
• Orchestrated Hadoop and Spark jobs using Oozie workflows to define job dependencies and run multiple jobs in sequence for data processing.
• Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
• Skills/Technologies/Tools: Apache Hadoop, Apache Spark, Spark SQL, Spark Streaming, Hive, Cassandra, MySQL, HDFS, Apache Kafka, Python, Scala.
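Representative Spark Structured Streaming sketch for the Kafka-to-Cassandra ingestion described above (illustrative only; the broker address, topic, keyspace, table, and schema are hypothetical placeholders, and the DataStax spark-cassandra-connector is assumed to be on the classpath):

```python
# Minimal sketch: consume JSON events from a Kafka topic and append each
# micro-batch to a Cassandra table via foreachBatch.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (
    SparkSession.builder
    .appName("kafka-to-cassandra-example")
    .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
    .getOrCreate()
)

txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "transactions")                 # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; parse the JSON payload into typed columns.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is written with the batch Cassandra writer.
    (
        batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "payments")      # hypothetical keyspace
        .option("table", "transactions")     # hypothetical table
        .mode("append")
        .save()
    )

query = (
    parsed.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "hdfs:///checkpoints/transactions/")
    .start()
)
query.awaitTermination()
```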
Wealth Management Technology (WMT)
• Analyzed data using Hadoop components with Hive, Pig Queries and HBase queries.
• Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
• Involved in loading data from the UNIX file system to HDFS.
• Responsible for creating Hive tables, loading data, and writing Hive queries.
• Handled importing data from various data sources, performed transformations using Hive and MapReduce/Apache Spark, and loaded data into HDFS.
• Extracted data from the Oracle database into HDFS using Sqoop.
• Loaded data from web servers and Teradata using Sqoop and the Spark Streaming API.
• Utilized the Spark Streaming API to stream data from various sources; optimized existing Scala code and improved cluster performance.
• Tuned Spark applications (batch interval time, level of parallelism, memory settings) to improve processing time and efficiency (a configuration sketch follows this section).
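Representative tuning sketch for the Spark settings mentioned above (illustrative only; all values are hypothetical and workload-dependent):

```python
# Illustrative tuning sketch: parallelism, shuffle partitions, and executor
# memory set via SparkSession config.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = (
    SparkSession.builder
    .appName("tuned-batch-job")
    # Parallelism: align default and shuffle partitions with cluster cores
    # and data volume to avoid tiny or oversized tasks.
    .config("spark.default.parallelism", "200")
    .config("spark.sql.shuffle.partitions", "200")
    # Memory: size executors so tasks have headroom and spills are rare.
    .config("spark.executor.memory", "6g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.executor.cores", "4")
    # Serialization: Kryo is generally faster and more compact than Java.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# For the legacy DStream API, the batch interval is chosen when creating the
# StreamingContext (here a hypothetical 5-second micro-batch interval).
ssc = StreamingContext(spark.sparkContext, 5)
```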
Client: Vodafone Italy (VFIT)
Project Undertaken: Building a big data pipeline and warehousing solution to analyze data transformations and match them against a legacy system's data.
This involved building a data lake. Hadoop tools were used to transfer data to and from HDFS, and some of the sources were imported using Sqoop. The raw data was then stored in Hive tables in ORC format so that data scientists could perform analytics using Hive. New use cases were developed and the results loaded into a NoSQL database (HBase) for further analytics.
• Developed Sqoop scripts to import source data from the Oracle database into HDFS for further processing.
• Developed Hive scripts to store the raw data in ORC format (see the sketch after the tools list below).
• Involved in requirements gathering, design, development, and testing.
• Generated reports using Hive for business requirements received on an ad-hoc basis.
• Skills/Technologies/Tools: Cloudera CDH, Hadoop, HDFS, Hive, Sqoop, HBase.
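Representative sketch of the Hive DDL used to expose Sqoop-imported raw data and store it in ORC (illustrative only; PyHive is used here merely as a convenient client, the original work used plain HQL scripts, and the database, table, and column names are hypothetical):

```python
# Minimal sketch: an external staging table over the Sqoop target directory,
# plus a curated ORC table created from it for Hive analytics.
from pyhive import hive

conn = hive.Connection(host="hive-server2-host", port=10000, username="etl_user")
cursor = conn.cursor()

# External staging table over the HDFS directory that Sqoop writes to.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.customers_raw (
        customer_id BIGINT,
        name        STRING,
        signup_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/customers'
""")

# Curated ORC table queried by data scientists.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS curated.customers_orc
    STORED AS ORC
    AS SELECT * FROM staging.customers_raw
""")

cursor.close()
conn.close()
```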
Data Engineering
CERTIFICATE OF RECOGNITION
• Received appreciation for invaluable contributions to two projects.
CERTIFICATION OF EXCELLENCE
• Recognized for performance, dedication and support provided in the project.
AWS Certified Solutions Architect - Associate