2.5+ years of experience in IT, including big data technologies, the Hadoop ecosystem, and the Spark framework; currently working extensively with Spark using Scala as the primary programming language.
Good understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
Developed Spark applications using Scala, Spark SQL, Spark RDDs, and the Spark DataFrame API for data cleaning and processing tasks.
Good knowledge of loading structured data from SQL Server and MySQL databases into HDFS using Sqoop.
Experience with AWS services including EMR, EC2 instances, and S3 buckets.
Decent familiarity with UNIX/Linux systems, including the ability to understand the interaction between applications and the operating system.
Hands-on experience in application development leveraging Scala, Spark, Hive, and Sqoop, with basic proficiency in Python for specific tasks within Spark jobs.
Detailed understanding of the Software Development Life Cycle (SDLC) and sound knowledge of Agile project implementation methodologies, including Scrum.
Apache Spark
Created Data Frames and performed analysis using Spark SQL.
Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and Python.
Excellent understanding of the Spark architecture and framework, including SparkSession, SparkContext, APIs, RDDs, Spark SQL, and DataFrames.
Experienced in optimizing Spark job performance by tuning configuration settings such as memory allocation, caching, and serialization.
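The RDD transformations and actions listed above compose like Scala's own collection combinators. A minimal word-count sketch using plain Scala collections (the object name and sample input are illustrative; on an RDD the final aggregation would be `reduceByKey`, which `groupMapReduce` mirrors here):

```scala
// Word count expressed with the same combinators used on RDDs
// (flatMap, filter, then a keyed reduce), on plain Scala collections.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))    // transformation: split lines into words
      .filter(_.nonEmpty)                      // transformation: drop empty tokens
      .groupMapReduce(identity)(_ => 1)(_ + _) // keyed aggregation, like reduceByKey

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark and scala", "spark on hadoop"))
    println(counts("spark")) // prints 2
  }
}
```

On a real cluster the same pipeline runs lazily: the transformations build a lineage graph, and only an action (e.g. `collect` or `saveAsTextFile`) triggers execution.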
Apache Sqoop
Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs like Snappy and Gzip.
Performed transformations on the imported data and exported it back to the RDBMS.
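A Sqoop import of the kind described above might look like the following; the hostname, credentials, table, and target path are placeholders, shown here with the Avro format and Snappy codec combination mentioned above:

```shell
# Import a MySQL table into HDFS as Snappy-compressed Avro
# (hostname, database, credentials, and paths are placeholders).
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --num-mappers 4
```

The matching `sqoop export` reverses the direction, reading from an HDFS directory and writing back into an RDBMS table.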
Apache Hive
Experience in writing HQL (Hive Query Language) queries to perform data analysis.
Created Hive External and Managed Tables.
Implemented partitioning and bucketing on Hive tables to optimize query performance.
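Partitioning and bucketing of the kind described might be declared as follows; the database, table, and column names are hypothetical:

```sql
-- Hypothetical external table: partitioned by load date (prunes scans)
-- and bucketed by customer_id (speeds up joins and sampling).
CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC
LOCATION '/warehouse/sales/orders';
```

Queries that filter on `load_date` then read only the matching partition directories instead of the whole table.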
Overview
2 years of professional experience
Work History
Data Engineer
Ipsos Research Pvt. Ltd
03.2023 - 09.2023
Collaborated with data modeling teams, stakeholders, and data analysts to comprehend data requirements and translate them into technical specifications and structured data representations.
Developed Spark applications in Scala for performing data cleansing, event enrichment, data aggregation, and data preparation to meet business requirements.
Implemented data quality checks and validation processes to ensure accuracy, consistency, and completeness of data.
Worked on various data formats like Avro, SequenceFile, JSON, Parquet, and XML.
Fine-tuned Spark applications to improve overall pipeline processing time.
Created Hive tables, loaded them with data, and wrote Hive queries to process it. Created partitions and applied bucketing on Hive tables, setting the required parameters to improve performance.
Debugged common issues with Spark RDDs and Data Frames, resolved production issues, and ensured seamless data processing in production environments.
Stored Spark-processed data in HDFS/S3 in appropriate file formats as per business requirements.
Performed import and export of data into HDFS and Hive using Sqoop and managed data within the environment.
Created EC2 instances and EMR clusters for Spark Code development and testing.
Performed step execution in EMR clusters for the job deployment as per requirements.
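Submitting a Spark job as an EMR step typically looks like the following AWS CLI call; the cluster ID, S3 jar path, and main class are placeholders:

```shell
# Add a Spark step to a running EMR cluster
# (cluster id, application jar path, and main class are placeholders).
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name="Daily aggregation",ActionOnFailure=CONTINUE,Args=[--class,com.example.DailyAgg,s3://my-bucket/jars/daily-agg.jar]'
```

EMR runs the step through `spark-submit` on the cluster; `ActionOnFailure=CONTINUE` keeps the cluster alive if the job fails.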
Used the Agile Scrum methodology for development.