5+ years of IT experience in Big Data development using Hadoop, Hive, Spark, SQL, PySpark, and Python.
Experienced with AWS cloud services, including S3, EMR, Glue, and Lambda.
Proficient with the Snowflake cloud data warehouse.
Skilled in writing Spark SQL code and developing Spark scripts with the necessary optimizations.
Designed and created Hive external and managed tables with performance tuning.
Built pipelines for data transfer between Snowflake and AWS S3 storage.
Worked with various file formats: Avro, ORC, Parquet, JSON, and CSV.
Maintained code in GitHub and triggered jobs through AWS Step Functions.
Experienced with the Cloudera Data Platform (CDP).
Strong problem-solving and troubleshooting skills; effective team player.
Proficient in Agile methodology, actively participating in daily stand-ups, sprint planning, and retrospectives.
#Project 3: SCB-Athena RB Data Hub
Duration: 08/01/24 to 02/28/25
Technologies: Spark, Hive, SQL, HQL, Python, PySpark, Control-M, PyCharm, S3, Glue, Lambda
Description: Athena is a strategic program to create a centralized data hub for business analytics, starting with the retail unit and expanding to other business areas. Built on the Hive Big Data platform, it manages large-scale data processing while following strict guidelines to ensure data accuracy, performance, and compliance.
Roles and responsibilities:
#Project 2: Cloud Data Matrix
Duration:
Technologies: Spark, Snowflake, SQL, Python, PyCharm, S3, Glue, Lambda, Step Functions.
Description: The Cloud Data Matrix is the central platform for data exchange within the organization, facilitating transformation and processing. DataStage jobs have been migrated to Spark SQL and Snowflake SQL, enhancing platform efficiency and adaptability for optimal data management and exchange.
Roles and responsibilities:
Write and optimize SQL queries for comprehensive data analysis.
Perform complex data processing using Apache Spark SQL.
Perform data transformations in Snowflake using SQL (joins, aggregations).
Run Spark jobs on AWS Glue for data transformation and processing via Snowflake SQL (a Glue sketch follows this list).
Create ETL jobs with AWS Glue to extract data from various sources and store it in Amazon S3.
Create S3 buckets with data lake storage, lifecycle policies, and secure access controls.
Build pipelines to load and unload data between Snowflake and AWS S3 storage (see the Snowflake sketch after this list).
Process data into Snowflake for analytical querying.
Integrate Lambda functions with AWS services and monitor their performance (see the Lambda sketch after this list).
Ensure pipelines have error handling and monitoring mechanisms using AWS CloudWatch.
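Below is a minimal PySpark sketch of the kind of Glue transformation job described above; the job arguments, S3 paths, view name, and columns are hypothetical placeholders rather than the actual project code.

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Hypothetical job arguments; real names and paths differed per pipeline.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Read raw CSV data landed in S3 and register it for Spark SQL.
    raw_df = spark.read.option("header", "true").csv(args["source_path"])
    raw_df.createOrReplaceTempView("orders_raw")

    # Spark SQL transformation: a simple aggregation as an illustration.
    daily_totals = spark.sql("""
        SELECT order_date,
               customer_id,
               SUM(CAST(amount AS DOUBLE)) AS total_amount
        FROM orders_raw
        GROUP BY order_date, customer_id
    """)

    # Write curated output back to S3 as Parquet for downstream consumers.
    daily_totals.write.mode("overwrite").parquet(args["target_path"])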
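The Snowflake load and unload pipelines were typically driven by COPY INTO statements; the sketch below uses the snowflake-connector-python library, with the connection details, external stage, and table names as placeholders.

    import snowflake.connector

    # Placeholder connection details; real values would come from a secrets store.
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl_user",
        password="********",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )

    cur = conn.cursor()
    try:
        # Load: copy Parquet files from an external S3 stage into a Snowflake table.
        cur.execute("""
            COPY INTO ANALYTICS.PUBLIC.DAILY_TOTALS
            FROM @ext_s3_stage/daily_totals/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)

        # Unload: export query results back to S3 for downstream systems.
        cur.execute("""
            COPY INTO @ext_s3_stage/exports/daily_totals/
            FROM (SELECT * FROM ANALYTICS.PUBLIC.DAILY_TOTALS)
            FILE_FORMAT = (TYPE = PARQUET)
            OVERWRITE = TRUE
        """)
    finally:
        cur.close()
        conn.close()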
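For the Lambda integration and CloudWatch monitoring items, a simplified handler sketch is shown below; the event fields, bucket, and key are assumptions for illustration only.

    import logging

    import boto3

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    s3 = boto3.client("s3")


    def lambda_handler(event, context):
        """Triggered by an orchestrator; validates that an expected S3 object exists."""
        bucket = event.get("bucket", "my-data-lake-bucket")        # hypothetical default
        key = event.get("key", "curated/daily_totals/_SUCCESS")    # hypothetical default
        try:
            s3.head_object(Bucket=bucket, Key=key)
            logger.info("Validation passed for s3://%s/%s", bucket, key)
            return {"status": "SUCCEEDED", "object": f"s3://{bucket}/{key}"}
        except Exception:
            # Errors logged here surface in CloudWatch Logs and can drive alarms.
            logger.exception("Validation failed for s3://%s/%s", bucket, key)
            raise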
#Project 1: Unilever Insight Hub
Duration:
Technologies: Apache Spark, Hive, SQL, HDFS, Sqoop, IntelliJ IDE, and Shell Scripting.
Description: Unilever analyzes customer data from various sources, including social media, online reviews, loyalty programs, and sales transactions, to understand consumer preferences, behavior patterns, and sentiment. This data is used to tailor marketing campaigns, develop targeted product offerings, and improve overall customer experience, while also identifying inefficiencies and potential risks for better decision-making and enhanced supply chain management.
Roles and responsibilities:
Design and develop ETL workflows using Spark SQL to create denormalized tables.
Create Hive external tables with appropriate dynamic partitions (see the Hive sketch after this list).
Create HDFS directories to store data and Hive tables.
Implement data integrity and data quality checks in Hadoop using Hive and Linux scripts.
Collect, aggregate, and move data from servers to HDFS.
Create Hive tables, load data, and write Hive queries that execute as MapReduce jobs.
Load and transform structured data into various file formats (Avro, Parquet) in Hive.
Use Sqoop to export analyzed data to relational databases for report generation.
Perform actions and transformations (wide and narrow transformations) based on project requirements (see the transformation sketch after this list).
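A sketch of the Hive external table and dynamic-partition load pattern referenced above, issued through Spark SQL; the database, table, columns, staging table, and HDFS location are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("insight-hub-load")      # hypothetical application name
        .enableHiveSupport()
        .getOrCreate()
    )

    # External table over an HDFS directory, partitioned by load date, stored as Parquet.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS retail.customer_feedback (
            customer_id STRING,
            channel     STRING,
            sentiment   STRING,
            rating      INT
        )
        PARTITIONED BY (load_dt STRING)
        STORED AS PARQUET
        LOCATION '/data/retail/customer_feedback'
    """)

    # Allow dynamic partitioning so each load_dt value creates its own partition.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Insert from a staging table, letting Hive resolve partitions dynamically.
    spark.sql("""
        INSERT OVERWRITE TABLE retail.customer_feedback
        PARTITION (load_dt)
        SELECT customer_id, channel, sentiment, rating, load_dt
        FROM retail.customer_feedback_stg
    """)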
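To illustrate the narrow versus wide transformation work, a small PySpark example follows; the input path and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-demo").getOrCreate()

    sales = spark.read.parquet("/data/retail/sales")   # hypothetical input path

    # Narrow transformations: filter and select operate partition-by-partition, no shuffle.
    recent = sales.filter(F.col("sale_date") >= "2024-01-01").select("store_id", "amount")

    # Wide transformation: groupBy triggers a shuffle to bring each store's rows together.
    store_totals = recent.groupBy("store_id").agg(F.sum("amount").alias("total_amount"))

    store_totals.show(10)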