Piyush Chandra

Summary

Big Data Engineer with 10.5 years of experience, specializing in PySpark and distributed data processing. Strong background in building scalable ETL pipelines, optimizing Spark performance, and working with large datasets across on-premises and cloud environments. Passionate about delivering reliable, efficient, and business-driven data solutions.

Work History

Data Engineer

Cognizant

03.2023 - Current

Conducted a thorough requirements analysis by collaborating with business stakeholders and technical teams to understand functional needs, data sources, reporting expectations, and performance goals for both on-premises and AWS-based solutions.
Designed and optimized robust schema structures for Hive tables, ensuring efficient storage, partitioning strategies, and query performance, while aligning with data governance and scalability standards.
Developed and deployed near-real-time (NRT) reporting solutions end to end, from the ground up, including data ingestion, transformation logic, and delivery of insights through scheduled and event-driven pipelines.
Performed extensive performance tuning of Spark and Hive jobs by analyzing execution plans, optimizing SQL queries, adjusting resource configurations, and implementing best practices to enhance throughput and reduce latency.
Created and maintained Spark framework XML configuration files, embedding SQL scripts to manage ETL workflows from source systems to target data stores, ensuring reusability, modularity, and operational stability.
Integrated with Apache Kafka to consume real-time event streams as part of the NRT reporting architecture, handling schema variations, offset management, and fault tolerance to support continuous data processing.
Implemented logic fixes and enhancements to meet evolving business requirements, including debugging complex data issues, adjusting transformation rules, and validating results against expected outcomes.
Enhanced existing Spark and SQL logic to introduce new business attributes, enrich Hive table structures, and enable advanced analytical use cases through column additions and optimized joins.
Led the migration of legacy on-premises Spark/Scala jobs to AWS, re-architecting pipelines using PySpark on Amazon EMR and modernizing ETL processes for scalable and cost-efficient cloud execution.
Developed and orchestrated automated workflows using Apache Airflow DAGs on AWS, including scheduling, dependency management, error handling, and monitoring for PySpark jobs running in cloud environments.
Collaborated with cross-functional teams during AWS migration planning, including data engineering, DevOps, and cloud architects, to define migration strategies, data validation checkpoints, rollback plans, and performance benchmarks.
Ensured a smooth transition and operational readiness by documenting design decisions, conducting unit and integration testing on AWS environments, and supporting deployment validation with CI/CD pipelines.
Validated data in the new environment set up in parallel with PROD.
Converted the batch report into a near-real-time report using Kafka and PySpark.

Data Engineer

IBM India Pvt. Ltd.

08.2021 - 03.2023

Analyzed and interpreted business requirements by engaging with stakeholders to define data needs, scope, functional expectations, and success criteria for both on-premises and AWS cloud data processing workflows.
Designed and implemented optimized schema structures for Hive tables, incorporating partitioning and data modeling best practices to support scalable querying and efficient storage for large datasets.
Performed comprehensive performance tuning of Spark and Hive jobs, including query optimization, efficient join strategies, resource configuration tuning, and partition pruning, to improve execution performance and reduce ETL runtime.
Developed and maintained Spark framework XML configuration files containing embedded SQL logic to perform ETL from source systems to target data stores, enabling consistent and modular data transformation processes.
Executed complex logic fixes and enhancements to align data processing with evolving business requirements, troubleshooting logic discrepancies, debugging transformation scripts, and validating results against expected outcomes.
Enhanced Spark/SQL logic to incorporate new business attributes, expanding Hive table structures, and ensuring downstream systems could leverage enriched datasets for analytics and reporting.
Validated data accuracy in new parallel environments, executing rigorous data reconciliation between legacy production and newly provisioned environments to ensure consistency, integrity, and readiness for cutover.
Transformed traditional batch reporting pipelines into near-real-time streaming solutions by integrating Apache Kafka with Spark Structured Streaming, enabling continuous processing of data streams and reducing data latency for key business dashboards.
Leveraged AWS cloud services to process Spark jobs at scale, migrating on-premises Spark/Scala workloads to AWS by orchestrating PySpark jobs on Amazon EMR clusters and using Amazon S3 for staging and persistent storage of intermediate and final datasets.
Developed data processing solutions on AWS that automated PySpark job execution, incorporating cloud storage, compute orchestration, and scalable resource provisioning to align with modern data architecture standards.

Software Engineer

Achala IT Solutions

05.2015 - 08.2021

Gathered and analyzed business and technical requirements by collaborating closely with onshore stakeholders to design scalable data solutions.
Designed and implemented efficient schema models for MariaDB and Apache Hive, optimized for large-scale analytical workloads.
Developed and optimized Apache Spark applications to build and manage customer base datasets using distributed processing.
Implemented Spark Core and Spark SQL programs to ingest data from the DBS Data Lake, applying complex business logic transformations.
Optimized complex SQL queries using Common Table Expressions (CTEs) to improve query performance and readability.
Worked extensively across multiple Data Lake layers (Raw, Curated, and Consumption layers) to perform cleansing, enrichment, and transformations.
Scheduled and monitored batch data-load jobs using TWS (Tivoli Workload Scheduler) to ensure timely and reliable execution.
Conducted performance benchmarking and comparison across Spark, Hive, and traditional SQL workloads to identify optimal processing strategies.
Improved data processing efficiency by leveraging Spark SQL optimizations, in-memory computation, and distributed execution.
Imported data into Spark RDDs, performing transformations and actions to support complex analytical use cases.
Developed Spark and Hive queries involving joins, aggregations, summations, and analytical functions to support reporting and insights.
Processed JSON data using Spark SQL by creating SchemaRDDs and loading structured data into Hive tables.
Wrote and maintained Sqoop scripts for initial and incremental data loads from RDBMS (MySQL) systems into HDFS.
Performed secure and efficient data transfers from Unix servers to HDFS using native HDFS commands.
Automated Sqoop ingestion pipelines using shell scripting and Oozie workflows to ensure repeatable and fault-tolerant ETL processes.
Designed and implemented Hive managed and external tables, including partitioning, bucketing, indexes, and views, to optimize query performance.
Worked hands-on with multiple data formats, such as Text, CSV, Parquet, Avro, and ORC, selecting appropriate formats based on use cases.
Developed and executed Spark SQL scripts using Spark Shell to process and transform Hive-based datasets.
Automated Spark Shell jobs using Autosys job scheduling, enabling hands-free production deployments.
Worked extensively with both structured and semi-structured data across large, distributed environments.
Created and managed Sqoop (v1.4.3) jobs with incremental load strategies to populate Hive external tables efficiently.
Optimized Hive and Sqoop scripts to significantly reduce ETL job execution time and improve system throughput.
Developed Hive (v1.2.0) scripts to support ad-hoc analysis and reporting requirements for business users and analysts.
Demonstrated strong expertise in Hive partitioning and bucketing strategies, ensuring optimal storage layout and query performance.
Utilized the ORC file format to achieve better compression, faster query execution, and improved storage efficiency.
Designed and implemented Oozie workflows to orchestrate end-to-end ETL pipelines, including Spark, Hive, and Sqoop jobs.

Education

MCA

Madurai Kamaraj University

01.2015

BCA

IGNOU

01.2011

12th

St Joseph's Public School

Dalsingsarai

01.2007

10th

St Joseph's Public School

Dalsingsarai

01.2005

Skills

Big Data Ecosystem: Spark, PySpark, Hive, Kafka, Impala, HBase, AWS

Programming Languages: Scala, Python

Databases: MySQL, Oracle, MS SQL

Interests

Listening to Music, Cricket, Learning New Technologies

Languages Known

English

Hindi

Disclaimer

Date of Birth: 1990-06-02
Father's Name: Mr. Dinesh Chandra
Mother's Name: Smt. Rajani Chandra
Date: 2022-07-12
(Piyush Chandra)