Rahul Kumar Jaiswal

Gurgaon

Summary

To obtain a challenging position in a high-quality engineering environment where my experience and academic skills will add value to organizational operations.
9.5 years of High Performance Computing administration experience.
Excellent knowledge in designing, prototyping and deploying HPC clusters.
Strong understanding of cluster resource managers, job schedulers, file systems and GPU computing.
Experience in benchmarking and performance optimization of large-scale HPC systems.
Experience in installing and managing high performance storage and network interconnects.
Extensive experience in developing Linux installers for cluster software and OS deployment and automation.
Experience in building computer labs from the ground up, capacity planning and installing racks.
Extensive experience in troubleshooting Linux OS, filesystems, cluster hardware, scripting and GPU computing hardware.
Ability to create, maintain and implement scripts in order to reduce administrative efforts.
Ability to operate in multi-platform, multi-operating-system, multi-component environments with a large number of server builds and configurations.
Experienced in project management, with a demonstrated ability to drive results.
Excellent interpersonal, communication, customer interaction, documentation and decision-making skills.
Experienced in organizing and delivering HPC systems workshops for users.

Work History

Senior AI-HPC Cluster Engineer

NVIDIA

Bengaluru, India

05.2025 - Current

Manage and operate large-scale GPU-based AI/HPC clusters across global data centers ensuring high uptime and efficiency.
Administer Slurm and Bright Cluster Manager (BCM) environments, optimize scheduling policies, and maintain Lua-based job submission validations.
Lead GPU utilization and waste-reduction initiatives, automating tracking and enforcement, reducing idle GPU hours significantly.
Troubleshoot large-scale training jobs, NCCL communication, I/O bottlenecks, and Slurm job hangs using Elasticsearch and OpenSearch telemetry.
Manage Lustre and WekaFS quotas, coordinating with storage teams to handle expansions, stale data cleanup, and inode optimization.
Develop automation tools in Python and Bash for job diagnostics, resource monitoring, and policy enforcement.
Collaborate with global teams to standardize cluster configurations and support incident response and root-cause analysis (RCA) processes.

HPC Engineer

Graviton Research Capital LLP

Gurgaon

05.2023 - 05.2025

Description:

In my role within HPC environments, I specialize in configuring, automating, and managing complex IT infrastructures. Utilizing tools such as Foreman, Ansible, GitLab, Semaphore, and the ELK stack, I optimize and maintain Slurm and Weka file systems. My focus includes supporting end-users and ensuring seamless operation of critical computational resources.

Managed and optimized a 250,000-core (2.5 lakh) HPC cluster with approximately 18 PB of storage, ensuring high availability and performance for critical computational workloads.
Managed node configuration and administration using Foreman.
Created Ansible roles for package installation, InfiniBand driver installation and configuration, Weka file system installation, and Slurm workload manager installation and configuration.
Automated configuration tasks post-OS installation using Ansible.
Implemented and configured Slurm features including local job priority management, multi-cluster setup, and enhanced job visibility with custom Slurm commands.
Configured GitLab for managing development tasks, including repository structures and CI/CD pipelines.
Installed, configured, and managed Semaphore for automating Ansible roles and integrated it with GitLab for continuous deployment and automation.
Installed, configured, and managed the ELK stack (Elasticsearch, Logstash, Kibana) for centralized log management and created dashboards using Metricbeat and Filebeat for system-level metrics monitoring.
Provided HPC user support, addressing Slurm job issues, implementing new features in Slurm, and resolving file system issues.
Upgraded the Weka file system to enhance performance and reliability.

Technical Lead HPC System

Tata Consultancy Services

Pune

06.2022 - 05.2023

Description:

Highly skilled HPC (High-Performance Computing) Specialist with extensive experience in configuring, managing, and optimizing HPC environments both on-premises and in the cloud. Proven expertise in deploying and maintaining complex HPC systems, including AI-enhanced clusters, with a strong focus on automation and user support. Proficient in compiling and optimizing scientific applications, managing containerized workloads, and leveraging advanced GPU technologies.

Configured HPC environments on AWS using AWS ParallelCluster.
Developed TCS's custom HPC infrastructure with integrated AI tools.
Compiled and optimized LAMMPS and GROMACS applications.
Managed Docker and Kubernetes user jobs using Rancher.
Managed user jobs with the Slurm job scheduler.
Assisted users with job failures, job scheduler issues, and Linux queries.
Created automation scripts for daily admin tasks and cluster management.
Managed AI-related HPC workloads using NVIDIA H100 DGX systems.
Configured NVIDIA MIG for multiple users on the same GPU card.
Ran the HPC Challenge benchmark for performance measurement.
Automated benchmarking processes with Ansible.

HPC System Engineer

Hewlett Packard Enterprise

01.2021 - 05.2022

Description:

The National Centre for Medium Range Weather Forecasting (NCMRWF) is a Centre of Excellence in Weather and Climate Modelling under the Ministry of Earth Sciences. Its HPC facility, MIHIR, has an Rpeak of around 3 petaflops. It is a liquid-cooled system that provides a balanced, high-performance computing platform with service nodes, login nodes, compute nodes and I/O nodes connected by the Cray Aries high-speed network (HSN) interconnect. It consists of 13 cabinets; each cabinet has 3 chassis, each chassis has 16 slots, and each slot has four nodes, for a total of 2322 nodes.

Maintained and managed the Cray XC40 HPC system, Cray CLS300 storage, PBS and the Spectra tape library.
Managed user accounts using Bright Cluster Manager.
Troubleshot hardware failures in the Cray servers, storage, tape library, etc.
Deployed a 10 PF HPC system at a government site.
Configured and managed the Cray E1000 storage system.
Assisted users with queries about HPC job failures, the job scheduler, Linux, etc.
Created and managed PBS queues for HPC production jobs.
Monitored servers and services using Nagios.
Performed system upgrades as new patches were released.
Monitored and managed the complete HPC environment, including not only the HPC servers but also the chiller plant, batteries, diesel generators (DGs), etc.
Delivered HPC training to new HPC users.
Implemented Bash scripts to automate day-to-day HPC system tasks.
Created weekly, monthly and yearly system status reports and submitted them to the client.
Coordinated with internal teams to troubleshoot production issues.
Designed, tested and deployed scripts for complex backup and restore procedures for critical customer data stores on the Lustre file system.
Ran IOR and MDTest to measure file system performance.
Received training and hands-on experience with Cray XC systems, HPE CMU, Cray storage, etc., from Cray/HPE.

HPC System Engineer

Centre for Development of Advanced Computing

02.2017 - 01.2021

Description:

HPC System Administrator of the IIT Delhi HPC facility (PADUM), a cluster with an Rpeak of around 2 petaflops, more than 17,000 CPU cores, 234 NVIDIA K40 GPUs and 40 NVIDIA V100 GPUs.

Installed and configured OpenHPC (community building blocks for HPC systems).
Installed and configured an OpenLDAP server (integrated with Kerberos) and clients on different nodes of the testbed cluster PARAM SANGAM.
Installed and configured the PBS and Slurm job schedulers for job scheduling.
Performed hardware inspection and acceptance testing for various SBI data centers in Hyderabad and Mumbai.
Performed hardware inspection of the testbed cluster PARAM SANGAM.
Delivered a presentation on the installation and configuration of a Beowulf cluster and OpenLDAP.
Assisted in the design, implementation, management and support of enterprise HPC cluster solutions in a highly complex, high-performance, low-latency environment.
Implemented an HPC infrastructure management script to check the load percentage on the HPC gateway server at IIT Delhi.
Delivered effective lectures on Linux and HPC every month in workshops at IIT Delhi.
Created a wrapper script that combines the scripts responsible for budget allocation.
Allocated quarterly budgets to all IIT Delhi departments using the HPC facility.
Performed troubleshooting and root cause analysis of HPC cluster and file system issues.
Maintained the IIT Delhi Supercomputing website.
Developed customized system documentation and technical training for the customer.
Implemented node provisioning through HPE CMU.
Created Bash shell scripts for resource monitoring and system maintenance.
Maintained network and data security and enforced security compliance policies on the OS.
Worked proactively with multiple OEMs to obtain the latest firmware updates and bug fixes and apply them to the running system with minimal downtime.
Ran High Performance Linpack benchmarks to check system efficiency.
Installed and maintained HPC applications, compilers, libraries, etc. as per user requests.
Implemented and maintained user quota management and resource allocation.
Installed and configured HPE iLO on nodes to check the hardware health of the servers.
Installed and configured the PBS project budgeting policy with PBS Allocation Manager.
Used PBS Analytics to generate graphs of the number of jobs, users, queues, nodes, etc.

Education

Post Graduate Diploma in IT Infrastructure Systems and Security (PG-DITISS), Centre for Development of Advanced Computing

Bachelor of Technology (Information Technology), I.E.T, Dr Ram Manohar Lohia Avadh University, Ayodhya (Faizabad)

Intermediate (Mathematics, Physics & Chemistry), Bishop George School and College

SSC (Secondary School Certificate), Bishop George School and College

Skills

HPC
Slurm, PBS
NVIDIA GPUs
Linux
Bright Cluster Manager
OpenHPC with xCAT
InfiniBand
AWS
Kubernetes
Docker
Ansible
Git
Semaphore
ELK
Zabbix, Nagios, Ganglia
Bash scripting
Cray XC40, XC50 clusters
Lustre file system, Weka file system

Certification

Oracle Cloud Infrastructure Foundations 2021 Associate
IT Security: Defense against the digital dark arts
ICSI | CNSS Certified Network Security Specialist
Post Graduate Diploma in IT Infrastructure, Systems and Security (PG-DITISS)
NSE 1 Network Security Associate

Personal Information

Passport Number: Z62 2007
Date of Birth: 10/15/94
Gender: Male
Nationality: Indian

Disclaimer

I hereby declare that the above-mentioned information is correct to the best of my knowledge and I bear the responsibility for the correctness of the above-mentioned particulars.

Rahul Kumar Jaiswal
