Rahul Kumar Jaiswal

Gurgaon

Summary

  • Seeking a challenging position in a high-quality engineering environment where my experience and academic skills will add value to organizational operations.
  • 9.5 years of High Performance Computing administration experience.
  • Excellent knowledge in designing, prototyping and deploying HPC clusters.
  • Strong understanding of cluster resource managers, job schedulers, file systems, and GPU computing.
  • Experience in benchmarking and performance optimization of large-scale HPC systems.
  • Experience in installing and managing high performance storage and network interconnects.
  • Extensive experience in developing Linux installers for cluster software and OS deployment and automation.
  • Experience in building computer labs from the ground up, capacity planning, and installing racks.
  • Extensive experience in troubleshooting Linux OS, filesystems, cluster hardware, scripting and GPU computing hardware.
  • Ability to create, maintain and implement scripts in order to reduce administrative efforts.
  • Ability to operate in multi-platform, multi-operating-system, multi-component environments with a large number of server builds and configurations.
  • Experienced in project management, with a demonstrated ability to drive results.
  • Excellent interpersonal, communication, customer interaction, documentation, and decision-making skills.
  • Experienced in organizing and delivering HPC systems workshops for users.

Overview

9 years of professional experience
1 Certification

Work History

Senior AI-HPC Cluster Engineer

NVIDIA
Bengaluru, India
05.2025 - Current
  • Manage and operate large-scale GPU-based AI/HPC clusters across global data centers ensuring high uptime and efficiency.
  • Administer Slurm and Bright Cluster Manager (BCM) environments, optimize scheduling policies, and maintain Lua-based job submission validations.
  • Lead GPU utilization and waste-reduction initiatives, automating tracking and enforcement, reducing idle GPU hours significantly.
  • Troubleshoot large-scale training jobs, NCCL communication, I/O bottlenecks, and Slurm job hangs using Elasticsearch and OpenSearch telemetry.
  • Manage Lustre and WekaFS quotas, coordinating with storage teams to handle expansions, stale data cleanup, and inode optimization.
  • Develop automation tools in Python and Bash for job diagnostics, resource monitoring, and policy enforcement.
  • Collaborate with global teams to standardize cluster configurations and support incident response and root-cause analysis (RCA) processes.

HPC Engineer

Graviton Research Capital LLP
Gurgaon
05.2023 - 05.2025

Description:

In my role within HPC environments, I specialize in configuring, automating, and managing complex IT infrastructures. Using tools such as Foreman, Ansible, GitLab, Semaphore, and the ELK stack, I optimize and maintain Slurm and the Weka file system. My focus includes supporting end users and ensuring seamless operation of critical computational resources.

  • Managed and optimized a 250,000-core (2.5 lakh) HPC cluster with approximately 18 PB of storage, ensuring high availability and performance for critical computational workloads.
  • Managed node configuration and administration using Foreman.
  • Created Ansible roles for package installation, InfiniBand driver installation and configuration, Weka file system installation, and Slurm workload manager installation and configuration.
  • Automated configuration tasks post-OS installation using Ansible.
  • Implemented and configured Slurm features including local job priority management, multi-cluster setup, and enhanced job visibility with custom Slurm commands.
  • Configured GitLab for managing development tasks, including repository structures and CI/CD pipelines.
  • Installed, configured, and managed Semaphore for automating Ansible roles and integrated it with GitLab for continuous deployment and automation.
  • Installed, configured, and managed the ELK stack (Elasticsearch, Logstash, Kibana) for centralized log management and created dashboards using Metricbeat and Filebeat for system-level metrics monitoring.
  • Provided HPC user support, addressing Slurm job issues, implementing new features in Slurm, and resolving file system issues.
  • Upgraded the Weka file system to enhance performance and reliability.

Technical Lead HPC System

Tata Consultancy Services
Pune
06.2022 - 05.2023

Description:

Highly skilled HPC (High-Performance Computing) Specialist with extensive experience in configuring, managing, and optimizing HPC environments both on-premises and in the cloud. Proven expertise in deploying and maintaining complex HPC systems, including AI-enhanced clusters, with a strong focus on automation and user support. Proficient in compiling and optimizing scientific applications, managing containerized workloads, and leveraging advanced GPU technologies.

  • Configured HPC environments on AWS using AWS ParallelCluster.
  • Developed TCS's custom HPC infrastructure with integrated AI tools.
  • Compiled and optimized LAMMPS and GROMACS applications.
  • Managed Docker and Kubernetes user jobs using Rancher.
  • Managed user jobs with the Slurm job scheduler.
  • Assisted users with job failures, job scheduler issues, and Linux queries.
  • Created automation scripts for daily admin tasks and cluster management.
  • Managed AI-related HPC workloads using NVIDIA H100 DGX systems.
  • Configured NVIDIA MIG for multiple users on the same GPU card.
  • Ran the HPC Challenge benchmark to measure performance.
  • Automated benchmarking processes with Ansible.

HPC System Engineer

Hewlett Packard Enterprise
01.2021 - 05.2022

Description:

The National Centre for Medium Range Weather Forecasting (NCMRWF) is a Centre of Excellence in Weather and Climate Modelling under the Ministry of Earth Sciences. Its HPC facility, MIHIR, has an Rpeak of around 3 petaflops. It is a liquid-cooled system that provides a balanced, high-performance computing platform with service nodes, login nodes, compute nodes, and I/O nodes connected by the Cray Aries high-speed network (HSN) interconnect. It consists of 13 cabinets; each cabinet has 3 chassis, each chassis has 16 slots, and each slot has four nodes. In total it has 2322 nodes.

  • Maintained and managed the Cray XC40 HPC system, Cray CLS300 storage, PBS, and the Spectra tape library.
  • Managed user accounts using Bright Cluster Manager.
  • Troubleshot hardware failures in the Cray servers, storage, tape library, etc.
  • Deployed a 10 PF HPC system at a government site.
  • Configured and managed the Cray E1000 storage system.
  • Assisted users with queries related to HPC job failures, the job scheduler, Linux, etc.
  • Created and managed PBS queues for HPC production jobs.
  • Monitored servers and services using Nagios.
  • Performed system upgrades whenever new patches were released.
  • Monitored and managed the complete HPC environment, including not only the HPC servers but also the chiller plant, batteries, DGs, etc.
  • Delivered HPC training to new HPC users.
  • Implemented Bash scripts to automate day-to-day HPC system tasks.
  • Created weekly, monthly, and yearly system status reports and submitted them to the client.
  • Coordinated with internal teams to troubleshoot production issues.
  • Designed, tested, and deployed scripts for complex backup/restore procedures for critical customer data stores in the Lustre file system.
  • Ran IOR and MDTest to measure file system performance.
  • Received training and hands-on experience with Cray XC systems, HPE CMU, Cray storage, etc., from Cray/HPE.

HPC System Engineer

Centre for Development of Advanced Computing
02.2017 - 01.2021

Description:

HPC System Administrator of the IIT Delhi HPC facility (PADUM), a cluster with an Rpeak of around 2 petaflops and more than 17,000 CPU cores, along with 234 NVIDIA K40 and 40 NVIDIA V100 GPU cards.

  • Installed and configured OpenHPC (community building blocks for HPC systems).
  • Installed and configured an OpenLDAP server (integrated with Kerberos) and clients on different nodes of the testbed cluster PARAM SANGAM.
  • Installed and configured the PBS and Slurm job schedulers for job scheduling.
  • Performed hardware inspection and acceptance testing for various SBI data centers in Hyderabad and Mumbai.
  • Performed hardware inspection of the testbed cluster PARAM SANGAM.
  • Delivered a presentation on the installation and configuration of a Beowulf cluster and OpenLDAP.
  • Assisted in the design, implementation, management, and support of enterprise HPC cluster solutions in a highly complex, high-performance, low-latency environment.
  • Implemented an HPC infrastructure management script to check the load percentage on the HPC gateway server at IIT Delhi.
  • Delivered monthly lectures on Linux and HPC in workshops at IIT Delhi.
  • Created a wrapper script that calls the individual scripts responsible for budget allocation.
  • Allocated quarterly budgets to all IIT Delhi departments using the HPC facility.
  • Performed troubleshooting and root cause analysis of HPC cluster and file system issues.
  • Maintained the IIT Delhi Supercomputing website.
  • Developed customized system documentation and technical training for the customer.
  • Implemented node provisioning through HPE CMU.
  • Created Bash shell scripts for resource monitoring and system maintenance.
  • Maintained network and data security and security compliance policies on the OS.
  • Worked proactively with multiple OEMs to obtain the latest firmware updates and bug fixes and apply them to the running system with minimum downtime.
  • Ran High Performance Linpack benchmarks to check system efficiency.
  • Installed and maintained HPC applications, compilers, libraries, etc. as requested by users.
  • Implemented and maintained user quota management and resource allocation.
  • Installed and configured HPE iLO on nodes to check the hardware health of the servers.
  • Installed and configured the PBS project budgeting policy with PBS Allocation Manager.
  • Used PBS Analytics to generate graphs of the number of jobs, users, queues, nodes, etc.

Education

Post Graduate Diploma in IT Infrastructure, Systems and Security (PG-DITISS), Centre for Development of Advanced Computing

Bachelor of Technology (Information Technology), I.E.T, Dr Ram Manohar Lohia Avadh University, Ayodhya (Faizabad)

Intermediate (Mathematics, Physics & Chemistry), Bishop George School and College

SSC (Secondary School Certificate), Bishop George School and College

Skills

  • HPC
  • Slurm, PBS
  • NVIDIA GPUs
  • Linux
  • Bright Cluster Manager
  • OpenHPC with xCAT
  • InfiniBand
  • AWS
  • Kubernetes
  • Docker
  • Ansible
  • Git
  • Semaphore
  • ELK
  • Zabbix, Nagios, Ganglia
  • Bash Scripting
  • Cray XC40, XC50 Cluster
  • Lustre File System, Weka File System

Certification

  • Oracle Cloud Infrastructure Foundations 2021 Associate
  • IT Security: Defense against the digital dark arts
  • ICSI | CNSS Certified Network Security Specialist
  • Post Graduate Diploma in IT Infrastructure, Systems and Security (PG-DITISS)
  • NSE 1 Network Security Associate

Personal Information

  • Passport Number: Z62 2007
  • Date of Birth: 10/15/94
  • Gender: Male
  • Nationality: Indian

Disclaimer

I hereby declare that the above information is correct to the best of my knowledge, and I bear responsibility for the correctness of these particulars.
Rahul Kumar Jaiswal