
Rahul Kumar Jaiswal

Gurgaon, HR

Summary

  • Seeking a challenging position in a high-quality engineering environment where my experience and academic skills will add value to organizational operations.
  • 6 years of High Performance Computing administration experience.
  • Excellent knowledge of designing, prototyping and deploying HPC clusters.
  • Strong understanding of cluster resource managers, job schedulers, clusterware and GPU computing.
  • Experience in benchmarking and performance optimization of large-scale HPC systems.
  • Experience in installing and managing high performance storage and network interconnects.
  • Extensive experience in developing Linux installers for cluster software and OS deployment and automation.
  • Experience in building computer labs from the ground up, capacity planning and installing racks.
  • Extensive experience in troubleshooting Linux OS, filesystems, cluster hardware, scripting and GPU computing hardware.
  • Ability to create, maintain and implement scripts to reduce administrative effort.
  • Ability to operate in multi-platform, multi-operating-system, multi-component environments with a large number of server builds and configurations.
  • Experienced in project management, with a demonstrated ability to drive results.
  • Excellent interpersonal, communication, customer-interaction, documentation and decision-making skills.
  • Experienced in organising and delivering HPC systems workshops for users.

Overview

7 years of professional experience
1 Certification

Work History

HPC Engineer

Graviton Research Capital LLP
Gurgaon
05.2023 - Current

Description:

In my role within HPC environments, I specialize in configuring, automating, and managing complex IT infrastructures. Using tools such as Foreman, Ansible, GitLab, Semaphore, and the ELK stack, I optimize and maintain the Slurm workload manager and the Weka file system. My focus includes supporting end users and ensuring seamless operation of critical computational resources.

  • Managed node configuration and administration using Foreman.
  • Created Ansible roles for package installation, InfiniBand driver installation and configuration, Weka file system installation, and Slurm workload manager installation and configuration.
  • Automated configuration tasks post-OS installation using Ansible.
  • Implemented and configured Slurm features including local job priority management, multi-cluster setup, and enhanced job visibility with custom Slurm commands.
  • Configured GitLab for managing development tasks, including repository structures and CI/CD pipelines.
  • Installed, configured, and managed Semaphore for automating Ansible roles and integrated it with GitLab for continuous deployment and automation.
  • Installed, configured, and managed the ELK stack (Elasticsearch, Logstash, Kibana) for centralized log management and created dashboards using Metricbeat and Filebeat for system-level metrics monitoring.
  • Provided HPC user support, addressing Slurm job issues, implementing new features in Slurm, and resolving file system issues.
  • Upgraded the Weka file system to enhance performance and reliability.

Technical Lead HPC System

Tata Consultancy Services
Pune
06.2022 - 05.2023

Description:

Highly skilled HPC (High-Performance Computing) Specialist with extensive experience in configuring, managing, and optimizing HPC environments both on-premises and in the cloud. Proven expertise in deploying and maintaining complex HPC systems, including AI-enhanced clusters, with a strong focus on automation and user support. Proficient in compiling and optimizing scientific applications, managing containerized workloads, and leveraging advanced GPU technologies.

  • Configured HPC environments on AWS using AWS ParallelCluster.
  • Developed TCS's custom HPC infrastructure with integrated AI tools.
  • Compiled and optimized LAMMPS and GROMACS applications.
  • Managed Docker and Kubernetes user jobs using Rancher.
  • Managed user jobs with the SLURM job scheduler.
  • Assisted users with job failures, job scheduler issues, and Linux queries.
  • Created automation scripts for daily admin tasks and cluster management.
  • Managed AI-related HPC workloads using NVIDIA H100 DGX systems.
  • Configured NVIDIA MIG for multiple users on the same GPU card.
  • Ran the HPC Challenge benchmark to measure performance.
  • Automated benchmarking processes with Ansible.

HPC System Engineer

Hewlett Packard Enterprise
01.2021 - 05.2022

Description:

The National Centre for Medium Range Weather Forecasting (NCMRWF) is a Centre of Excellence in Weather and Climate Modelling under the Ministry of Earth Sciences. Its HPC facility, MIHIR, has an Rpeak of around 3 petaflops. It is a liquid-cooled system that provides a balanced, high-performance computing platform comprising service nodes, login nodes, compute nodes and I/O nodes, connected by Cray Aries HSN interconnects. It consists of 13 cabinets; each cabinet has 3 chassis, each chassis has 16 slots, and each slot has four nodes. In total it has 2322 nodes.

  • Maintained and managed the Cray XC40 HPC system, Cray CLS300 storage, PBS and the Spectra tape library.
  • Managed user accounts using Bright Cluster Computing.
  • Troubleshot hardware failures in the Cray servers, storage, tape library, etc.
  • Deployed the 10PF HPC system at a government site.
  • Configured and managed the Cray Storage System E1000.
  • Assisted users with queries related to HPC job failures, the job scheduler, Linux, etc.
  • Created and managed PBS queues for HPC production jobs.
  • Monitored servers and services using Nagios.
  • Performed periodic system upgrades whenever new patches were released.
  • Monitored and managed the complete HPC environment, including not only the HPC servers but also the chiller plant, batteries, DGs, etc.
  • Delivered HPC training to the new HPC users.
  • Implemented Bash scripts to automate day-to-day HPC system tasks.
  • Created weekly, monthly and yearly system status reports and submitted them to the client.
  • Coordinated with internal teams to troubleshoot production issues.
  • Designed, tested and deployed scripts for complex backup/restore procedures for critical customer data stores in the Lustre file system.
  • Ran IOR and MDTest to measure file system performance.
  • Received valuable training and hands-on experience on Cray XC systems, HPE CMU, Cray storage, etc., from Cray HPE.

HPC System Engineer

Centre for Development of Advanced Computing
02.2017 - 01.2021

Description:

HPC System Administrator of the IIT Delhi HPC facility (PADUM), a cluster with an Rpeak of around 2 petaflops, more than 17,000 CPU cores, 234 NVIDIA K40 GPU cards and 40 NVIDIA V100 GPU cards.

  • Installed and configured OpenHPC (community building blocks for HPC systems).
  • Installed and configured the OpenLDAP server (integrated with Kerberos) and clients on different nodes of the testbed cluster PARAM SANGAM.
  • Installed and configured the PBS and SLURM job schedulers for job scheduling.
  • Performed hardware inspection and acceptance testing for various SBI data centers in Hyderabad and Mumbai.
  • Performed hardware inspection of the testbed cluster PARAM SANGAM.
  • Delivered a presentation on the installation and configuration of a Beowulf cluster and OpenLDAP.
  • Assisted in the design, implementation, management and support of enterprise HPC cluster solutions in a highly complex, high-performance, low-latency environment.
  • Implemented the HPC infrastructure management script for checking the load percentage on the HPC gateway server at IIT Delhi.
  • Delivered effective lectures on Linux and HPC every month in workshops at IIT Delhi.
  • Created a wrapper script that bundles the scripts responsible for budget allocation.
  • Allocated quarterly budgets to all IIT Delhi departments using the HPC facility.
  • Performed troubleshooting and root cause analysis of HPC cluster and file system issues.
  • Maintained the IIT Delhi Supercomputing website.
  • Developed customized system documentation and technical training for the customer.
  • Implemented node provisioning through HPE CMU.
  • Created Bash shell scripts for resource monitoring and system maintenance.
  • Maintained network and data security, and maintained security compliance policies on the OS.
  • Proactively worked with multiple OEMs to obtain the latest firmware updates and bug fixes and apply them to the running system with minimum downtime.
  • Ran High Performance Linpack benchmarks to check system efficiency.
  • Installed and maintained HPC applications, compilers, libraries, etc. as per user requests.
  • Implemented and maintained the user quota management and resource allocation.
  • Installed and configured HPE iLO on nodes to check the hardware health of the servers.
  • Installed and configured the PBS project budgeting policy with PBS Allocation Manager.
  • Used PBS Analytics to generate graphs of the number of jobs, users, queues, nodes, etc.

Education

Post Graduate Diploma in IT Infrastructure Systems and Security (PG-DITISS), Centre for Development of Advanced Computing

Bachelor of Technology (Information Technology), I.E.T, Dr Ram Manohar Lohia Avadh University, Ayodhya (Faizabad)

Intermediate (Mathematics, Physics & Chemistry), Bishop George School and College

SSC (Secondary School Certificate), Bishop George School and College

Skills

    HPC

    Slurm

    PBS

    Linux

    NVIDIA GPUs

    Bright Cluster Computing

    OpenHPC

    xCAT

    CRAY Storage CLS300 and E1000

    InfiniBand

    AWS

    Kubernetes

    Docker

    Ansible

    Git

    Semaphore

    ELK

    Zabbix

    Nagios

    Ganglia

    BASH Scripting

    Cray XC40, XC50 Cluster

    Lustre File System

    Weka File System

Certification

  • Oracle Cloud Infrastructure Foundations 2021 Associate
  • IT Security: Defense against the digital dark arts
  • ICSI | CNSS Certified Network Security Specialist
  • Post Graduate Diploma in IT Infrastructure, Systems and Security (PG-DITISS)
  • NSE 1 Network Security Associate

Personal Information

  • Passport Number: Z62 2007
  • Date of Birth: 10/15/94
  • Gender: Male
  • Nationality: Indian

Disclaimer

I hereby declare that the above-mentioned information is correct to the best of my knowledge and I bear the responsibility for the correctness of the above-mentioned particulars.