Software Engineer with ~9 years of experience in backend and distributed systems development. Expertise in cloud-hosted databases, distributed computing, PostgreSQL infrastructure, Docker containers,
optimizing DB queries, designing REST APIs, and Linux kernel & server OS/security upgrades. Hands-on experience with the Kafka messaging
platform & NoSQL. Experienced in Java backend development using Spring Boot. Proficient in developing and maintaining machine learning
pipelines using Apache Spark and MongoDB, with considerable experience in NLP solutions and recommendation systems.
Time Machine Health Engine:
Implemented an engine that runs periodically on a Nutanix server to evaluate the health of the backup systems for 5000 databases every 5 minutes. The engine detects gaps in the recoverability timeline and raises alerts for databases that are down or in a critical state. This helps customers detect issues with their databases in near real time and address them quickly.
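A minimal sketch of the kind of gap check described above, assuming the recoverability timeline is available as a list of snapshot timestamps. The function name, the input shape, and the threshold are illustrative assumptions, not the production engine:

```python
from datetime import datetime, timedelta


def find_timeline_gaps(snapshot_times, max_interval):
    """Return (previous, next) pairs of consecutive snapshots that are
    further apart than max_interval. Hypothetical helper for illustration."""
    ordered = sorted(snapshot_times)
    gaps = []
    for previous, following in zip(ordered, ordered[1:]):
        if following - previous > max_interval:
            gaps.append((previous, following))
    return gaps


# Example: snapshots at 0, 5, 10, 30 and 35 minutes past a base time.
base = datetime(2024, 1, 1, 0, 0)
snapshots = [base + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)]
gaps = find_timeline_gaps(snapshots, max_interval=timedelta(minutes=10))
```

Here the only interval longer than the assumed 10-minute threshold is the one between the 10-minute and 30-minute snapshots, which is what an alert would be raised for.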
RCA Engine: Implemented an LLM-based root cause analysis pipeline for NDB backup and recovery systems.
Involved fine-tuning an LLM on existing RCA data from JIRA, SRE documentation, etc. The pipeline ingests VM logs, preprocesses the text, and uses a local fine-tuned model to suggest a root cause for the failure event, reducing MTTR by 40% and giving SREs and DBAs quick, actionable explanations without digging through complex logs.
MongoDB Sharded Cluster Backup/Recovery:
Redesigned the existing backup/recovery feature to support periodic backups of a MongoDB sharded cluster every 10 minutes through a scheduler engine, to synchronize the backup schedule and timeline across the clusters, and to recover failed nodes and resynchronize data across the cluster, while still allowing scale-up/scale-down operations.
Company Overview: Microsoft Azure
RolloutManagerV2: Developed a scalable, distributed Upgrade Management Service that
schedules & executes monthly upgrade operations on ~100k Azure PostgreSQL servers. It
executes container & OS package upgrades on the acquired servers, redirects logs to the
telemetry storage pipeline, calculates upgrade metrics (Grafana dashboards), and distributes
load across worker nodes while remaining fault tolerant.
OS upgrades for Azure servers: Developed an automated, scalable finite state machine that
detects outdated Linux packages (Debian or RPM) on Azure VMs and upgrades them to the latest
version, which is critical for fixing bugs, addressing security issues, and introducing new features.
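A minimal sketch of how such an upgrade state machine can be structured. The state names, the transition table, and the handler interface are hypothetical and chosen only for illustration; they are not the states of the actual service:

```python
from enum import Enum, auto


class UpgradeState(Enum):
    # Hypothetical states; the real state machine is not described here.
    DETECT = auto()
    INSTALL = auto()
    VERIFY = auto()
    DONE = auto()
    FAILED = auto()


# Assumed happy-path transitions; any failing step moves to FAILED.
NEXT_STATE = {
    UpgradeState.DETECT: UpgradeState.INSTALL,
    UpgradeState.INSTALL: UpgradeState.VERIFY,
    UpgradeState.VERIFY: UpgradeState.DONE,
}

TERMINAL = {UpgradeState.DONE, UpgradeState.FAILED}


def run_upgrade(handlers):
    """Drive the machine from DETECT to a terminal state.

    handlers maps each non-terminal state to a callable returning True
    on success. Returns the list of states visited, in order.
    """
    state = UpgradeState.DETECT
    history = [state]
    while state not in TERMINAL:
        succeeded = handlers[state]()
        state = NEXT_STATE[state] if succeeded else UpgradeState.FAILED
        history.append(state)
    return history
```

Keeping the transitions in a table rather than in nested conditionals makes it straightforward to add a step or to resume a VM from its last recorded state.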
Optimize OS upgrades: Developed a feature to pre-download Debian packages to the VMs, which
drastically reduced OS upgrade execution time and cut the overall upgrade run time by
50%. Stabilized the code to handle major Linux kernel upgrades on VMs running a wide range of
kernel versions. Leveraged needrestart to identify & restart affected services post
upgrade & avoid a reboot, reducing avg. downtime from ~3.5 mins to 2 mins.
PostgreSQL Image Upgrade: Designed & implemented a feature to upgrade the PG container
running on the server. The process involves safely shutting down the PostgreSQL engine on Azure
VMs while handling WAL upload, DB checkpoints, standby sync, etc., safely terminating PG
background jobs, and then starting the new container. This is critical to guaranteeing data consistency
and minimizing PG recovery time, which reduces DB downtime for customers during maintenance.
ML COURSE RECOMMENDATIONS: Integrated and worked on an end-to-end machine learning
pipeline for a Course Recommendation system that uses Neural Collaborative Filtering and
Transformer models. The pipeline is based on Apache Spark and uses the NoSQL database MongoDB.
Integrated the Topic Matching ML algorithm, based on LDA and LSTMs, with the main Java app
LMS. Gained hands-on knowledge of TensorFlow, PyTorch, and Keras, while specializing in
implementing/training NLP models.
Created the output REST APIs for the Course Recommendation/Topic Matching engine, consumed by
the LMS web app, and developed REST API calls to return user feedback to the ML engine.
Participated in the complete software development lifecycle, including system design, performance
analysis, development, and testing of the product 'Learning Management System', a Java EE and
cloud-based multi-tenant architecture application. Gained experience with Docker, Tomcat, and the
backend frameworks Spring and iBATIS.
Optimized and refactored complex SQL queries for high DB load by leveraging the advantages
of a column-oriented in-memory database, making SQL query execution up to 4 times faster
in specific cases.
Performed index refactoring to greatly limit DB RAM usage, in preparation for the app
switching platforms from SAP cloud to hyperscalers.
Optimized the app's background data processing by implementing a v2 batching framework for high DB
load background jobs.
Caching through Kafka: Improved the legacy cluster messaging module (which used message
logging via DB tables to synchronize caches between nodes) by refactoring it to use Apache
Kafka.
Algorithm: https://doi.org/10.47750/pnr.2022.13.S04.202