18+ years of experience in Engineering, Gen AI, Site Reliability Engineering, Cloud Native Infrastructure & Software Development, DevOps & DevSecOps.
Experienced in Budgeting, Developer Productivity Enhancement, Production Operations, Incident Management, Infrastructure Support & Customer Relations.
Conducted software security audits such as SOC 2 and FedRAMP.
Culture ambassador and servant leader who guides the team based on core values.
Excellent stakeholder management and cross-team, cross-geography collaboration.
Strong prioritization skills: handled multiple projects simultaneously, with a passion for learning.
Built high-performance teams in multiple high-growth technology companies across different locations.
As a customer advocate, played a key role in fostering a customer-obsessed culture.
Quality advocate focused on building robust platforms and large-scale systems with an emphasis on quality, reliability, scalability, maintainability, and supportability practices.
Overview
19 years of professional experience
1 Certification
Work History
Director, Site Reliability Engineering
Precisely
12.2021 - Current
Global Head of Site Reliability Engineering, leading a team spread across India, the US and Canada and reporting to the VP of SaaS
Team consists of infrastructure developers, AI developers, architects, site reliability engineers and DevSecOps engineers, responsible for ensuring the reliability and stability of the Enterprise SaaS offering hosted by Precisely
Built the Cloud Native Infra for DI Suite & Automate Studio Manager
Handled budgeting and cost reduction for AWS, Datadog, MongoDB, etc.
Reduced cloud layer costs and other IT infrastructure costs, resulting in an annual savings of $1 million.
Built a Gen AI-based chatbot for enhancing developer productivity and reducing SRE toil
Built a Gen AI-based incident management and response system to handle production systems
Strong experience with AWS, Kubernetes, LLMs, Mistral AI, pgvector, Terraform, MongoDB and Kafka
Introduced Datadog as a single-pane-of-glass observability solution.
Introduced cloud native best practices for software development and released 180+ microservices, focusing on state-of-the-art DevOps and DevSecOps practices, including AWS, EKS, Docker, Helm, Datadog, Concourse, Argo CD, MongoDB, Snowflake, Databricks, Prometheus, Sumo Logic, Splunk, ELK, Terraform, Cloudability, Apptio, Prisma Cloud and Temporal Cloud
Built dashboards using Tableau, Datadog, Heap and Google Analytics data
Coached and mentored developers from a legacy software development background to adopt cloud native ways of software development by breaking down monoliths into microservices, defining SLIs and SLOs for microservices, building logging, and implementing telemetry and tracing
Introduced managed Kafka services and built automation for services to subscribe to Kafka
Introduced Terraform, MongoDB, Prisma Cloud (Twistlock) and Web Application Firewall (WAF)
Worked with Legal to draft the SLA for DI Suite
Implemented a 24x7 on-call process for SRE and set up an Incident Management Process
Built a Disaster Recovery Plan and got the product SOC 2 certified
Currently working towards FedRAMP.
Senior Manager, Site Reliability Engineering
QLIK
06.2016 - 12.2021
Developed the Site Reliability Engineering Practice at Qlik based on Google’s SRE pyramid
One of the first adopters ever in the industry
Managed a cross-geography Site Reliability Engineering team of 40 (APAC, EMEA and NA), reporting to the Sr Director, Site Reliability Engineering at Qlik
Culture ambassador who drove Qlik’s Core Values by example, in actions and interactions with customers and within the organisation
Drove the Developer Productivity Initiative at Qlik
Achieved SOC 2 and FedRAMP Moderate certification for the product
Cost management initiative for all infra components within R&D
Budgeting, including forecasting, contract negotiation, cost optimization and cost control for tooling and cloud platform
Designed, developed and implemented the tools and automation needed to keep the lights on for Qlik’s Enterprise SaaS offering
Pillars of focus: Observability, Scalability, Security, Cost, Reliability & Performance
Envisioned, designed and implemented an Incident Management process using PagerDuty, Alertmanager and Prometheus
Strategized, developed and implemented 24x7 on-call for the first time in the history of Qlik R&D
Drove launch coordination, including building and maintaining CI/CD pipelines and Artifactory, and designing and implementing the metrics stack for microservices entering stage and production
Instrumental in building a customer-obsessed culture within R&D at Qlik by bridging cross-functional gaps between Support and R&D
Contributed significantly towards building a Customer First team
Gathered insights on customer usage metrics of the product and worked with product management to advocate for customer needs around product functionality
Hands-on experience building Qlik Sense Apps for measuring KPIs and customer usage
Support Lifecycle Management: Worked with customers and customer-facing teams on cases escalated from product support, identified bugs and defects, and worked closely with developers to fix them, owning the complete lifecycle of support cases (Tools Used: ServiceNow, JIRA, Salesforce)
Built a SaaS Bug Triage Workflow for escalating bugs from Support to R&D at senior management level
Integrated various tools used within and outside R&D (Jira, Salesforce, ServiceNow) and built workflows across them
Automation using Bash and Python
Vendor management for new and existing tools (e.g. Sumo Logic, Cloudwiry, Twistlock, JFrog)
Hiring, mentoring, coaching and performance management.
Line Manager / Deputy Manager
Volvo IT
12.2013 - 06.2016
Managed a team of 30 people (VMware/OSD/SBC)
Drove monthly KPIs (reduction of transfer rate and of lead times) and the weekly SLA trend
Drove towards Continuous Improvement as part of VPS4IT, CATS compliance
Architected new VMware solutions and served as SPOC for all escalated global VMware issues
Reduced incidents through automation and drove VPS4IT for the team (Lean Management system)
Hiring, coaching and mentoring of team members
Drove team meetings, including huddles, problem-solving sessions and service review meetings
Documentation: BPDs, runbooks, TCRP documents and process flow documents