Mudit Mathur

Engineering Leader, Platform / SRE
Kadubeesanahalli, Bangalore

Summary

Strategic, customer-focused platform engineering leader with a passion for architecture, engineering, and operations. Proven track record of building highly reliable platforms and reducing toil for product engineering teams; dedicated to delivering a paved path that teams can consume to accelerate adoption of modernized platforms.

Successfully managed teams of 20+ individuals across regions, including senior managers, principal engineers, and SDEs, and consistently delivered outcomes with a data-driven approach. A focus on improving customer satisfaction and reducing incident resolution time has resulted in increased platform uptime and improved user experience.

Overview

16 years of professional experience
4 years of post-secondary education

Work History

Engineering Lead, Platform Engineering / SRE

Twilio
06.2023 - Current

As SRE Leader:

  • Led the Twilio SRE Summit to drive focus on migrating the Twilio service stack to EKS clusters, with the vision of becoming regional native and adopting standards that reduce app developers' cognitive load by ensuring service code runs independently of the region, moving Twilio towards its goal of being regionally available for its customers.
  • Driving the vision and strategy for SRE: defining clear roles and responsibilities across Platform and SRE, reducing overlap of work between engineers in the two groups, and giving them the opportunity to collaborate effectively.
  • Driving an AI mindset through various initiatives that empower SRE engineers to embrace AI.

Reducing Engineering cost of onboarding services to EKS Clusters:

  • Delivered a Namespace as a Service solution to improve DevEx for app developers, letting them seamlessly configure their namespaces across our EKS clusters and manage resource quotas and access control (for SOX / HIPAA compliance) through a CRD-based orchestration layer using a GitOps flow. This reduced the toil on app developers to provision and manage their namespaces by more than 90% (from days to 10 minutes).

SRE AI Initiative:

  • Using an AWS Bedrock LLM to support app developers through our AI bot, with the model trained on a knowledge base covering internal product docs, wikis, and insights from Slack histories to provide responses that previously required expert individuals. This helped improve the productivity of platform engineers by more than 30%; they previously spent that time responding to developer queries or helping engineers find the right information across the documentation.
  • Using AWS SageMaker, building analytics from CloudTrail logs to identify usage patterns of AWS resources across thousands of AWS accounts, improving compliance, reducing the AWS footprint, and cutting cost by decommissioning orphaned resources. Integrating the data with Superset to build meaningful insights.

IAC Initiative:

  • Built a tool called Terraform Resource Importer, which generates the Terraform code and workspaces for manually managed resources and transitions them into modularized Terraform managed by individual teams.
  • Built a validation and deployment orchestration engine called Regional Builder (RB) that inspects the existing deployment (Helm) of a service running in EKS, deploys it across multiple regions, and then runs 50+ validations to ensure the app code is regionally compliant and the entire service fleet can be deployed into a new region with the click of a button. It integrates seamlessly with the infra stack, including EKS registry, Buildkite, ArgoCD, Datadog, Istio, and Backstage.
  • Currently building a single deployment pipeline for safer and faster deployments that can progressively deploy app workloads across multiple regions and clusters using Harness (also evaluating Spinnaker as an alternative).

Sr. Engineering Manager, SRE (Flex)

Twilio
03.2022 - 06.2023
  • Spearheaded the setup of a new Platform Engineering Team in India focused on Infrastructure as Code (IAC) and Site Reliability Engineering (SRE), leading to a 20% reduction in toil and cognitive load for engineering teams within the first 6 months of operation.
  • Developed and implemented a DevOps culture within Twilio's CXP organization by paving the path towards an open stack framework, resulting in a 30% increase in customer feature delivery over the course of 12 months.
  • Led the migration of self-managed MySQL clusters to Aurora RDS clusters with zero downtime, resulting in a 50% reduction in manual maintenance tasks for the engineering team and improving database availability by 20%, moving towards the goal of a 99.99% SLA. Reduced response time from 20 ms to 6 ms for all API calls.
  • Created a migration path to transition Propertystore/Hazelcast clusters to DynamoDB, resulting in a 40% improvement in database performance and a 15% decrease in costs for database maintenance.
  • Developed shared Terraform modules that streamlined cloud infrastructure setup and included built-in observability metrics, reducing infrastructure setup time by 30%.
  • Provided guidance to teams to define SLI metrics for stateless services and configured SLOs and error budgets for 200 services using Terraform, resulting in a 25% decrease in service downtime and a 10% increase in customer satisfaction.
  • Built undifferentiated SLOs for services and defined the RAG status for groups of services using standard SLI metrics and thresholds, building better visibility and monitoring capabilities with Datadog APM.
  • Built a synthetic test framework to report the uptime of a service and define external SLAs for customers.
  • Established a Post-Incident Review (PIR) process for the engineering team, reducing Mean Time to Detect (MTTD) by 50% and improving incident response time by 40%.
  • Automated runbooks for infrastructure alerts such as host replacement and EC2 instance scaling using Buildkite and ArgoCD, leading to a 30% reduction in manual intervention and a 20% increase in system availability.
  • Working towards a transition path from ECS to EKS that cuts over production traffic with zero downtime.
  • Building multi-cloud integration with core services running in AWS and AI functionalities leveraged through Google CCAI Services.

Data Platform Engineering Leader (Reliability)

Cisco Systems India Pvt Ltd
09.2021 - 03.2022
  • Led the Data Platform team of 20+ individuals including engineers, leads, and architects to deliver and execute on data services for Webex Contact Center customers while adhering to Agile practices for continuous delivery, resulting in a 25% increase in customer satisfaction ratings.
  • Collaborated with cross-functional Engineering leaders to deliver solution-level features and ensure timely delivery, resulting in a 30% increase in feature delivery rate within the first 6 months of tenure.
  • Worked closely with Product Management team to prioritize backlog and define quarterly execution plan, leading to a 20% increase in on-time delivery of customer-requested features.
  • Streamlined engineering interruptions to ensure maximum productivity, resulting in a 15% reduction in downtime due to interruptions and a 10% increase in engineer productivity.
  • Successfully led the migration of customer data from old cloud platforms to new technology stack, resulting in a 50% reduction in data migration time and a 20% decrease in data errors during migration.
  • Defined and groomed Reliability backlog with architects to ensure 99.99% uptime for data services and ability to handle concurrent load of 15k agents with alerting and monitoring in place, resulting in a 40% decrease in service downtime and a 25% improvement in data availability.

SRE Leader, Platform Engineering and Operations

Cisco Systems India Pvt Ltd
01.2021 - 09.2021
  • Formed and managed team of 6 Incident Commanders for proactive and reactive incident handling, root cause management, and platform uptime of Contact Center Platform, resulting in 60% decrease in initial incident response time.
  • Managed 5 DevOps engineers and vendor relationships with Amazon and Google for Contact Center Platform and AI services.
  • Developed workflows, dashboards, and 100+ alerts for monitoring and alerting using Prometheus, Kibana, and PagerDuty. This effort resulted in 20% improvement in platform uptime, reducing downtime for customers and improving overall service availability. Additionally, incident resolution time was reduced by 15%, leading to faster issue resolution and improved customer satisfaction.
  • Defined a Root Cause Analysis framework and created 50+ playbooks based on learnings from postmortem calls to prevent repeat incidents and improve change management.
  • Implemented enhanced Change Advisory Board (CAB) review process and streamlined CI/CD pipelines and K8s Deployment using Spinnaker and Jenkins, resulting in 25% reduction in code delivery time and 30% increase in successful deployments from non-prod to prod-environment. These improvements led to 15% increase in customer satisfaction ratings due to faster delivery of high-quality code.

Sr. Manager, Customer Success (Operations)

Cisco Systems India Pvt Ltd
02.2015 - 01.2021
  • Part of the Cisco Customer Experience Engineering group, leading a team of 20+ highly experienced engineers and technical leads, with key responsibility for providing Level 4 technical support for the Webex Contact Center solution (Premise + Cloud).
  • Built the WxCC Escalation team from the ground up, scaling from 2 to 7 escalations handled per engineer per day. With the same OpEx cost (headcount), case volume handled grew from 864 to 1274 cases per quarter.
  • Worked with teams to bring MTTR down month over month from 10 days to 3 days for urgent cases and from 14.5 days to 2.3 days for non-urgent cases, exceeding the goals of 5 days and 8 days for urgent and non-urgent escalations respectively.
  • Worked closely with TAC to reduce the quarterly escalation rate from 37% to 17.5% for WxCC and from 7% to 1.4% for Premise (UCCX).
  • Focused on identifying serviceability gaps with TAC; teams opened 50+ JIRA defects to improve logging for better troubleshooting.
  • Organized and participated in Gorilla Testing (GT) on staging environments, identifying 200+ JIRA defects that could have potentially impacted customers in production.
  • Using analytics tools like Kibana and Tableau, built dashboards to provide insights into improvement areas and brought the existing-defect escalation rate down from 53% to 27% and high-solution-complexity issues from 34% to 12.5%, ensuring only product-related escalations go from TAC to the BU.
  • Built a single escalation platform for TAC engineers to escalate cases to Cisco internal teams or OEM partners like Google, Calabrio, and Acqueon, which helped TAC engineers reduce their First Acknowledgement time from 4 days to 24 hours.
  • Built a SWAT team for the WxCC geo rollout that worked closely with the Engineering team to validate the end-to-end solution with OEM call flows and provide go-live sign-off in 30 days from the time the application was deployed in a DC.
  • Created a migration team focused on migrating premise customer use cases to cloud. This team successfully migrated customer call flows, scripts, integrations, and desktop integrations in 7 days from discovery to go-live.
  • Created Webex bots to automate trivial escalations such as backend config changes on legacy platforms, which helped TAC reduce MTTC for config change cases from 7 days to 6 hours.
  • Built various other Webex bots covering incident updates, long-open JIRA support tickets, escalations opened by non-TAC functional teams with support teams, recent deployment changes, etc., helping TAC / escalation engineers get better insight into recent changes and handle cases better.
  • Built a career growth plan and promotion pipeline for the team to ensure engineers grow personally and professionally.
  • Active CiscoLive speaker; delivered key technical sessions: Single Sign On at CiscoLive, Barcelona | Best Practices for Upgrading Enterprise Contact Centers at CiscoLive, Orlando. Presented on multiple contact center topics, with hands-on labs, at Cisco partner summits organized in Singapore, Amsterdam, Boxborough, and Bangalore.

Lead - Network Consulting Engineer

Cisco
01.2013 - 01.2015
  • Managed Services / Technical Consulting.
  • SME in designing contact center solutions, providing detailed architecture for APJ partners / customers that caters to their business needs and delivers measurable outcomes for business growth.
  • Led technical engagements with BFSI / enterprise customers in India with an average size of 2000+ agent seats, such as VIL (7500 agent seats), Infosys (multiple contact centers with 5000 agent seats), TCL, HDFC (3500 concurrent outbound agent seats), SBI, Kotak, Accenture (7500 agent seats, with agents across multiple geo locations), and DBS (3500 agents across Singapore, Philippines, Hong Kong, and India).
  • Worked closely with Cisco partners to deliver solution design consulting for Contact Center, including integration with third-party applications and software.
  • Was part of Cisco's Contact Center Tiger Team, resolving the most highly escalated customer cases (CEO-level escalation or CAP). Helped customers like DBS, ICBC, Accenture, HDFC, Kotak Mahindra Bank, Infosys, Ministry of Manpower, Mandiri Bank, Bangkok Bank, and VIL come out of CAP and be successful with Cisco Contact Center.
  • Provided Day 0 / Day 1 / Day 2 support in delivering complex projects with strict timelines.
  • Prepared and delivered Root Cause Analysis (RCA) to C-Level Executives from Partners / Customers.

Customer Support Engineer

Cisco
01.2010 - 01.2013
  • Provided structured methodologies for log collection and the critical inputs TAC requires from Cisco partners managing contact centers, such as ANI, DNIS, duration, and timestamp (from agents), plus playbooks for log collection across applications and browsers, thereby reducing log collection iterations to 1.
  • Wrote troubleshooting wikis to help new TAC engineers quickly identify problems and to build the TAC knowledge base. Wrote 50+ wikis and was nominated as a top contributor to the Cisco knowledge base.
  • Represented TAC on CAP cases, handling 20+ CAP cases for the EU / APAC / US regions onsite and reducing log collection and TAC turnaround time from 2 days to 2-3 hours, which significantly helped customers restore their services and find quicker resolutions or set up acceptable workarounds.
  • Handled customers with call volumes above 30 cps, 3000+ concurrent agents, and 24x7 operations.
  • Maintained NPS above 4.85 and CSAT at 4.95.
  • Delivered technical trainings (TOIs) to new TAC engineers.

NOC Engineer

HCL Limited
01.2009 - 01.2010
  • Supported Nortel switches and provided operational support to contact center agents.
  • Ensured service uptime of 99.99%.
  • Monitored and maintained network and software components according to established guidelines and best practices.
  • Provided regular status updates to customers regarding open tickets.
  • Defined enterprise processes and best practices and tailored enterprise processes for applications.

Education

Bachelor of Engineering - IT-Honors

University of Rajasthan
Jaipur
01.2005 - 04.2009

Skills

Acquisition and Integration
