Vinoth Subbiah

Chennai

Summary

SRE Manager with 21+ years of global experience across the UAE, USA, and India, leading large-scale reliability, cloud, and platform engineering initiatives for high-availability enterprise systems. Proven track record of building and mentoring SRE teams, owning 24×7 incident and security on-call operations, and acting as Incident Commander for major outages and security events.

Deep expertise in designing and operating resilient, scalable platforms across Azure, AWS, VMware, and Kubernetes/OpenShift environments, with a strong emphasis on SLO-driven engineering, error budgets, and automation-first operations. Implemented self-healing systems, automated remediation, and proactive observability to reduce MTTR and operational toil.

Strong background in DevSecOps and compliance-driven environments (PCI-DSS, GDPR), embedding security controls and automated checks into CI/CD pipelines. Experienced in performance optimization, disaster recovery, and enterprise-scale traffic management using CDN and WAF platforms. Known for aligning reliability engineering with business goals, customer experience, and sustainable platform growth.

Overview

21 years of professional experience
1 Certification

Work History

Principal Consultant

Brandsafway
Bangalore
06.2025 - Current

Senior SRE

Granicus India P Ltd
Chennai
10.2024 - 03.2025

Manager - Cloud and SRE

Emirates NBD
Dubai
08.2023 - 10.2024

Principal Consultant

Granicus - USA (Payroll in LTIMindtree)
St Paul
12.2019 - 07.2023

Senior Consultant

Flydubai (Payroll in LTIMindtree)
Dubai
01.2016 - 11.2019

Technical Manager

Mindtree
Chennai
07.2010 - 12.2015

Senior System Administrator

CMCK LLC (Payroll in HCL Ltd)
Chennai
09.2008 - 07.2010

Support Engineer

Dell (Payroll in Nebula)
Coimbatore
03.2008 - 08.2008

System Administrator

PRECISION TECHSERVE (P) Ltd
Trichy
05.2007 - 03.2008

System Administrator

Accel Frontline Ltd
Trichy
08.2004 - 05.2007

Education

Post Graduate Program - DevOps

Purdue University
Online
04-2023

Diploma in Computer Technology - Computer Science

Srinivasa Polytechnic College
Tiruchirappalli, India
04-2004

Bachelor of Computer Applications

Amity University
UP

Skills

  • Cloud, Platform & Infrastructure: Azure, AWS, VMware; Linux & Windows platforms; Hybrid & cloud-native architecture; Capacity planning & cost-aware reliability
  • SRE Strategy, Governance & Risk Ownership: SRE strategy & execution; Reliability vision & roadmap; Platform risk assessment; SLOs, SLIs & error budgets; Preventive reliability engineering; Toil elimination & MTTR reduction
  • Site Reliability Engineering & Automation: Self-healing systems; Automated remediation; Infrastructure as Code (Terraform, ARM, CloudFormation); Reliability guardrails & standards
  • CI/CD, DevSecOps & Platform Enablement: Jenkins, GitLab CI/CD, Azure DevOps; Automated deployments; Shift-left security & compliance controls; Secure, scalable delivery pipelines
  • Containers, Kubernetes & Platform Engineering: Docker, Kubernetes, OpenShift; Helm; GitOps practices; Platform standardization & golden paths
  • Observability, Incident & On-Call Management: Enterprise observability (metrics, logs, traces); Prometheus, Grafana, ELK, Datadog, AppDynamics, New Relic; Alerting & incident response (PagerDuty, Pingdom); Incident Command, RCA & learning culture
  • Networking, Traffic & Resilience: Load balancing & traffic management; CDN & WAF (Akamai, F5, Imperva); DDoS mitigation; High-availability & resilience design
  • Programming & Engineering Leadership: Python, Bash; Code review for reliability & performance; Latency optimization; Resilience patterns & failure analysis
  • Messaging, Data & Caching: Kafka, RabbitMQ, IBM MQ; AWS SQS/SNS, Azure Service Bus; Oracle, MSSQL, MongoDB; Redis & in-memory caching strategies
  • Executive Communication, Leadership & Governance: Engineering leadership; Vendor & stakeholder management; Executive-level reliability reporting; Incident & escalation governance; Jira, GitHub, GitLab

Core Experience

  • Cloud & Platform Engineering: Owned the design and evolution of hybrid and cloud-native platforms across Azure, AWS, and VMware, supporting mission-critical, high-traffic workloads. Established production-grade Kubernetes and OpenShift standards (Helm, GitOps) to ensure scalability, resilience, and operational consistency.
  • Site Reliability Engineering & Automation: Led SRE strategy with a strong focus on reliability, error budgets, and operational excellence. Drove self-healing architectures and automated remediation to reduce MTTR and eliminate toil using Jenkins, GitLab, Terraform, Ansible, and Python.
  • Observability, Incident & On-Call Management: Defined observability strategy across 300+ microservices, governing SLIs, SLOs, and alerting aligned to business impact. Oversaw 24×7 on-call operations and acted as Incident Commander for major production outages, leading RCA and cross-functional resolution.
  • Security Operations & Disaster Recovery: Partnered with Security teams to integrate reliability and security operations, supporting PCI-DSS and GDPR compliance. Led regional DR design and testing using IaC and automation, and supported WAF, DDoS mitigation, and security incident response.
  • CI/CD, DevSecOps & Platform Enablement: Owned end-to-end CI/CD strategy, embedding IaC and automated quality and security controls into delivery pipelines. Enabled engineering teams with standardized, scalable deployment frameworks that improved release velocity and stability.
  • Networking, Performance & Resilience Engineering: Provided architectural oversight for enterprise networking, traffic management, CDN, and WAF platforms (F5, Akamai, Imperva). Led performance and resilience optimization initiatives, including reviewing critical paths, reducing p95/p99 latency, and enforcing resilience patterns under peak load.
  • Leadership, Org Design & Cost Governance: Built and led globally distributed SRE and platform teams with clear ownership models and on-call structures. Owned capacity planning and reliability investment decisions, balancing availability, performance, and cost efficiency. Acted as a trusted partner to Engineering, Product, Security, and Business leadership to align reliability with customer experience and growth.

Projects and Contributions

Enterprise Observability Platform Implementation: Led the design and rollout of a centralized observability platform spanning metrics, logs, traces, and alerts across hundreds of services. Standardized SLIs, SLOs, dashboards, and alerting policies, enabling proactive issue detection and significantly improving incident response and operational visibility across teams.

SLO-Driven Reliability Governance & Error Budgets: Introduced SLO- and error-budget-driven reliability governance to enable data-driven trade-offs between feature velocity and system stability. Embedded error budgets into release decisions, reducing repeat production incidents and improving overall service reliability.

Incident Management, On-Call Health & Alert Quality Optimization: Designed and governed 24×7 incident management and on-call practices, defining clear escalation paths, incident command roles, and RCA standards. Improved alert quality and reduced on-call fatigue, strengthening response effectiveness during major production incidents.

AI-Assisted Alert Noise Reduction & On-Call Optimization: Led adoption of AI-assisted anomaly detection and alert correlation to reduce false positives and alert noise. Improved signal-to-noise ratio, accelerated incident triage, and enabled engineers to focus on high-impact reliability issues.

DevSecOps & Shift-Left Reliability Enablement: Owned the integration of automated security and quality controls into CI/CD pipelines, embedding reliability and compliance checks early in the delivery lifecycle. Reduced production risk while maintaining delivery velocity in regulated environments.

Certification

  • AWS Certified Solutions Architect - Associate
  • AWS Certified SysOps Administrator
  • Microsoft Azure Administrator Associate
  • Microsoft Azure Architect Design
  • Microsoft Azure Infrastructure Solutions
  • VMware Certified Professional
  • Red Hat Certified Specialist in OpenShift Application Development
  • HashiCorp Certified: Terraform Associate
  • Certified Kubernetes Administrator (CKA)
  • LogicMonitor Certified Associate
  • PagerDuty Certified Foundational
  • PagerDuty Certified API Specialty
  • PagerDuty Certified Incident Responder
  • New Relic Full Stack Certified
  • Python Administrator
  • Microsoft Certified Technology Specialist
  • Microsoft Certified IT Professional
  • Microsoft Certified Professional
  • Microsoft Certified Implementing
  • Virtualization with Windows Server Hyper-V and System Center
  • Microsoft Certified Server
