Vinoth Subbiah

Chennai

Summary

SRE Manager with 21+ years of global experience across the UAE, USA, and India, leading large-scale reliability, cloud, and platform engineering initiatives for high-availability enterprise systems. Proven track record of building and mentoring SRE teams, owning 24×7 incident and security on-call operations, and acting as Incident Commander for major outages and security events.

Deep expertise in designing and operating resilient, scalable platforms across Azure, AWS, VMware, and Kubernetes/OpenShift environments, with a strong emphasis on SLO-driven engineering, error budgets, and automation-first operations. Implemented self-healing systems, automated remediation, and proactive observability to reduce MTTR and operational toil.

Strong background in DevSecOps and compliance-driven environments (PCI-DSS, GDPR), embedding security controls and automated checks into CI/CD pipelines. Experienced in performance optimization, disaster recovery, and enterprise-scale traffic management using CDN and WAF platforms. Known for aligning reliability engineering with business goals, customer experience, and sustainable platform growth.

Work History

Principal Consultant

Brandsafway

Bangalore

06.2025 - Current

Senior SRE

Granicus India P Ltd

Chennai

10.2024 - 03.2025

Manager - Cloud and SRE

Emirates NBD

Dubai

08.2023 - 10.2024

Principal Consultant

Granicus - USA (Payroll in LTIMindtree)

St Paul

12.2019 - 07.2023

Senior Consultant

Flydubai (Payroll in LTIMindtree)

Dubai

01.2016 - 11.2019

Technical Manager

Mindtree

Chennai

07.2010 - 12.2015

Senior System Administrator

CMCK LLC (Payroll in HCL Ltd)

Chennai

09.2008 - 07.2010

Support Engineer

Dell (Payroll in Nebula)

Coimbatore

03.2008 - 08.2008

System Administrator

PRECISION TECHSERVE (P) Ltd

Trichy

05.2007 - 03.2008

System Administrator

Accel Frontline Ltd

Trichy

08.2004 - 05.2007

Education

Post Graduate Program - DevOps

Purdue University

Online

04-2023

Diploma in Computer Technology - Computer Science

Srinivasa Polytechnic College

Tiruchirappalli, India

04-2004

Bachelor of Computer Applications

Amity University

Skills

Cloud, Platform & Infrastructure: Azure, AWS, VMware; Linux & Windows platforms; hybrid & cloud-native architecture; capacity planning & cost-aware reliability
SRE Strategy, Governance & Risk Ownership: SRE strategy & execution; reliability vision & roadmap; platform risk assessment; SLOs, SLIs & error budgets; preventive reliability engineering; toil elimination & MTTR reduction
Site Reliability Engineering & Automation: Self-healing systems; automated remediation; Infrastructure as Code (Terraform, ARM, CloudFormation); reliability guardrails & standards
CI/CD, DevSecOps & Platform Enablement: Jenkins, GitLab CI/CD, Azure DevOps; automated deployments; shift-left security & compliance controls; secure, scalable delivery pipelines
Containers, Kubernetes & Platform Engineering: Docker, Kubernetes, OpenShift; Helm; GitOps practices; platform standardization & golden paths

Observability, Incident & On-Call Management: Enterprise observability (metrics, logs, traces); Prometheus, Grafana, ELK, Datadog, AppDynamics, New Relic; alerting & incident response (PagerDuty, Pingdom); Incident Command, RCA & learning culture
Networking, Traffic & Resilience: Load balancing & traffic management; CDN & WAF (Akamai, F5, Imperva); DDoS mitigation; high-availability & resilience design
Programming & Engineering Leadership: Python, Bash; code review for reliability & performance; latency optimization; resilience patterns & failure analysis
Messaging, Data & Caching: Kafka, RabbitMQ, IBM MQ; AWS SQS/SNS, Azure Service Bus; Oracle, MSSQL, MongoDB; Redis & in-memory caching strategies
Executive Communication, Leadership & Governance: Engineering leadership; vendor & stakeholder management; executive-level reliability reporting; incident & escalation governance; Jira, GitHub, GitLab

Core Experience

Cloud & Platform Engineering: Owned the design and evolution of hybrid and cloud-native platforms across Azure, AWS, and VMware, supporting mission-critical, high-traffic workloads. Established production-grade Kubernetes and OpenShift standards (Helm, GitOps) to ensure scalability, resilience, and operational consistency.
Site Reliability Engineering & Automation: Led SRE strategy with a strong focus on reliability, error budgets, and operational excellence. Drove self-healing architectures and automated remediation to reduce MTTR and eliminate toil using Jenkins, GitLab, Terraform, Ansible, and Python.
Observability, Incident & On-Call Management: Defined observability strategy across 300+ microservices, governing SLIs, SLOs, and alerting aligned to business impact. Oversaw 24×7 on-call operations and acted as Incident Commander for major production outages, leading RCA and cross-functional resolution.
Security Operations & Disaster Recovery: Partnered with Security teams to integrate reliability and security operations, supporting PCI-DSS and GDPR compliance. Led regional DR design and testing using IaC and automation, and supported WAF, DDoS mitigation, and security incident response.
CI/CD, DevSecOps & Platform Enablement: Owned end-to-end CI/CD strategy, embedding IaC and automated quality and security controls into delivery pipelines. Enabled engineering teams with standardized, scalable deployment frameworks that improved release velocity and stability.
Networking, Performance & Resilience Engineering: Provided architectural oversight for enterprise networking, traffic management, CDN, and WAF platforms (F5, Akamai, Imperva). Led performance and resilience optimization initiatives: reviewing critical paths, reducing p95/p99 latency, and enforcing resilience patterns under peak load.
Leadership, Org Design & Cost Governance: Built and led globally distributed SRE and platform teams with clear ownership models and on-call structures. Owned capacity planning and reliability investment decisions, balancing availability, performance, and cost efficiency. Acted as a trusted partner to Engineering, Product, Security, and Business leadership to align reliability with customer experience and growth.

Projects and Contributions

Enterprise Observability Platform Implementation: Led the design and rollout of a centralized observability platform spanning metrics, logs, traces, and alerts across hundreds of services. Standardized SLIs, SLOs, dashboards, and alerting policies, enabling proactive issue detection and significantly improving incident response and operational visibility across teams.

SLO-Driven Reliability Governance & Error Budgets: Introduced SLO-driven and error-budget-driven reliability governance to enable data-driven trade-offs between feature velocity and system stability. Embedded error budgets into release decisions, reducing repeat production incidents and improving overall service reliability.

Incident Management, On-Call Health & Alert Quality Optimization: Designed and governed 24×7 incident management and on-call practices, defining clear escalation paths, incident command roles, and RCA standards. Improved alert quality and reduced on-call fatigue, strengthening response effectiveness during major production incidents.

AI-Assisted Alert Noise Reduction & On-Call Optimization: Led adoption of AI-assisted anomaly detection and alert correlation to reduce false positives and alert noise. Improved signal-to-noise ratio, accelerated incident triage, and enabled engineers to focus on high-impact reliability issues.

DevSecOps & Shift-Left Reliability Enablement: Owned the integration of automated security and quality controls into CI/CD pipelines, embedding reliability and compliance checks early in the delivery lifecycle. Reduced production risk while maintaining delivery velocity in regulated environments.

Certification

AWS Certified Solutions Architect - Associate
AWS Certified SysOps Administrator
Microsoft Azure Administrator Associate
Microsoft Azure Architect Design
Microsoft Azure Infrastructure Solutions
VMware Certified Professional
Red Hat Certified Specialist in OpenShift Application Development
HashiCorp Certified: Terraform Associate
Certified Kubernetes Administrator (CKA)
LogicMonitor Certified Associate
PagerDuty Certified Foundational
PagerDuty Certified API Specialty
PagerDuty Certified Incident Responder
New Relic Full Stack Certified
Python Administrator
Microsoft Certified Technology Specialist
Microsoft Certified IT Professional
Microsoft Certified Professional
Microsoft Certified Implementing Virtualization with Windows Server Hyper-V and System Center
Microsoft Certified Server

Websites

https://www.linkedin.com/in/vinoth-subbiah/
