NagaRaju N

Hyderabad

Summary

Cloud Operations Engineer / Site Reliability Engineer with nearly 4 years of experience supporting GCP-based microservices. Skilled in GCP / GKE, Prometheus, Grafana, Linux, Jenkins, Spinnaker, and Bitbucket. Builds and maintains monitoring and alerting systems that support frequent, safe production deployments. Proven track record of handling on-call responsibilities and coordinating across teams to improve system reliability, performance, and cost-efficiency. Committed to continuous improvement and optimization of cloud infrastructure.

WORK PROFILE

  • Organization : TCS
  • Designation : Cloud Operations Engineer
  • Total Experience : 3 years 10 months (May 2022 - Present)
  • Project : Home Depot

Work History

  • Designed and maintained Prometheus metrics and Grafana dashboards, and enhanced the log ingestion dashboards so teams could quickly identify noisy services, control costs, and reduce the risk of excessive logging. The dashboards surfaced the top 5 noisy services, leading to a 30% reduction in unnecessary log volume.
  • Standardized alert channels and policies to enhance incident response and operational efficiency. Consolidated fragmented alerting channels into a centralized Slack workflow, reducing alert fatigue by 40%.
  • Orchestrated CI/CD pipelines via Jenkins and Spinnaker for GCP/GKE applications, improving production deployment frequency from bi-weekly to daily.
  • Supported production workflows for OM applications, resulting in a 20% faster lead time for code-to-production changes.
  • Managed deployment activities by validating post-deploy health using dashboards and alerts, and helping coordinate rollbacks when necessary.
  • Executed branching strategies and facilitated pull requests using Bitbucket to enhance code integration.
  • Conducted thorough investigations using Grafana, Prometheus, and GCP logs to pinpoint issues, and collaborated with application teams on effective fixes.
  • Participated in the on-call rotation, responding to production incidents impacting order creation, payments, and fulfillment while maintaining a 98% uptime SLA.
  • Developed and maintained 4+ comprehensive runbooks and SOPs for critical alerts, standardizing incident response procedures and reducing MTTR.
  • Executed database management for SQL and PostgreSQL systems, efficiently resolving Jira tickets to insert and update critical data.
  • Engaged with cross-functional reliability teams to exchange best practices for optimizing alert design and incident response strategies.

Projects

Log Ingestion Alerting setup in Grafana

  • Defined the alerting strategy, dashboards, and monitoring for log ingestion across services on GCP, leveraging Cloud Logging, Prometheus, and Grafana.
  • Implemented alerts on Grafana dashboards to identify services exceeding thresholds, with a clear runbook for engaging the respective app teams and reducing log noise.

Grafana metrics and alerts for order trend monitoring

  • Helped configure Grafana metrics and alerts for order trend monitoring by fulfillment type, payment type, and top SKUs, using a BigQuery data source.
  • Set up alerts to detect a 30% deviation in order volumes compared to the previous day or week, enabling early detection of order flow issues and coordinated validation across teams.

Certification

Google Cloud Certified Associate Cloud Engineer (ID: 824c2eaf823d47e4be493e0973788e27)

Skills

  • Google Cloud Platform (GCP) - GKE, Pub/Sub, Cloud SQL, Cloud Logging, Cloud Monitoring
  • Grafana & Prometheus - metrics, alert rules, dashboards, alerts
  • Jenkins & Spinnaker - CI / CD and deployment for GCP / GKE
  • Bitbucket - Git-based workflows, branching, pull requests

Education

Bachelor of Technology -

Vishnu Institute of Technology
Andhra Pradesh
05-2021