
Abdul Mazeed Mahammad

Hyderabad, TG

Summary

Engineering leader with 14+ years of experience across Site Reliability Engineering (SRE), Observability, Cloud Platforms, and Enterprise Software Development, including 6+ years leading and managing teams of 8–10+ engineers. Proven expertise in building and operating large-scale logging, monitoring, and observability platforms using Splunk, Datadog, Grafana, Azure Monitor, Dynatrace, and OpenTelemetry. Strong background in Azure, AWS, and GCP, CI/CD automation, vendor management, and agile delivery. Seeking a Manager, Software Engineering – Observability role where I can drive reliable, scalable, and high-performing systems while enabling team growth and continuous improvement.

Overview

15 years of professional experience

Work History

Associate Director - SRE

UBS Hyderabad
Hyderabad
04.2025 - Current

Project 1

UBS Hyderabad

Technologies

  • Cloud & Platform: Azure Cloud, Azure Monitor, Log Analytics, Azure App Insights
  • Observability: Splunk, Grafana, OpenTelemetry (concepts), Datadog (logs/metrics)
  • Containers & DevOps: Kubernetes (AKS), Docker, Azure CI/CD, Git
  • Automation & Scripting: Python, Shell
  • OS & Infra: Enterprise Linux
  • ITSM & Agile: ServiceNow, Jira, Agile / ITIL

Project Description:

UBS is a global financial services organization delivering wealth management, asset management, and investment banking platforms. The systems supported are mission-critical, high-availability financial and risk platforms with strict regulatory and security requirements.

Roles and responsibilities:

  • Act as Technical Lead / Engineering Lead for observability and reliability initiatives supporting business-critical financial platforms.
  • Lead a team of SREs by providing technical direction, task prioritization, mentoring, and day-to-day operational leadership.
  • Designed and enhanced enterprise observability dashboards and alerts using Azure Monitor, Log Analytics, Splunk, and Grafana, improving visibility and incident response.
  • Implemented proactive alerting strategies aligned with SLIs and SLOs, reducing incident noise and improving system reliability.
  • Supported and monitored containerized workloads on Kubernetes (AKS), ensuring platform stability and scalability.
  • Developed Python-based automation tools to detect, triage, and resolve recurring production issues, significantly reducing MTTR.
  • Built and maintained Azure CI/CD pipelines, ensuring safe, repeatable, and reliable deployments.
  • Led major incident response and root-cause analysis, driving corrective and preventive actions in collaboration with engineering and product teams.
  • Partnered closely with product management, development, security, and vendor teams to align reliability goals with business priorities.
  • Played a key role in vendor transition and stabilization activities, including SOP creation, documentation, and knowledge transfer.
  • Ensured compliance with ITIL processes (Incident, Change, Problem Management) and internal audit requirements.
  • Regularly communicated system health, risks, and improvement plans to senior stakeholders, demonstrating strong customer and service orientation.


Application Consultant - SRE

IBM, Hyderabad
Hyderabad
06.2022 - 04.2025

Project Description:

Electrolux products are sold under a variety of brand names (including its own) and are primarily major appliances and vacuum cleaners intended for home consumer use.

I joined this project as an SRE when it was taken over from another vendor and played a critical role in getting the application-related knowledge transfers (KTs) completed successfully.

Roles and responsibilities:

  • Created JMeter performance testing scripts for SAP UI, identified performance bottlenecks, and collaborated with engineering teams on issue resolution.
  • Initiated and led performance testing activities for FSM (Field Service Management) applications, successfully delivering two major performance initiatives into production.
  • Designed and implemented Azure dashboards and proactive alerting solutions to improve system visibility and early issue detection.
  • Provided technical leadership to a team responsible for 24x7 production support, as well as application design and development activities.
  • Responded to application and system performance issues with rapid analysis and resolution to ensure service availability.
  • Monitored production systems and services using Splunk, Stackdriver, Azure Application Insights, and Datadog.
  • Acted as Subject Matter Expert (SME) for E-Commerce Catalog, Product Design & Development, and SAP C4C and FSM applications.
  • Designed and developed automation tools to improve operational support capabilities and reduce manual effort.
  • Ensured the support team consistently followed best practices while triaging Critical and High-priority production issues.
  • Reviewed, prioritized, and coordinated fixes for production issues, ensuring weekly releases were delivered successfully.
  • Actively participated in defining and executing action plans for escalated, customer-impacting, and critical issues.
  • Mentored and guided team members to create knowledge base articles and technical documentation for key support topics.
  • Owned post-production defects and change requests, ensuring appropriate prioritization and timely resolution.
  • Analyzed and presented trend metrics and operational KPIs to senior management to drive continuous service improvement.
  • Partnered with cross-functional support teams across the organization to continuously improve production stability and reliability in a 24x7 environment.

Associate Consultant - SRE

Tata Consultancy Services, Hyderabad
Hyderabad
06.2019 - 06.2022

Shop & Browse

GCP, Splunk, Java, Spring Boot, Cloud Spanner, CI/CD, Jenkins, Azure DevOps

Project Description:

Macy's, Inc. is one of the nation's premier omnichannel retailers, operating stores and websites under two brands, Macy's and Bloomingdale's. The company sells a range of merchandise, including apparel and accessories for men, women, and children, cosmetics, home furnishings, and other consumer goods. We develop and support the website applications of Macy's (www.macys.com) and Bloomingdale's (www.bloomingdales.com).

Shop & Browse is a web application responsible for rendering the catalog browse experience for macys.com. The goal is to deliver the faceted category browse and PDP pages to customers for a more dynamic experience. NavApp interacts with Fast Common Catalog (FCC) Services via REST calls, then processes the results with its supporting modules, which include caching with Akamai.

When a request arrives from a user's browser, it is first checked against Akamai and then routed to the respective xapi based on the URL pattern. The xapi calls downstream application services as needed, assembles the results into a single JSON object, and hands it over to the UI.

Roles and responsibilities:

  • Provided technical leadership to a cross-functional team responsible for 24/7 production support, website operations, and application design & development.
  • Led rapid incident response and root cause analysis to address application and system performance issues, minimizing downtime and customer impact.
  • Proactively monitored production systems and services using tools such as Splunk and Google Stackdriver, ensuring high availability and reliability.
  • Acted as a Subject Matter Expert (SME) for E-Commerce Catalog systems, product design, and end-to-end application development.
  • Designed and developed automation tools and scripts to eliminate manual operational tasks and improve support efficiency.
  • Ensured adherence to best practices for incident triage, particularly for Critical and High-priority (P1/P2) production issues.
  • Reviewed, prioritized, and coordinated defect fixes and enhancements, managing weekly production releases with minimal risk.
  • Actively participated in defining and executing action plans for escalated and customer-critical issues, collaborating with stakeholders.
  • Mentored and motivated team members to build knowledge base articles, runbooks, and technical documentation for recurring issues.
  • Owned post-production defects and change requests, ensuring accurate prioritization and timely resolution.
  • Analyzed and presented trend metrics, incident patterns, and SLA performance to senior management to drive continual service improvement (CSI).
  • Partnered with multiple internal support and engineering teams to continuously improve stability, resilience, and 24/7 operational readiness of production systems.

Lead - SRE

Nisum Technologies
04.2018 - 06.2019

Project Description:

Macys.com is an online e-commerce application that allows users to purchase products online. As part of the UFT (Unified Fast Track) team, we worked on enhancements and changes to the macys.com application. Like other e-commerce applications, macys.com maintains user accounts, allowing customers to create and manage profiles, check out, manage registries, and subscribe to newsletters. It uses email services to keep in touch with customers and to keep them updated on the latest offers and promotional discounts.

Roles and responsibilities:

  • Performed requirement analysis and prepared documentation.
  • Worked on site enhancements using Java.
  • Tested SOAP and REST web services.
  • Developed proofs of concept (POCs).
  • Participated actively in bug fixing to deliver a bug-free application.
  • Coordinated with QA and Support teams.

Site Reliability Engineer

OSI Technologies
02.2011 - 04.2018

Education

MCA

Noble PG College

Skills

  • Cloud Platforms: Azure, AWS, Google Cloud Platform (GCP)
  • Containers & Orchestration: Kubernetes, Docker
  • Infrastructure as Code & Automation: Terraform, Ansible
  • Observability & Monitoring: Splunk, Datadog, Grafana, Dynatrace, Azure Monitor, Log Analytics
  • CI/CD & DevOps Tools: Azure DevOps, Jenkins, Git, GitLab
  • Programming & Scripting: Java (Core & Advanced), Spring MVC, Python, Shell Scripting
  • Web & Integration: REST, SOAP, Servlets, JSP, JDBC, JMS
  • Application Servers: JBoss, WebSphere, Tomcat
  • Databases: Oracle, MySQL, DB2
  • UI Technologies: HTML, CSS, AJAX, jQuery, Mustache/Handlebars
  • Networking: TCP/IP, HTTP, F5, ASM
  • ITSM & Agile: Jira, ServiceNow, Agile/Scrum, ITIL

PROJECTS DETAILS

Project 1: UBS Hyderabad
Technologies: Azure Cloud, Azure Monitor, Log Analytics, Azure App Insights, Splunk, Grafana, OpenTelemetry (concepts), Datadog (logs/metrics), Kubernetes (AKS), Docker, Azure CI/CD, Git, Python, Shell, Enterprise Linux, ServiceNow, Jira, Agile / ITIL

Project 2: Electrolux
Technologies: Azure, Java, SAP C4C, FSM, Datadog

Project 3: Shop & Browse
Technologies: GCP, Splunk, Java, Spring Boot, Cloud Spanner, CI/CD, Jenkins, Azure DevOps

Project 4: Macys - Unified Fast Track team
Client: Macys, U.S.A.
Technologies: JavaScript, Java 8, Spring, jQuery, Web Services, Maven, JDBC Module, AWS
Team Size: 10

Project 5: Macys - NavApp (Navigation Application)
Tools: Java, Spring, JSP, JMS, Web Services (SOAP & REST), TOAD, Eclipse, JBoss, SVN, Splunk, Dynatrace, SoapUI, Akamai, BDB

Project Description:

Macy's, Inc. is one of the nation's premier omnichannel retailers, operating stores and websites under two brands, Macy's and Bloomingdale's. The company sells a range of merchandise, including apparel and accessories for men, women, and children, cosmetics, home furnishings, and other consumer goods. We develop and support the website applications of Macy's (www.macys.com) and Bloomingdale's (www.bloomingdales.com).

NavApp is a web application responsible for rendering the catalog browse experience for macys.com. The goal is to deliver the faceted category browse and PDP pages to customers for a more dynamic experience. NavApp interacts with Fast Common Catalog (FCC) Services via REST calls, then processes the results with its supporting modules, which include caching with Berkeley DB (BDB) and Akamai. On requests from a user's browser, NavApp calls FCC Services as needed, wraps the results in its domain context, and returns views for rendering as JSP pages in the browser.

Roles and responsibilities:

  • Collaborated effectively with global team members and business stakeholders across regions to perform requirements gathering, analysis, and clarification.
  • Reviewed and analyzed production issues based on severity levels (P1, P2, P3) and SLA commitments, ensuring timely resolution and minimal business impact.
  • Coordinated closely with onsite teams and managed ticket and task transitions to ensure seamless support and ownership continuity.
  • Ensured audit and compliance requirements were met by maintaining accurate tracking of issues and changes using JIRA Bug Tracker and First Choice ticketing systems.
  • Supported major releases and critical production launches by performing application-layer log monitoring using Splunk and Dynatrace.
  • Followed Agile/Scrum methodologies throughout the development and release lifecycle to ensure iterative delivery and quality outcomes.
  • Played a key role in infrastructure capacity planning and scale-up activities during peak retail seasons and high-traffic events over multiple years.
  • Analyzed and addressed security-related issues using Splunk and Akamai, contributing to improved application security and compliance.
  • Ensured site stability and availability by efficiently handling SOC calls and major incident bridges.
  • Coordinated with Project Managers, Business stakeholders, Development, and QA teams to drive effective delivery and issue resolution.
  • Designed and implemented Splunk dashboards and alerts based on raw log data, enabling proactive monitoring and faster incident detection.
  • Actively involved in bug fixing, application enhancements, and release validations to improve overall system quality.
  • Participated in code reviews, ensuring adherence to coding standards, performance best practices, and reliability requirements.
  • Contributed to continuous improvement initiatives by identifying recurring issues and driving preventive solutions.

PROFESSIONAL SYNOPSIS

  • Engineering Manager / Technical Lead with 14 years of experience across Site Reliability Engineering (SRE), Observability, Cloud Platforms, and Enterprise Software Development, including 6+ years leading and managing teams of 8–10+ engineers in 24x7 production environments.
  • Proven expertise in people management, including task allocation, performance reviews, coaching, mentoring, and career development, ensuring high-performing and motivated engineering teams.
  • Strong experience in driving delivery and operational excellence, ensuring engineering outcomes are delivered on time, within scope, and aligned with business priorities.
  • Extensive hands-on experience building and operating large-scale observability platforms, including logging, metrics, tracing, dashboards, and alerting, using Splunk, Datadog, Grafana, Dynatrace, Azure Monitor, and OpenTelemetry.
  • Demonstrated ability to partner closely with Product Management, Engineering, DevOps, and business stakeholders to define technical strategies and improve reliability, scalability, and performance of enterprise applications.
  • Proven track record of leading incident management and major outage response, driving root-cause analysis, corrective actions, and continuous improvement initiatives.
  • Hands-on experience implementing SRE best practices, including SLIs, SLOs, SLAs, and error budgets, to balance feature delivery with system reliability.
  • Strong background in automation and engineering productivity, designing solutions to proactively detect, triage, and resolve production issues, significantly reducing MTTR.
  • Solid understanding of modern distributed systems and web architectures, including TCP/IP, HTTP, DNS, load balancing, routing, and CDN-based delivery models.
  • Extensive experience managing and optimizing Akamai CDN, including edge configuration, caching strategies, and performance tuning to enhance global user experience.
  • Hands-on experience with Google Kubernetes Engine (GKE) and Cloud Spanner, supporting scalable cloud-native workloads.
  • Deep expertise in observability and monitoring, with hands-on experience creating actionable dashboards and alerts using Splunk, Datadog, Dynatrace, and Akamai.
  • Strong background in architecture, design, development, and maintenance of RESTful web services and server-side applications using the Java / J2EE stack.
  • Experience working with application servers such as JBoss, Tomcat, and Jetty, and source control systems including Git and SVN.
  • Extensive experience leveraging AWS services (EC2, S3, RDS, Lambda) and implementing infrastructure automation using Terraform and CloudFormation to build scalable, cost-efficient systems.
  • Designed and deployed secure, compliant cloud architectures using best practices around IAM, VPC networking, and encryption.
  • Strong multi-cloud experience across Azure, AWS, and Google Cloud Platform.
  • Functional expertise in large-scale retail business applications, including Category Management, Product Management, and Marketing Content Management systems.
  • Strong experience in Enterprise Linux environments, including advanced troubleshooting such as thread and heap dump analysis.
  • Well-versed in ITIL service operations, including Incident, Change, and Problem Management.
  • Demonstrated ability to lead and mentor teams of 10+ engineers, conduct knowledge-sharing sessions, and foster a culture of continuous learning and improvement.
  • Highly effective collaborator with cross-functional teams (Engineering, QA, DevOps, Product) to ensure seamless delivery and faster incident resolution.