Nagesh B S

Site Reliability Engineer / DevOps Manager

Summary

Innovative professional with 16+ years of techno-managerial experience in DevOps and Site Reliability Engineering (SRE), infrastructure automation, OpenShift, Kubernetes, Docker, AWS, Azure, GCP, Python, shell scripting, Ansible, performance and capacity management, IT operations management, project management, application management, DBaaS team management, budget and demand management, Agile (Scrum Master), Unix/Windows system administration, storage administration, networking, Kafka, and training consultancy, delivering optimal results and business value in high-growth environments. Seeking an SRE/DevOps Lead role where I can apply these skills to the best of my abilities.

Overview

17 years of professional experience
7 years of post-secondary education
9 certifications

Work History

Site Reliability Engineering / DevOps Manager

Hitachi Digital Services
2022.03 - Current
  • Provide technical support for enterprise-level application systems
  • Research, diagnose, and troubleshoot application and infrastructure issues, identify potential solutions, handle live production incidents, and follow and implement SRE best practices
  • Automation-First mindset and experience in leading a team toward this.
  • Self-motivated, able and willing to help where help is needed.
  • Monitor application performance, take steps to improve overall application performance and stability and follow through with implementation
  • Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.
  • Work with the Engineering, Product, Delivery and Architecture teams to ensure that appropriate attention is given to Reliability Engineering.
  • Perform root cause analysis to identify the reasons for the underlying issue
  • Monitor all alerts related to applications and provide proactive services
  • Prepare and maintain documentation for all issues and the implemented solutions
  • Drive SRE education across the wider team to improve quality and reliability.
  • Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.

Site Reliability Engineer / DevOps Lead

Eurofins IT solutions India Pvt Ltd.
2021.03 - 2022.03
  • Drive best practices in Site Reliability Engineering and ensure secure, scalable, performant, and highly available services
  • Drive SRE education across the wider team to improve quality and reliability.
  • Define a comprehensive set of SLIs, SLOs, and error budgets that are used to drive capacity decisions and respond to performance concerns
  • Share knowledge with coworkers and other extended teams.
  • Respond effectively to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team.
  • Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.
  • Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing.
  • Drive process and runbook documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement.
  • Work with the recruiting team to attract, onboard, and retain diverse top talent.
  • Passionate and driven to improve the customer experience through solving problems which impede reliability, resiliency and responsiveness.
  • Automation-First mindset and experience in leading a team toward this.
  • Self-motivated, able and willing to help where help is needed.
  • Able to build relationships, be culturally sensitive, have goal alignment, have learning agility.

Site Reliability Engineering Team Manager

Danske IT And Support Services India Pvt Ltd
2020.06 - 2021.02
  • Build architecture roadmap for team and accelerate Operation as a Service enablement with CI/CD and automation.
  • Drive architecture and design discussions with cross-functional teams and work with geographically distributed teams.
  • Take on and solve challenging problems, and empower team members.
  • Automation-First mindset and experience in leading a team toward this.
  • Self-motivated, able and willing to help where help is needed.
  • Able to build relationships, be culturally sensitive, have goal alignment, have learning agility.
  • Assist in infrastructure migration process to IaC (Infrastructure as Code).
  • Build and deploy monitoring and alerting systems across our entire infrastructure.
  • Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.
  • Work with the Engineering, Product, Delivery and Architecture teams to ensure that appropriate attention is given to Reliability Engineering.
  • Ensure the means exist to quickly recover a degraded service (instrumentation, runbooks, tooling, etc.).
  • Drive SRE education across the wider team to improve quality and reliability.
  • Significant experience working through the definition, design, release and run cycle of software products to markets - using Agile/Scrum methodologies.
  • Experience with DevOps, ITIL, Cloud Services, IT Infrastructure and Operations, including environment standup, server builds, firewalls, security and regulatory compliance.
  • Experience implementing and managing logging, monitoring and alerting frameworks for hybrid cloud or third-party services using AppDynamics, Splunk, and Datadog.
  • Experience with agile development (Scrum, Kanban, etc.) and within an agile project team (agile in ability to perform cross-functional tasks quickly), balancing multiple projects and collaborating closely with other development teams.
  • Design and implement an environment that is reliable, resilient, observable and sustainable to maintain.
  • Define a comprehensive set of SLIs, SLOs, and error budgets that are used to drive capacity decisions and respond to performance concerns.
  • Collaborate with and support multiple teams to incorporate features, provide technical answers, recommendations and deliverables in a timely manner.
  • Demonstrate passion for team success, personal growth and being part of something big.
  • Share knowledge with coworkers and other extended teams.
  • Respond effectively to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team.
  • Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.
  • Work with the recruiting team to attract, onboard, and retain diverse top talent.
  • Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing.
  • Member of the Center of Excellence, driving best practices for DevOps and Site Reliability Engineering.

Site Reliability Engineer / DevOps Lead

Amadeus Software Labs India PVT LTD
2015.12 - 2019.11
  • Directly manage a large geographically distributed team of talented Site Reliability Engineers
  • Manage cross-functional teams such as RES-OPS, Factory Operation, Application Management, Automation, Capacity Management, Demand and Budget, DBaaS, and DevOps
  • Lead the DevOps process by working closely with Engineering Management in an agile process
  • Drive best practices in Site Reliability Engineering and ensure secure, scalable, performant, and highly available services
  • Respond effectively to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team
  • Collaborate with various internal teams to provide a high-quality customer experience and support
  • Escalate issues as needed to product development or service engineering team per documented procedures, while at the same time establishing a contingency plan to eliminate any intermittent service disruption
  • Maintain a maniacal focus on high availability, performance, scalability, and security of mission-critical production services, 24x7x365, keeping SLAs green and CSAT high
  • Participate in major incident resolution, driving recovery and handling executive-level communications
  • Mentor, coach, and develop a high-performing Site Reliability team
  • Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams
  • Interface with Dev/QA/Ops teams to perform root cause analysis and re-instrument triggers to prevent future outages
  • Drive process and runbook documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
  • Troubleshoot issues across the entire stack: hardware, software, application and network
  • Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization
  • Present periodic updates to senior management on impairments, mitigation opportunities and progress
  • Handle communication and provide transparency on major site issues to the executive management team
  • Manage on-call rotations and provide inputs to the team and partners to sustain SLAs
  • Document root cause analysis reports and develop standard operating procedures
  • Work with the recruiting team to attract, onboard, and retain diverse top talent
  • Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing
  • Demonstrable knowledge of Linux operating system internals, cloud hosting technologies (e.g., AWS, Azure, Google), containerization platforms (Docker, Kubernetes), and architecting and implementing CI/CD structures
  • Experience operating applications on public and private cloud solutions
  • Evaluate various architectural solutions and implementations, and support development and deployment of solutions as determined by the SRE team
  • Ensure effective implementation of the department budget. Prepare financial statements and monthly forecasts and reports. Analyze monthly financial performance and make budget and new technology recommendations
  • Identify and contribute to solutions for reducing service outages, reducing alert noise, improving monitoring, and helping define, and helping our services reach, Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • Improve the availability and responsiveness of internal and external components and platforms through the application of engineering best practices, tooling and instrumentation advances, and cross-organizational coordination
  • Passionate and driven to improve the customer experience through solving problems which impede reliability, resiliency and responsiveness
  • Automate, automate, automate. Work closely with the Automation team, with knowledge of configuration tools like Puppet, Chef, or Ansible
  • Managed a 2+ million euro supply chain automation project on the PEGA platform, saving 8,400 hours of effort each year
  • Managed a 2+ million euro database migration project with new technologies
  • Oversee multiple projects across all phases of development
  • Created a virtualization roadmap to reduce the physical server count by 80% and deliver ongoing annual savings of 1+ million euro through the virtualization of all physical servers
  • Participate in the management of full life cycle product development, including analysis and planning related to product development, launch and deployment
  • Monitor technical and engineering progress to ensure strategies, goals and objectives are met. Align operational plans with business objectives. Communicate changes to all affected personnel
  • Provide leadership and direction to SRE staff who are responsible for break-fix, uptime and reliability for core services, distribution, and customer access network elements and related interfaces
  • Involved in the RHEL5 Deco and Hardware Replacement project, where I was responsible for planning and procuring hardware and software, creating the server migration schedule, managing resources, and keeping the procurement plan aligned with the migration plan and budget
  • Involved in the Stack7, Stack5 and EPL Deco project, where I led planning efforts: collecting requirements, H/W selection, capacity assessment, defining and sequencing activities to form the schedule, coordinating with the budget team, and ensuring that the plan was executed
  • Involved in the Stack8 Database project for H/W selection, capacity planning, H/W procurement and automation from serifa
  • Involved in DBaaS Kanban and Scrum Meetings
  • Create a governance structure with daily, weekly and monthly reports and provide snapshot of the same to internal and external stakeholders
  • Update runbooks, tools and documentation to help prepare on-call teams for future incidents.

Technical Specialist

HCL Technology Limited -IOMC
2012.10 - 2015.12
  • Experience working with teams and customers across multiple geographies and cultural backgrounds
  • Project management experience running a 24/7 operations team while keeping SLAs green and CSAT high
  • Knowledge of preparing detailed IR (Incident Report) and RCAs (Root Cause Analysis)
  • Create shift plans that account for business criticality and team member vacations
  • Experience in managing customer escalations effectively
  • Guide the Shift leads and L1/L2 teams on following the processes
  • Establish and fine-tune operations/support processes, define SLAs and the handover process, and create the necessary templates
  • Experience in creating and presenting dashboards and reports for customers and senior management
  • Hands-on experience managing risks and stakeholder expectations, and liaising with the quality assurance team for 360-degree service delivery
  • Identify gaps and risks around people, process and technology and drive continuous improvements
  • Take ownership of platform maintenance, SW and HW vendor relations, license expiry, certificate expiry
  • Coordinate Standard, Normal and Emergency changes with the Operations and Engineering teams
  • Log change requests on time for production deployment and obtain timely approvals from the customer and CAB
  • Drive the engines for Service Requests, Incidents, Problems and take timely action on corrective and proactive measures
  • Create a governance structure with daily, weekly and monthly reports and provide snapshot of the same to internal and external stakeholders
  • Manage a team of site reliability engineers to deliver cloud infrastructure and production operations for Unilever Consumer Goods services
  • Plan and manage capex/opex budgets that improve and sustain reliability
  • Experience hiring, mentoring, and training a staff of engineers
  • Guide/Mentor team members in troubleshooting application/web/system related issues
  • Provide technical leadership for a hybrid private/public cloud enterprise solution
  • Expert in performance and people management with a focus on mentoring and motivating engineers
  • Involved in the AIX Cloud Migration project, whose objective was to migrate the AIX operating system from old POWER4/p5 machines to POWER7 private cloud infrastructure
  • Worked with offshore resources delivering SRE and capacity management functions for different technologies (Unix, Wintel, DB, storage, etc., and cloud technologies such as Amazon Web Services, MS Azure, CLM, etc.) across different landscapes globally
  • Support BAU operations (Storage Uplift, Memory Uplift etc.)

Senior Operations Professional

IBM India Pvt Ltd
2012.03 - 2012.09
  • Track record of driving extremely high levels of availability for web services with resilient architecture, scalable infrastructure, technical operations automation, 360 degrees of application performance monitoring, and a highly trained operations staff
  • Create and influence system design, standards, and processes that improve production reliability
  • Drive excellence for reliability through maintenance of SLAs, efficient process, automation development, engineering reliability back into applications and maximizing performance
  • Communicate effectively and present team progress to upper management
  • Support and educate DevOps teams and consumers on using the standardized infrastructure services
  • Solid foundation in Linux or Windows administration and troubleshooting
  • Be responsible for the overall uptime and performance of critical application cloud services
  • Identify recurring problems and build the tools and processes to prevent problems from recurring
  • AIX 4 Deco and H/W Replacement project (project coordinator for the client CNA Insurance): the main objective was to move all the database servers from their current stack to the latest version in order to prevent security breaches and ensure compliance with PCI standards
  • Assess actual vs. planned spend and trigger the appropriate risk response
  • Manage stakeholder expectation and project information
  • Monitor project performance, manage project communication

Training Consultant & System Administrator

IBM Partners & Clients
2008.03 - 2012.03
  • Delivered IBM AIX training for IBM internal and external clients across India and abroad, including Dubai, Colombo, and Abu Dhabi
  • System Administration for UNIX server environment which includes IBM AIX 6.1, 5.3, and 5.2
  • Server performance and Capacity analysis
  • Providing server performance statistics for issue analysis and future growth planning
  • Server sizing, server modeling, and projection of workload on different server configurations
  • Preparing monthly and quarterly health check reports for servers and providing recommendations
  • Provide L2&L3 support as Unix System Administrator and performance analyst
  • Supported more than 700 AIX servers with 24/7 on-call support
  • Worked with Remedy Ticketing Tool to Resolve the UNIX issues
  • Installing new software / file sets, packages and other third-party applications when required by the developer or business
  • Applying technology levels (TL/ML), fixes, emergency fixes, and APAR updates
  • Disk management, LVM administration and file system management
  • User and group management, including sudo access requested by customers
  • Worked on HMC, AMM and Lantronix consoles to manage the servers
  • Good experience administering HMC for IBM System p; worked on versions 3 to 7.3.2. Good understanding of logical partitioning (LPAR) and DLPAR
  • Backup and restore of files, volume groups (VGs), etc.
  • Managing processor, memory utilization and customizing swap file system
  • Managing file systems which include creating, mounting, setting ACL permission
  • Managing files and directory permissions
  • Configuring and maintaining networking services (NFS, SSH, RSH, FTP)
  • Creating VGs and LVs, extending LVs, and mirroring VGs
  • Performance monitoring using vmstat, topas, nmon and netstat
  • Maintaining LPAR, WPAR, HMC, HACMP, VIOS, and IBM pSeries servers
  • Configuring and troubleshooting VG, LPAR, WPAR, RAS, and security on the IBM AIX platform
  • Scheduled periodic and future jobs using crontab and at

System Administrator

Accenture Pvt Ltd, M/S Crux Consultants
2007.01 - 2008.03
  • Monitoring performance of Unix and Windows servers
  • Creating VGs and LVs, and extending and mirroring VGs
  • Manage, monitor, install, upgrade, configure, and troubleshoot servers and network equipment
  • Provide server support and maintenance, using various utilities to troubleshoot, repair, and check server configuration
  • Running performance monitoring commands (nmon, vmstat, iostat)
  • Managing paging space
  • Remote desktop support through LogMeIn Rescue
  • Hardware troubleshooting using third-party tools
  • Installing software applications and fine-tuning PCs
  • Remote monitoring and management of Windows Server 2003 and 2008 machines and Active Directory-based domains

Education

M.B.A - Information Technology

Sikkim Manipal University
Sikkim
2009.07 - 2012.07

Bachelor of Engineering - Electronics & Communication

Visvesvaraya Technological University
Karnataka
2002.06 - 2006.06

Skills

DevOps and Site Reliability Engineering

Docker, Kubernetes

Linux, Windows, Storage, Networking, Load Balancers

Cloud: AWS, Azure, GCP, OpenShift, OpenStack

SAFe Agile and Certified Scrum Master role, Kanban, Lean

ITIL V3/V4, Service Design

Team Management

Jira, Confluence, SharePoint, Microsoft Office, Project

Programming languages: Python, Java, C, Angular

Configuration tools: Puppet, Chef, Ansible, Terraform

Log index/search platforms: Splunk

Scripting: Python, Bash, Shell

Monitoring & data analysis: BCO, Prometheus, Grafana, Nagios, InfluxDB, AppDynamics, ELK, Datadog

CI/CD: Jenkins, GitLab, Bitbucket, Azure DevOps

Network troubleshooting: TCP, DNS, IPv6 and tcpdump

Operating systems: AIX, Linux (RHEL), Windows, Solaris, HP-UX

Databases: Oracle, MySQL, MongoDB, DB2

Web servers: Apache 2.x, iPlanet and IIS

App servers: WebLogic, WebSphere, Apache Tomcat, JBoss and GlassFish

Incident Management, Change Management, Problem Management

App Support and Release Management

RabbitMQ Messaging, Redis Caching, REST API services

Chaos Engineering

Certification

Microsoft Certified Technology Specialist on Windows Server 2008 Enterprise Network Infrastructure (MCTS: SR6257517)

Onsite Experience

Onsite work experience in Dubai, Abu Dhabi, Colombo, and Singapore.
Frequently travelled to England and Germany for project work.

Award/Appreciation

Received Star Award and Spot Award from Danske IT for the successful implementation of SRE/DevOps best practices.

Received High-Flyer Award for efforts towards decommissioning the old network, RHEL5, and 5+ year old hardware without any customer impact.
Received DREAM SQUADRON AWARD from Amadeus Labs for making GOPS an inspired, motivated and high-performance team.
Won the TPE Amadeus division-level Men's Volleyball Tournament in 2017.
Received Special Appreciation Award from the Manager for a significant role in the Stack5, Stack7 and Stack8 migration.
Received appreciation from senior company management for a client presentation on an e-commerce project; handled the technical discussion with the client, who was impressed by the in-depth knowledge of the subject.

Received Best Capacity Planner award at HCL from the client Unilever.

Received Best Team Lead award for transforming the team to an SRE/DevOps mindset.

Received Certificate for Amadeus Systems Security Training completion.

Additional Information

PERSONAL DOSSIER:
Date of Birth: 10 July, 1983
Passport No: Z5888325
Valid Till: Feb 3rd, 2030
Languages Known: English, Hindi and Kannada

Certifications

Microsoft Certified Technology Specialist on Windows Server 2008 Enterprise Network Infrastructure (MCTS: SR6257517)

IBM Certified System Administrator - AIX 6.1 (SR7764998)

ITIL Certified

ITIL Service Design Certified

SAFe 5 Certified

Certified Scrum Master

Splunk 7 Fundamentals Part 1 (eLearning), certificate of completion

AWS, Azure, Python, and DevOps (eLearning), certificates of completion

Certified Chaos Engineering Professional
