Nagesh B S

Site Reliability Engineer / DevOps Manager

Summary

Innovative professional with 16+ years of techno-managerial experience in DevOps and Site Reliability Engineering (SRE), infrastructure automation, OpenShift, Kubernetes, Docker, AWS, Azure, GCP, Python, shell scripting, Ansible, performance and capacity management, IT operations management, project management, application management, DBaaS team management, budget and demand management, Agile (Scrum Master), Unix/Windows system administration, storage administration, networking, Kafka, and training consultancy, delivering optimal results and business value in high-growth environments. Looking for an SRE/DevOps Lead opportunity where I can prove myself to the best of my abilities.

Work History

Site Reliability Engineering / DevOps Manager

Hitachi Digital Services

03.2022 - Current

Provide technical support for enterprise-level application systems.
Research, diagnose, and troubleshoot application and infrastructure issues, identify potential solutions, handle live production incidents, and follow and implement SRE best practices.
Automation-first mindset, with experience leading a team toward it.
Self-motivated, able and willing to help where help is needed.
Monitor application performance, take steps to improve overall performance and stability, and follow through with implementation.
Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.
Work with the Engineering, Product, Delivery, and Architecture teams to ensure that appropriate attention is given to reliability engineering.
Perform root cause analysis to identify the reasons for underlying issues.
Monitor all application alerts and provide proactive services.
Prepare and maintain documentation for all issues and the implemented solutions.
Drive SRE education across the wider team to improve quality and reliability.
Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.

Site Reliability Engineer / DevOps Lead

Eurofins IT Solutions India Pvt Ltd.

03.2021 - 03.2022

Drive best practices in Site Reliability Engineering and ensure secure, scalable, performant, and highly available services.
Drive SRE education across the wider team to improve quality and reliability.
Define a comprehensive set of SLIs, SLOs, and error budgets that are used to drive capacity decisions and respond to performance concerns.
Share knowledge learned with coworkers and other extended teams.
Effectively respond to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team.
Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.
Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing.
Drive process and runbook documentation to minimize mean time to repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement.
Work with the recruiting team to attract, onboard, and retain diverse top talent.
Passionate and driven to improve the customer experience by solving problems that impede reliability, resiliency, and responsiveness.
Automation-first mindset, with experience leading a team toward it.
Self-motivated, able and willing to help where help is needed.
Able to build relationships, be culturally sensitive, maintain goal alignment, and show learning agility.

Site Reliability Engineering Team Manager

Danske IT And Support Services India Pvt Ltd

06.2020 - 02.2021

Build the architecture roadmap for the team and accelerate Operation as a Service enablement with CI/CD and automation.
Drive architecture and design discussions with different cross-functional teams and work with geographically distributed teams.
Take on and solve challenging problems, and empower team members.
Automation-first mindset, with experience leading a team toward it.
Self-motivated, able and willing to help where help is needed.
Able to build relationships, be culturally sensitive, maintain goal alignment, and show learning agility.
Assist in the infrastructure migration process to IaC (Infrastructure as Code).
Build and deploy monitoring and alerting systems across the entire infrastructure.
Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.
Work with the Engineering, Product, Delivery, and Architecture teams to ensure that appropriate attention is given to reliability engineering.
Ensure the means exist to quickly recover a degraded service (instrumentation, runbooks, tooling, etc.).
Drive SRE education across the wider team to improve quality and reliability.
Significant experience working through the definition, design, release, and run cycle of software products brought to market, using Agile/Scrum methodologies.
Experience with DevOps, ITIL, cloud services, and IT infrastructure and operations, including environment standup, server builds, firewalls, security, and regulatory compliance.
Experience implementing and managing logging, monitoring, and alerting frameworks for hybrid cloud or third-party services using AppDynamics, Splunk, and Datadog.
Experience with agile development (Scrum, Kanban, etc.) within an agile project team (able to perform cross-functional tasks quickly), balancing multiple projects and collaborating closely with other development teams.
Design and implement an environment that is reliable, resilient, observable, and sustainable to maintain.
Define a comprehensive set of SLIs, SLOs, and error budgets that are used to drive capacity decisions and respond to performance concerns.
Collaborate with and support multiple teams to incorporate features and provide technical answers, recommendations, and deliverables in a timely manner.
Demonstrate passion for team success, personal growth, and being part of something big.
Share knowledge learned with coworkers and other extended teams.
Effectively respond to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team.
Mentor, coach, and develop a high-performing Site Reliability team. Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams.
Work with the recruiting team to attract, onboard, and retain diverse top talent.
Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing.
Part of the Center of Excellence, driving best practices for DevOps and Site Reliability Engineering.

Site Reliability Engineer / DevOps Lead

Amadeus Software Labs India PVT LTD

12.2015 - 11.2019

Directly manage a large, geographically distributed team of talented Site Reliability Engineers
Manage cross-functional teams such as RES-OPS, Factory Operation, Application Management, Automation, Capacity Management, Demand and Budget, DBaaS, and DevOps
Lead the DevOps process by working closely with Engineering Management in an agile process
Drive best practices in Site Reliability Engineering and ensure secure, scalable, performant, and highly available services
Effectively respond to monitoring alerts, incident tickets, change requests, problem records, email requests, and other channels coming in to the Site Reliability Engineering team
Collaborate with various internal teams to provide a high-quality customer experience and support
Escalate issues as needed to the product development or service engineering team per documented procedures, while establishing a contingency plan to eliminate any intermittent service disruption
Maintain a maniacal focus on high availability, performance, scalability, and security of mission-critical production services, 24x7x365, keeping SLAs green and CSAT high
Participate in major incident resolution, driving recovery and handling executive-level communications
Mentor, coach, and develop a high-performing Site Reliability team
Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams
Interface with Dev/QA/Ops teams to identify root causes and re-instrument triggers to prevent future outages
Drive process and runbook documentation to minimize mean time to repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
Troubleshoot issues across the entire stack: hardware, software, application and network
Identify gaps in processes, skills, tooling, and technology choices, and work with upper management to drive improvements within the organization
Present periodic updates to senior management on impairments, mitigation opportunities, and progress
Handle communication and provide transparency on major site issues to the executive management team
Manage on-call rotations and provide inputs to the team and partners to sustain SLAs
Document root cause analysis reports and develop standard operating procedures
Work with the recruiting team to attract, onboard, and retain diverse top talent
Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing
Demonstrable knowledge of Linux operating system internals, cloud hosting technologies (e.g., AWS, Azure, Google), containerization platforms (Docker, Kubernetes), and architecting and implementing CI & CD structures
Experience operating applications on public and private cloud solutions
Evaluate various architectural solutions and implementations, and support development and deployment of solutions as determined by the SRE team
Ensure effective implementation of the department budget. Prepare financial statements and monthly forecasts and reports. Prepare and analyze monthly financial performance and make budget and new technology recommendations
Identify and contribute to solutions for reducing service outages, reducing alert noise, improving monitoring, and helping define and reach Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for our services
Improve the availability and responsiveness of internal and external components and platforms through the application of engineering best practices, tooling and instrumentation advances, and cross-organizational coordination
Passionate and driven to improve the customer experience by solving problems that impede reliability, resiliency, and responsiveness
Automate, automate, automate. Work closely with the Automation team, with knowledge of configuration tools like Puppet, Chef, or Ansible
Managed a 2+ million euro supply chain automation project on the PEGA platform, saving 8,400 hours of effort each year
Managed a 2+ million euro database migration project using new technologies
Oversee multiple projects across all phases of development
Created a virtualization roadmap to reduce the physical server count by 80% and deliver ongoing annual savings of 1+ million euro through the virtualization of all physical servers
Participate in the management of full life cycle product development, including analysis and planning related to product development, launch, and deployment
Monitor technical and engineering progress to ensure strategies, goals, and objectives are met. Align operational plans with business objectives. Communicate changes to all affected personnel
Provide leadership and direction to SRE staff who are responsible for break-fix, uptime, and reliability for core services, distribution, and customer access network elements and related interfaces
Involved in the RHEL5 Deco and Hardware Replacement project, where I planned and procured hardware and software, created the schedule for the server migration, managed resources, and controlled the procurement plan to fit the migration plan and budget
Involved in the Stack7, Stack5, and EPL Deco project, where I led planning: collecting requirements, H/W selection, capacity assessment, defining and sequencing activities to form the schedule, coordinating with the budget team, and ensuring that the plan was executed
Involved in the Stack8 Database project for H/W selection, capacity planning, H/W procurement, and automation from serifa
Involved in DBaaS Kanban and Scrum Meetings
Create a governance structure with daily, weekly and monthly reports and provide snapshot of the same to internal and external stakeholders
Update runbooks, tools and documentation to help prepare on-call teams for future incidents.

Technical Specialist

HCL Technology Limited - IOMC

10.2012 - 12.2015

Experience working with teams and customers across multiple geographies and cultural backgrounds
Project manager experienced in running a 24/7 operations team while keeping SLAs green and CSAT high
Knowledge of preparing detailed Incident Reports (IRs) and Root Cause Analyses (RCAs)
Create shift plans that account for business criticality and team member vacations
Experience in managing customer escalations effectively
Guide the shift leads and L1/L2 teams in following processes
Establish and fine-tune operations/support processes, define SLAs and the handover process, and create the necessary templates
Experience creating and presenting dashboards and reports for customers and senior management
Hands-on experience managing risks and stakeholder expectations, and liaising with the quality assurance team for 360-degree service delivery
Identify gaps and risks around people, process and technology and drive continuous improvements
Take ownership of platform maintenance, SW and HW vendor relations, license expiry, certificate expiry
Coordinate Standard, Normal and Emergency changes with the Operations and Engineering teams
Get the change requests logged on time for production deployment, obtain approvals from customer and CAB on time for deployment
Drive the engines for Service Requests, Incidents, Problems and take timely action on corrective and proactive measures
Create a governance structure with daily, weekly and monthly reports and provide snapshot of the same to internal and external stakeholders
Manage a team of site reliability engineers to deliver cloud infrastructure and production operations for Unilever Consumer Goods services
Plan and manage capex/opex budgets that improve and sustain reliability
Experience hiring, mentoring, and training a staff of engineers
Guide/Mentor team members in troubleshooting application/web/system related issues
Provide technical leadership for a hybrid private/public cloud enterprise solution
Expert in performance and people management with a focus on mentoring and motivating engineers
Involved in the AIX Cloud Migration project. The objective was to migrate the AIX operating system from old POWER4/p5 machines to POWER7 private cloud infrastructure
Worked with offshore resources delivering SRE and capacity management functions across different technologies (Unix, Wintel, DB, storage, etc.; cloud technologies such as Amazon Web Services, MS Azure, CLM, etc.) and different landscapes globally
Support BAU operations (storage uplift, memory uplift, etc.)

Senior Operations Professional

IBM India Pvt Ltd

03.2012 - 09.2012

Track record of driving extremely high levels of availability for web services with resilient architecture, scalable infrastructure, technical operations automation, 360 degrees of application performance monitoring, and a highly trained operations staff
Create and influence system design, standards, and processes that improve production reliability
Drive excellence for reliability through maintenance of SLAs, efficient process, automation development, engineering reliability back into applications and maximizing performance
Communicate effectively and present team progress to upper management
Support and educate DevOps teams and consumers on using the standardized infrastructure services
Solid foundation in Linux or Windows administration and troubleshooting
Be responsible for the overall uptime and performance of critical application cloud services
Identify recurring problems and build the tools and processes to prevent problems from recurring
AIX 4 Deco and H/W Replacement project (project coordinator for the CNA Insurance client): the main objective was to move all database servers from their current stack to the latest version in order to prevent security breaches and ensure compliance with PCI standards
Assess actual vs. planned spend and trigger the appropriate risk response
Manage stakeholder expectation and project information
Monitor project performance, manage project communication

Training Consultant & System Administrator

IBM Partners & Clients

03.2008 - 03.2012

Delivered training on IBM AIX to IBM internal and external clients across India and abroad, including Dubai, Colombo, and Abu Dhabi.
System Administration for UNIX server environment which includes IBM AIX 6.1, 5.3, and 5.2
Server performance and Capacity analysis
Providing server performance statistics for issue analysis and future growth planning
Server sizing, server modeling, and projection of workload on different server configurations
Preparing monthly and quarterly health check reports for servers and providing recommendations
Provide L2 & L3 support as Unix system administrator and performance analyst
Supporting more than 700 AIX servers with 24/7 on-call support
Worked with the Remedy ticketing tool to resolve UNIX issues
Installing new software / file sets, packages and other third-party applications when required by the developer or business
Technology level (TL / ML) updates, fixes, emergency fixes, and APAR updates
Disk management, LVM administration, and file system management
User and group management, including sudo access requested by customers
Worked on HMC, AMM, and Lantronix consoles to manage the servers
Good experience administering HMC for IBM System p, having worked on versions 3 to 7.3.2. Good understanding of logical partitioning (LPAR) and DLPAR
Backup and restore of files, VGs, etc.
Managing processor and memory utilization and customizing the swap file system
Managing file systems, including creating, mounting, and setting ACL permissions
Managing file and directory permissions
Configuring and maintaining networking services (NFS, SSH, RSH, FTP)
Creating VGs and LVs, extending LVs, and mirroring VGs
Performance monitoring using vmstat, topas, nmon, and netstat
Maintaining LPAR, WPAR, HMC, HACMP, VIOS, and IBM p Series servers
Configuring and troubleshooting VG, LPAR, WPAR, RAS, and security on the IBM AIX platform
Scheduled periodic and future jobs using crontab and at

System Administrator

Accenture Pvt Ltd, M/s Crux Consultants

01.2007 - 03.2008

Monitoring performance of Unix and Windows servers
Creating VGs and LVs, and extending and mirroring VGs
Manage, monitor, install, upgrade, configure, and troubleshoot servers and network equipment
Provide server support and maintenance, using various utilities to troubleshoot, repair, and check server configuration
Running performance monitoring commands (nmon, vmstat, iostat)
Managing paging space
Remote desktop support through LogMeIn Rescue
Troubleshooting hardware using third-party tools
Installing software applications and fine-tuning PCs
Remote monitoring and management of Windows Server 2003 and 2008 machines and Active Directory based domains

Education

M.B.A - Information Technology

Sikkim Manipal University

Sikkim

07.2009 - 07.2012

Bachelor of Engineering - Electronics & Communication

Visvesvaraya Technological University

Karnataka

06.2002 - 06.2006

Skills

DevOps and Site Reliability Engineering

Docker, Kubernetes

Linux, Windows, Storage, Networking, Load balancer

Cloud: AWS, Azure, GCP, OpenShift, OpenStack

SAFe Agile, Certified Scrum Master role, Kanban, Lean

ITIL V3/V4, Service Design

Team Management

Jira, Confluence, SharePoint, Microsoft Office, Project

Programming languages: Python, Java, C, Angular

Configuration tools: Puppet, Chef, Ansible, Terraform

Log index/search platforms: Splunk

Scripting: Python, Bash, Shell

Monitoring & data analysis: BCO, Prometheus, Grafana, Nagios, InfluxDB, AppDynamics, ELK, Datadog

CI/CD: Jenkins, GitLab, Bitbucket, Azure DevOps

Network troubleshooting: TCP, DNS, IPv6, and tcpdump

Operating systems: AIX, Linux (RHEL), Windows, Solaris, HP-UX

Databases: Oracle, MySQL, MongoDB, DB2

Web servers: Apache 2.x, iPlanet, and IIS

App servers: WebLogic, WebSphere, Apache Tomcat, JBoss, and GlassFish

Incident Management, Change Management, Problem Management

App Support and Release Management

RabbitMQ messaging, Redis caching, REST API services

Chaos Engineering

Certification

Microsoft Certified Technology Specialist on Windows Server 2008 Enterprise network infrastructure (MCTS: SR6257517)

Onsite Experience

Onsite work experience in Dubai, Abu Dhabi, Colombo, and Singapore.
Frequently travelled to England and Germany for project work.

Award/Appreciation

Received Star Award and Spot Award from Danske IT for the successful implementation of SRE/DevOps best practices.

Received High-Flyer Award for efforts towards decommissioning old N/W, RHEL5, and 5+ year old H/W without any customer impact.
Received DREAM SQUADRON AWARD from Amadeus Labs for making GOPS an inspired, motivated, and high-performance team.
Won the TPE Amadeus division-level Men's Volleyball Tournament in 2017.
Received Special Appreciation Award from the Manager for a significant role in the Stack5, Stack7, and Stack8 migration.
Received appreciation from senior company management for a client presentation on an ecommerce project; handled the technical discussion with the client, who was impressed by the in-depth knowledge of the subject.

Received Best Capacity Planner award at HCL from the client Unilever.

Received Best Team Lead award for transforming the team to an SRE/DevOps mindset.

Received Certificate for Amadeus Systems Security Training completion.

Additional Information

PERSONAL DOSSIER:
Date of Birth: 10 July, 1983
Passport No: Z5888325
Valid Till: Feb 3rd, 2030
Languages Known: English, Hindi and Kannada
