Mahesh Kale - Site Reliability Engineer (SRE) - FIS Global

Summary

Experienced Site Reliability Engineer with 5+ years managing and optimizing digital banking applications. Proficient in Linux, AWS, networking, and DevOps tools. Skilled in applying SRE principles to improve system reliability, availability, and observability. Expertise in ITIL processes, incident management, root cause analysis, proactive monitoring, and performance tuning. Effective collaborator across cross-functional teams, driving operational excellence and seamless user experiences in regulated financial environments.

Overview

5 years of professional experience

Work History

Site Reliability Engineer (SRE)

FIS Global

Pune, Maharashtra

03.2021 - Current

Overview: 5+ years of hands-on experience in the US banking domain, specializing in Retail and Digital Banking applications. Expertise in application operations, automation, change and release management, and system reliability. Skilled in troubleshooting, monitoring design, SLA reporting, and client communications. Adept at ensuring high availability and performance of critical banking services in a fast-paced and regulated environment.
System Maintenance: Performed regular maintenance and health checks on Linux (RHEL 9) and Windows servers, ensuring optimal performance, patch compliance, and system availability.
System Monitoring: Utilized monitoring tools such as Splunk and Dynatrace to continuously track system health, performance, and uptime, enabling proactive issue detection and resolution.
Implemented SRE best practices by defining SLAs, SLIs, and SLOs, and building reliability dashboards to monitor and improve system performance and availability.
Automation: Developed Shell scripts to automate repetitive operational tasks, improving efficiency, reducing manual errors, and accelerating incident resolution times.
Managed 24/7 production support, promptly resolving P1/P2 incidents with clear Root Cause Analysis (RCA) and effective follow-up to prevent recurrence.
Led incident resolution efforts during major outages, coordinating cross-team collaboration and performing detailed Root Cause Analyses (RCAs) to prevent recurrence.
Logging & Monitoring: Implemented end-to-end monitoring and log analysis for applications and servers using Splunk, Dynatrace, and Tryambak, ensuring high service availability, faster incident detection, and enhanced system observability and reliability.
Managed Linux-based infrastructure, including user access control, file system management, and system patching to maintain security and stability.
Performance Tuning: Continuously monitored and tuned Linux and Windows servers to optimize performance by adjusting system parameters and resource allocation as needed.
Incident Response: Responded promptly to system incidents and alerts, ensuring timely and effective resolution to minimize downtime and impact on services.
Incident Management: Managed production incidents by promptly addressing user and client tickets. Coordinated with cross-functional teams to assess issues, investigate root causes, and implement timely resolutions, ensuring minimal service disruption and high customer satisfaction.
Weekend Maintenance: Executed weekend maintenance tasks including patch management and deployment, application failover from primary to secondary data centers when required, and traffic routing between clusters using Control-M jobs. Also handled branding activities to ensure consistency across environments.
Change Management: Participated in the change management process and CAB meetings to ensure smooth deployment of system updates while minimizing risks and maintaining operational stability.
Day-to-Day Issue Research: Investigated tickets from client support, end users, development teams, and other internal groups. Opened and documented tickets for newly discovered issues through monitoring and daily operations. Collaborated with internal teams to research problems and identify defects.
Performance Tuning: Tuned Linux systems to improve performance and response times by analyzing metrics and adjusting system parameters as needed.
Services Management: Managed system services by starting, stopping, and restarting using systemd, init.d, or service commands to ensure continuous service availability.
System Administration: Experienced in user management, disk partitioning, LVM (Logical Volume Management), configuring Cron jobs, managing runlevels, and working with systemd for service and system management.
Managed user accounts, groups, and sudoers configurations, and performed OS-level security hardening to ensure system security and compliance.
Troubleshot system performance issues, network latency, and service downtime to restore optimal operations promptly.
Managed daily operational jobs, batch processing workflows, and user access requests to ensure smooth and secure system functioning.
Collaborated closely with development teams to design and implement resilient architectures that enhance system reliability and scalability.
Documentation: Created and maintained comprehensive documentation for system processes, configurations, and troubleshooting procedures to ensure knowledge sharing and operational consistency.
Training: Delivered training sessions to staff on system usage, security best practices, SRE principles, and troubleshooting techniques to enhance team capability and operational efficiency.
AWS: Worked with AWS cloud services, including EC2, S3, VPC, IAM, RDS, Route 53, CloudWatch, and Lambda.
Configured VPCs, subnets, route tables, security groups, and network ACLs to support secure network environments.
Deployed and managed virtual servers using Amazon EC2 and implemented object storage solutions with Amazon S3.
Created and managed relational databases using Amazon RDS, including backup and recovery configurations.
Implemented identity and access management policies using AWS IAM roles, users, and permissions.
Applied concepts of Elastic Load Balancing and Auto Scaling to improve application availability and fault tolerance.

Education

Bachelor of Engineering Technology - Electronics And Telecommunication Engineering

Sinhgad Institute

Pune, India

06.2020

Skills

Linux Administration: 5 years managing & troubleshooting Linux (RHEL) servers

AWS: 3 years with EC2, S3, CloudWatch, IAM, basic networking

Dynatrace: 4 years in APM and root cause diagnostics

Splunk: 5 years in log monitoring, alerting, dashboard creation

Docker: 2 years containerization and environment management

Control-M: 4 years batch scheduling, monitoring, troubleshooting

ServiceNow: 4 years incident, change, and problem management

Windows: 4 years in server and user access management

Jenkins: 4 years in CI/CD pipeline configuration and deployment troubleshooting

Git: 2 years version control and collaboration

SRE Practices: SLAs, SLIs, SLOs, Error Budget, Incident Management, RCA

Timeline

Site Reliability Engineer (SRE)

FIS Global

03.2021 - Current

Bachelor of Engineering Technology - Electronics And Telecommunication Engineering

Sinhgad Institute