

Senior Site Reliability Engineer with 6+ years of experience supporting and scaling high-availability, mission-critical systems in fintech and enterprise environments. Strong background in incident management, automation, CI/CD, monitoring, and cloud-native platforms. Proven ability to improve system reliability through SLI/SLO-driven observability and blameless postmortems while operating under peak traffic and regulatory constraints. Passionate about building resilient systems that balance reliability, velocity, and operational excellence.
• Owned reliability and availability of mission-critical, high-throughput payment systems, consistently maintaining 99.999%+ availability in a regulated fintech environment.
• Defined, monitored, and continuously improved SLIs, SLOs, and SLAs using Splunk and Grafana, enabling proactive detection of anomalies and faster incident response.
• Led incident management for Sev-1 and Sev-2 production incidents, conducting blameless postmortems and root cause analysis (RCA) to prevent recurrence and improve system resilience.
• Leveraged GenAI to summarize incidents, logs, and RCA findings, accelerating post-incident reviews and stakeholder communication.
• Designed and maintained CI/CD pipelines using Jenkins, enabling automated deployments, security remediations, and configuration changes, reducing manual intervention and human error by ~30%.
• Implemented zero-touch patching for approximately 80% of application servers by automating patch validation, dry runs, and rollout strategies, significantly improving security posture and operational efficiency.
• Executed and supervised high-risk production changes during peak transaction periods, ensuring system stability under extreme load and staying within error budgets.
• Collaborated closely with development, security, and platform teams to implement scalable and fault-tolerant architectures aligned with reliability best practices.
• Supported capacity planning and performance monitoring by tracking transactions-per-second (TPS) metrics and system behavior under peak traffic conditions.
• Worked extensively with Kubernetes and Linux-based environments to improve application availability, scalability, and deployment reliability.
• Leveraged AWS cloud infrastructure to support production workloads, ensuring secure and reliable operations across environments.
• Integrated operational data into Power BI dashboards for monthly and quarterly reliability, availability, and performance reporting for internal stakeholders and clients.
• Utilized ServiceNow for incident, change, and problem management, ensuring compliance with organizational change management and audit requirements.
• Strengthened system security through secure configurations, controlled access, and collaboration with security teams on remediation efforts.
• Supported reliability and operational stability of multiple enterprise applications, transitioning from traditional application support to platform and reliability engineering responsibilities.
• Administered and migrated large-scale SharePoint environments, automating data migration workflows using Python scripts and ShareGate, reducing manual effort and migration risk.
• Built and managed 70+ SharePoint application sites, ensuring high availability, access control, and operational consistency.
• Deployed and configured Red Hat Linux servers on Microsoft Azure, including OS hardening, package management, and environment setup for production workloads.
• Installed, upgraded, and supported Java-based applications on Linux servers, including dependency management, SSL certificate configuration, and secure zone deployments.
• Implemented secure access controls using CyberArk, improving credential management and reducing security exposure.
• Acted as the primary point of contact for multiple production applications, handling incidents, troubleshooting, and coordination with development teams.
• Supported incident resolution and change implementations, ensuring minimal downtime and adherence to operational best practices.
• Collaborated with clients during requirements analysis, deployment, and production support, gaining early exposure to reliability, scalability, and system design considerations.
• Assisted in onboarding new projects and setting up operations at new locations, contributing to process standardization and operational readiness.