Senior Site Reliability Engineer with 9+ years of experience in observability, monitoring, and alerting within the banking domain. Skilled in implementing Prometheus and Grafana for real-time system insights and in configuring Alertmanager for prompt incident response. Proficient in automating monitoring tools with Ansible and Python, and in collaborating with product teams to develop customized Kibana dashboards. Demonstrated ability to proactively improve monitoring and alerting practices, strengthening infrastructure resilience and scalability. Strong foundation in SLIs, SLOs, and the Software Development Life Cycle, with a deep understanding of system automation tools and a track record of ensuring system reliability and performance.
Prometheus
Grafana
Alertmanager
Kibana
Ansible
Python
AWS
Kubernetes
Docker
Jenkins
Linux