Objective:
Results-oriented and dedicated Staff Engineer with expertise in Site Reliability Engineering (SRE), seeking a challenging role where I can utilize my skills and experience to design and manage highly available, scalable, and reliable systems for optimal performance and user experience.
Summary of Qualifications:
- 13+ years of progressive experience in Site Reliability Engineering, with a proven track record of leading successful teams and delivering impactful results.
- In-depth knowledge of building and managing large-scale, distributed systems and cloud-based architectures.
- Proficient in implementing DevOps practices, continuous integration and deployment (CI/CD), and infrastructure as code (IaC).
- Excellent leadership and team management skills, with a focus on fostering collaboration, professional growth, and a high-performance culture.
- Extensive knowledge of building, scaling, and maintaining distributed systems and cloud-based architectures.
- Strong problem-solving and troubleshooting skills with a proactive approach to identifying and resolving complex technical issues.
- Expertise in performance tuning, monitoring, and capacity planning to optimize system performance and reliability.
- Skilled in programming and scripting languages such as Python, Golang, and Bash.
- Familiarity with containerization technologies like Docker and orchestration frameworks like Kubernetes.
- Experience with private cloud and public cloud platforms (AWS, Azure, and Google Cloud Platform) and their services and tooling.
- Effective cross-functional collaborator, demonstrated through successful engagements across teams.
- Strong communication skills, with the ability to convey technical concepts effectively to both technical and non-technical stakeholders.
Professional Experience:
- Led a team of 6 SREs responsible for designing, building, and maintaining highly reliable and scalable infrastructure for Rakuten Travel's critical services.
- Developed and executed strategic initiatives to improve operational efficiency, system reliability, and performance.
- Conducted performance reviews, provided mentorship and coaching to team members, and fostered a culture of continuous learning and professional growth.
- Actively participated in the recruitment and hiring process, ensuring the team was staffed with top talent.
- Designed and managed scalable and highly available infrastructure on private and public cloud platforms (e.g., AWS, GCP), ensuring the reliability and performance of critical applications.
- Implemented infrastructure as code (IaC) using tools such as Terraform and Ansible/Chef, enabling efficient and consistent deployments.
- Developed and maintained robust monitoring and alerting systems, ensuring proactive identification and resolution of system issues.
- Conducted performance testing, capacity planning, and optimization efforts to improve system efficiency and resource utilization.
- Collaborated closely with development teams to optimize application architectures for scalability, fault tolerance, and efficient operation.
- Participated in incident response and on-call rotations, effectively troubleshooting and resolving critical incidents.
References:
Available upon request
Worked as a Site Reliability Engineer
Programming Languages: Python, Java, Bash, SQL, Golang
Oracle Cloud Infrastructure Certified Associate
Oracle Cloud Infrastructure Autonomous Database Certified Specialist