Project: Application Operations | Environment: Production
Description: AppOps is the Application Operations support team at JPL, which manages applications from different product lines. I currently manage 8 applications with a team of 5 members.
Responsibilities:
• Onboard Java-based applications from different product lines into AppOps and stabilize them.
• Assist with the development and implementation of SRE solutions for large-scale distributed web applications across multiple tiers.
• Provide architectural and practical guidance to software development teams to improve resiliency, efficiency, performance, and cost.
• Perform proactive daily system monitoring, including reviewing system and application logs, and respond to, triage, troubleshoot, and remediate incidents.
• Onboard alerting mechanisms across various applications.
• Facilitate and coordinate the end-to-end incident life cycle, including creation, triage, communication, mitigation, resolution, and post-incident review.
• Conduct roadshows across BU and product teams to discuss challenges, clarify doubts, and share best practices for incident management.
• Debug end-to-end issues and resolve them within the SLA.
• Facilitate post-incident review calls, and create and track improvement actions to prevent recurrence and improve the process.
• Facilitate the restoration of all Sev1-3 incidents to normal service operation as quickly as possible, with minimum disruption to BHN infrastructure.
• Create custom and standard templates using HTML, CSS, and JavaScript for different brands.
• Optimize the operating model and build a process around it.
• Create runbooks and knowledge base articles in Confluence.
• Consolidate assignment groups and configuration items in ServiceNow to manage SLAs.
• Collaborate with various teams to ensure uninterrupted availability of all applications.
• Mentor team members supporting other applications so that they can work independently.
• Coordinate with internal support, development, and infrastructure teams on production issues raised by users, and resolve them within the stipulated SLA for multiple Java applications.
• Monitor jobs and fix job abends.
• Play a major role in resolving ad hoc issues and in incident management.
• Continuously monitor core services for availability using Nagios, Icinga, Kibana, Splunk, and PagerDuty.
• Work on process improvements that reduce the team's workload.
• Understand the DataMart and cube refresh process for loading data from the MI database to SQL Server 2012.
• Perform support activities and fixes for regular job abends related to services and servers.
• Prepare runbook documentation for resolved issues.
• Participate in daily internal meetings and weekly meetings with clients.
• Work on and resolve Spark and Elasticsearch issues.
• Good experience with ticketing tools such as SMART, CSM, and ServiceNow, and with monitoring tools.
• Handle issues related to the Nagios alerting tool.
• Troubleshoot Issues: Act quickly to analyze the available data and find the root cause of technical issues in the product. Develop a solution or pass the problem on to other engineering team members, while providing users with progress updates.
• Product Development: Participate in all stages of the product development process, including designing, building, and testing. Work on tools such as internal software to automate key processes, or platforms where customers can send inquiries and reviews.
• Improvements: Deal with product issues firsthand and suggest overall product improvements, including features based on customer requirements. Proactively evaluate engineering processes and recommend ways to increase efficiency.
• Command & Control: Fix production issues within TAT, drive the incident call, and update management regularly without delay.
• Onboarding Applications: Take over support and operations work from the development team, set up a new process, and achieve stabilization in the production environment.
• SRE: Serve as the primary point of contact responsible for the overall health, performance, and capacity of storage, DNS, CDN, and infrastructure systems, targeting 100% site uptime. Automate runbooks to heal the impacted service automatically, with pre- and post-validation steps. Respond within the defined SLA to monitoring alerts, incident tickets, and email requests coming into Service Reliability Engineer Operation. Document resolution runbooks and standard operating procedures.