PROJECT : Service Management Operations & Delivery
Designation : Senior Systems Engineer
Role : Senior Problem Manager (ITIL Service Management)
Domain : Infrastructure support
Client : CANADA
Responsibilities
- Implement processes, procedures, and approaches that give customers clear and concise feedback and demonstrate compliance with contracted service agreements.
- Responsible for smooth transition/setup of Service Management projects.
- Performing proactive/reactive Capacity Management through daily checks and trend analysis to ensure that sufficient capacity (in terms of disk space) is available for business.
- Identify and assist implementation of initiatives to improve resource usage.
- Identify obsolete operating systems and hardware in existing environment and provide recommendations to client to upgrade/replace them, in order to ensure smooth operation.
- Reviewing and approving RCA documents prepared by Problem Management team for completeness and effectiveness before being sent to Delivery Managers for final approval.
- Ensuring timely submission of RCA documents to Client within specified timelines.
- Liaise with different support teams and stakeholders for progress of aging Problem Tickets and ensure sufficient focus is placed on these tickets.
- Reviewing weekly KPI reports with management to identify focus areas and define action plans for them.
- Preparing and presenting Monthly Problem and Incident Management Reports to Client and Management.
- Identify opportunities for process or service improvements in order to improve stability of existing system.
- Auditing Problem Tickets on regular basis to ensure that process is followed diligently.
- Management of escalations within team.
- Managing staffing to ensure sufficient availability to manage operations.
- Provide training and guidance to team on processes and techniques used.
- Preparing Ad-hoc reports, based on requests from Management.
- Setting goals for team members and keeping members updated on progress of these goals over assessment year.
- Ensure timely submission and approval of timesheet and shift allowance of team members.
- To engage and work with relevant support teams to investigate root cause of Major Incidents.
- To work with support teams to formulate preventative measures/recommendations based on findings from investigations.
- Create Incident Review Report to identify process gaps that may have caused delay in resolution of Incidents.
- To document findings/action items resulting from Incident review and Root cause investigations.
- To follow up on action items for each problem and drive towards closure.
- To ensure that RCA documents are published to client as per quality and SLA defined by process.
- As part of proactive problem management, perform tower-wise Infrastructure Incident Analysis (for all priorities) on weekly and monthly basis.
- To prepare presentation data from analyzed incident records and use it to build the deck/presentation published to client for respective infra towers on monthly basis.
- Identify trends from incident analysis (Weekly/Monthly/Quarterly) to determine recurring issues in environment, and open problem records as required. Follow up with support teams to formulate preventative & corrective measures to ensure action items are implemented in timely fashion.
- To take part in CAB meetings and MOP review calls for solutions/changes identified as part of Problem Investigation.
- To identify Service Improvement opportunities within existing system, formulate task force and action plans, and to ensure those recommendations are implemented in timely fashion.
Service Improvements/Key Projects
- Setting up automation of various day-to-day activities of Support teams:
Worked with Automation team to develop scripts that reduced the incident workload of support teams. Scripts to clear up disk space and pull disk space trends were set up in order to reduce overall number of incidents.
- Formulating generic set of Server Recommendations:
Worked with various server support teams to identify list of generic server recommendations, which were then shared with client for implementation. Once implemented, these recommendations led to major increase in overall stability of existing environment.
- Set up Capacity planning for Disk Space:
Worked with server support teams to set up Capacity Management process for Disk Space, in order to ensure that sufficient capacity is available for business. Multiple health checks were set up to run daily to check usage and trending, and countermeasures were put in place accordingly.
Worked with server support teams to identify servers in existing environment that had become unstable due to high uptime. Regular reboots were then scheduled for the identified servers in order to improve stability.
Identified servers that were running with non-standard system drive size and worked with clients and support teams to expand drives accordingly. This reduced overall number of disk space incidents on these servers.