Monitoring and Incident Management Specialist
Adept at proactively monitoring and managing real-time events and alerts to keep IT systems running smoothly. Skilled in identifying and resolving a range of incidents, from server ping failures and service disruptions to backup failures and disk space issues. Experienced in handling complex scenarios such as cluster node and resource downtime, port scans, snapshots, and storage issues on clusters.
Key Responsibilities:
- Real-Time Event Management: Monitor and respond to critical incidents, including server ping failures, service disruptions, and backup failures.
Acknowledge and resolve issues, escalating to Level 4 support engineers when required.
Process requests promptly, ensuring minimal downtime and efficient problem resolution.
- Database Management: Monitor database-related activities and promptly respond to requests from Application Support and Development teams.
Handle restoration requests and address recovery errors to maintain the integrity of the database.
- Client Interaction: Work directly with clients, as the application vendor, to troubleshoot and resolve technical issues.
Maintain a client satisfaction rate above 99% and consistently keep the issue resolution rate above the 90% target.
- System Health Checks: Perform daily health checks on Windows and Azure systems to ensure optimal performance.
Execute periodic driver/firmware upgrades to enhance system reliability.
- Alert Management and Ticket Reduction: Utilize tools such as ServiceNow, IMM, Lenovo X-Clarity, MegaRAID, vSphere vClient, Everbridge, Tivoli, SCOM, Smarts 9.4, XYMON, and RDC.
Analyze and suppress false alerts to streamline ticket management and reduce overall ticket count.
Technical Proficiency:
- Tools: ServiceNow, IMM, Lenovo X-Clarity, MegaRAID, vSphere vClient, Everbridge, Tivoli, SCOM, Smarts 9.4, XYMON, RDC.
Achievements:
- Consistently maintained client satisfaction above 99%.
- Met and exceeded the 90% issue resolution target.