What is IT Incident Management?
IT incident management is a process within IT service management (ITSM) that aims to restore service operations after an issue is detected and to minimize the effects of that issue on the business and end users.
The incident management process consists of various steps for service restoration, including but not limited to issue detection, incident creation, prioritization / classification, investigation and analysis, remediation, remediation verification, incident closure and post mortem analysis.
Defining, Prioritizing and Classifying an Incident
An IT incident is typically defined as an unplanned interruption or reduction in the quality of an IT service. Although prioritization of incidents can vary from organization to organization, incident priority is typically determined by two factors:
- Impact, or the degree of failure and the number of end users the incident affects.
- Urgency, or the importance of the services affected, relative to the mission of the organization.
Impact and urgency are also influenced by many different factors including:
- Customer or business need
- Financial impact
- Service criticality
- Business risk
- Component failure impact analysis
- Legal requirements
- Et cetera.
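As an illustration of how impact and urgency can be combined into a priority, the sketch below uses one common convention: a 3x3 matrix in which the two ratings are added to give a priority from P1 (most severe) to P5. The `Level` scale, the `priority` function, and the P1 to P5 labels are assumptions made for this example; nothing above prescribes a particular scoring scheme, and each organization would tune its own.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step rating used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a priority label.

    Assumption: the two ratings carry equal weight, so their sum
    (2 for HIGH/HIGH up to 6 for LOW/LOW) maps directly onto P1..P5.
    """
    return f"P{int(impact) + int(urgency) - 1}"


if __name__ == "__main__":
    # Print the full matrix, one row per impact level.
    for impact in Level:
        row = [priority(impact, urgency) for urgency in Level]
        print(f"{impact.name:<6} impact:", row)
```

In this scheme an outage with high impact and high urgency is P1, while a high-impact but low-urgency incident lands at P3, the same as a medium/medium incident.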
Incidents are also typically divided into two buckets:
- Normal Incident – A disruption in the context of a service definition such as an SLA. Most, but not all, normal incidents impact users. Some, such as failures that trigger redundancies, do not impact users directly but should be resolved before they do.
Resolvers work through normal incident management in linear steps: logging, categorization, prioritization, initial diagnosis, escalation, investigation and diagnosis, resolution and recovery, and closure.
- Major Incident – An incident that has a large effect on the organization or is subject to time constraints. Working through major incidents involves case management workflows as opposed to linear steps, meaning hypothesis testing and probing through experimentation.
Systems should be categorized by importance and have SLAs that define how long they can be unavailable before escalation. Impact and urgency determine whether the normal incident or major incident process is followed. When an SLA is exceeded, the organization has run out of time for experimentation and must move on to the IT service continuity / disaster recovery plan.
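The decision described above can be sketched as a small function. This is a minimal illustration, not a prescribed workflow: the `select_process` name, the `is_major` flag (assumed to have been derived already from impact and urgency), and the returned labels are all hypothetical.

```python
from datetime import timedelta


def select_process(is_major: bool, downtime: timedelta, sla_limit: timedelta) -> str:
    """Pick the response process for an incident.

    Assumptions for this sketch:
    - is_major was already decided from impact and urgency.
    - sla_limit is the longest the affected system may be unavailable.
    """
    # Once the SLA is exceeded there is no time left for experimentation.
    if downtime > sla_limit:
        return "service continuity / disaster recovery plan"
    if is_major:
        return "major incident process"
    return "normal incident process"


if __name__ == "__main__":
    limit = timedelta(hours=1)
    print(select_process(False, timedelta(minutes=10), limit))
    print(select_process(True, timedelta(minutes=10), limit))
    print(select_process(True, timedelta(minutes=90), limit))
```

The SLA check comes first on purpose: an exceeded SLA overrides the normal/major distinction.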
Incident Management Roles and Responsibilities
Because the different levels of incident management trigger different processes for response, it is critical to define roles and responsibilities for execution. These roles and responsibilities define who will drive process improvement, report key performance indicators (KPIs), and execute and enforce process workflow. They also define lines of communication between the IT team, the rest of the organization, vendors, and third parties.
To the right is an example of potential roles and responsibilities involved during a major incident. For more information, download our white paper Streamlining the Major Incident Resolution Process: Define, Plan, Staff and Communicate.
Improving IT Incident Management
Minimizing the effects of service interruptions has a direct effect on a business's bottom line. The cost of downtime can be astronomical. According to Everbridge’s State of IT Incident Management Report, the average cost of unplanned downtime is $8,662 US per minute, which represents more than half a million dollars per hour.
While most organizations implement ITSM solutions to help manage incidents throughout the incident management lifecycle, one area that can be optimized is the time it takes to assemble the IT response team, also known as the response process. In the event of an IT incident, our research has shown that it takes IT organizations an average of 27 minutes, maxing out at 150 minutes in some cases, to assemble the response team. By automating this process, organizations are able to engage the response team in 5 minutes or less, minimizing the negative impacts of an incident and reducing the overall incident management timespan.
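To put these figures side by side, the short calculation below multiplies the reported per-minute cost by the reported assembly times. It rests on two simplifying assumptions that do not come from the report: the service is down for the entire time it takes to assemble the team, and cost accrues linearly per minute. The constant and function names are illustrative only.

```python
# Figures quoted in the text above.
COST_PER_MINUTE_USD = 8_662      # average cost of unplanned downtime
AVG_ASSEMBLY_MIN = 27            # average time to assemble the response team
MAX_ASSEMBLY_MIN = 150           # longest reported assembly time
AUTOMATED_ASSEMBLY_MIN = 5       # upper bound quoted for automated engagement


def downtime_cost(minutes: int) -> int:
    """Cost of a given number of downtime minutes, assuming linear accrual."""
    return minutes * COST_PER_MINUTE_USD


hourly_cost = downtime_cost(60)
avg_saving = downtime_cost(AVG_ASSEMBLY_MIN - AUTOMATED_ASSEMBLY_MIN)
max_saving = downtime_cost(MAX_ASSEMBLY_MIN - AUTOMATED_ASSEMBLY_MIN)

if __name__ == "__main__":
    print(f"Cost per hour of downtime:      ${hourly_cost:,}")   # $519,720
    print(f"Avoided cost, average case:     ${avg_saving:,}")    # $190,564
    print(f"Avoided cost, slowest case:     ${max_saving:,}")    # $1,255,990
```

Under those assumptions, cutting assembly time from 27 minutes to 5 avoids 22 minutes of downtime, or roughly $190,000 per incident.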
In the midst of a crisis, there is no time to hunt for the right people or write the message language. IT Service Alerting tools, as defined by Gartner, can reduce "mean time to respond" by automating the manual steps typically associated with the response process.
Support staff responsible for managing a critical incident should have the ability to:
- Contact the appropriate teams for any given incident
- Contact those who are on call without hunting for the information
- Immediately start a technical conference bridge with the right people
- Inform the stakeholders and business management
- Notify key customers and impacted users
- Send messaging that is compliant with HIPAA and other regulations
- Apply remedial actions according to predetermined, automated workflows
How Everbridge Helps
In the event of an IT issue, Everbridge IT Alerting quickly connects the right on-call personnel with the right information using phone, email, SMS and mobile app alerts. Rules-based automation, dynamic on-call scheduling and automatic escalation ensure that someone will respond and take ownership of the incident, regardless of time, day, location or device. And if an issue requires the whole IT response team, joining a conference bridge is only one click away.
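For readers unfamiliar with automatic escalation, the generic sketch below shows the basic idea: try each on-call responder in order, on each of their channels, until someone acknowledges. It is not the Everbridge API or product logic; `Responder`, `escalate`, and the `notify` callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Responder:
    """Hypothetical on-call entry: a name plus channels in preferred order."""
    name: str
    channels: Tuple[str, ...]  # e.g. ("sms", "phone")


def escalate(
    on_call: Sequence[Responder],
    notify: Callable[[Responder, str], bool],
) -> Optional[Responder]:
    """Return the first responder who acknowledges, or None if nobody does.

    notify(responder, channel) is assumed to send the alert and return
    True only if the responder acknowledged it on that channel.
    """
    for responder in on_call:
        for channel in responder.channels:
            if notify(responder, channel):
                return responder  # ownership taken, stop escalating
    return None


if __name__ == "__main__":
    chain = [
        Responder("primary", ("sms", "phone")),
        Responder("secondary", ("phone",)),
    ]

    # Stand-in for a real alerting backend: only the secondary answers.
    def fake_notify(responder: Responder, channel: str) -> bool:
        print(f"alerting {responder.name} via {channel}")
        return responder.name == "secondary"

    owner = escalate(chain, fake_notify)
    print("owner:", owner.name if owner else "nobody acknowledged")
```

A production tool would also wait for a timeout between attempts and record each attempt for later review; those parts are omitted here to keep the sketch short.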