
IT Incident Management


What is IT Incident Management?

IT incident management is a process within IT service management (ITSM) that aims to restore service operations after an issue is detected and to minimize the issue's effects on the business and its end users.

The incident management process consists of several steps for service restoration, including but not limited to issue detection, incident creation, prioritization and classification, investigation and analysis, remediation, remediation verification, incident closure, and post-mortem analysis.

Defining, Prioritizing and Classifying an Incident

An IT incident is typically defined as an unplanned interruption or reduction in the quality of an IT service. Although prioritization of incidents can vary from organization to organization, incident priority is typically determined by two factors:

  • Impact, or the degree of failure and the number of end users the incident affects.
  • Urgency, or the importance of the services affected, relative to the mission of the organization.
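
One common way to combine these two factors is a lookup matrix. The sketch below is illustrative only: the three-level scale, the `Level` and `priority` names, and the specific priority values are assumptions, since each organization defines its own model.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: (impact, urgency) -> priority, where 1 is most severe.
# The values are an assumption for illustration, not a prescribed standard.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.LOW, Level.HIGH): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}


def priority(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


if __name__ == "__main__":
    print(priority(Level.HIGH, Level.MEDIUM))
```

Keeping the mapping in a table rather than in nested conditionals makes it easy for an organization to review and adjust the model without touching the lookup logic.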

Impact and urgency are also influenced by many different factors including:

  • Customer or business need
  • Financial impact
  • Service criticality
  • Business risk
  • Component failure impact analysis
  • Legal requirements
 


Incidents are also typically divided into two buckets:

  • Normal Incidents: A disruption in the context of a service definition such as an SLA. Most, but not all, impact users. Some normal incidents, such as failures that trigger redundancies, do not impact users directly but should be resolved before they do.
    Resolvers work through normal incident management in linear steps: logging, categorization, prioritization, initial diagnosis, escalation, investigation and diagnosis, resolution and recovery, and closure.
  • Major Incidents: Incidents that have a large effect on the organization, or that carry time constraints, are defined as major incidents. Working through major incidents involves case management workflows rather than linear steps, meaning hypothesis testing and probing through experimentation.

Systems should be categorized by importance and have SLAs that define how long they can be unavailable before escalation. Impact and urgency determine whether the normal incident or major incident process is followed. When an SLA is exceeded, the organization has run out of time for experimentation and must move on to the IT service continuity / disaster recovery plan.
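
The decision described above can be sketched as a small function. This is a minimal illustration under stated assumptions: the `Incident` fields, the service names, and the SLA durations in `MAX_DOWNTIME` are all hypothetical, and a real ITSM tool would hold this configuration itself.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    service: str
    is_major: bool       # outcome of the impact and urgency assessment
    downtime: timedelta  # how long the service has been unavailable


# Hypothetical SLA table: maximum tolerated downtime per service.
MAX_DOWNTIME = {
    "payments": timedelta(minutes=30),
    "intranet-wiki": timedelta(hours=8),
}


def select_process(incident: Incident, sla: dict = MAX_DOWNTIME) -> str:
    """Pick the process to follow for an incident."""
    # An exceeded SLA overrides everything: no time left to experiment.
    if incident.downtime > sla[incident.service]:
        return "service continuity / disaster recovery"
    if incident.is_major:
        return "major incident"
    return "normal incident"


if __name__ == "__main__":
    outage = Incident("payments", is_major=True, downtime=timedelta(minutes=45))
    print(select_process(outage))
```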

Incident Management Roles and Responsibilities

Because the different levels of incident management trigger different processes for response, it is critical to define roles and responsibilities for execution. These roles and responsibilities define who will drive process improvement, report key performance indicators (KPIs), and execute and enforce process workflow. They also define lines of communication between the IT team, the rest of the organization, vendors, and third parties.

To the right is an example of the potential roles and responsibilities involved during a major incident. For more information, download our white paper, Streamlining the Major Incident Resolution Process: Define, Plan, Staff and Communicate.

Improving IT Incident Management

Minimizing the effects of service interruptions has a direct effect on a business’s bottom line, because the cost of downtime can be astronomical. According to Everbridge’s State of IT Incident Management Report, unplanned downtime costs an average of $8,662 USD per minute, which is more than half a million dollars per hour.
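
The per-hour figure follows directly from the per-minute average. A quick check, where the function name `downtime_cost_usd` is only illustrative:

```python
COST_PER_MINUTE_USD = 8_662  # average reported in the text above


def downtime_cost_usd(minutes: int) -> int:
    """Estimated cost of an outage lasting the given number of minutes."""
    return minutes * COST_PER_MINUTE_USD


if __name__ == "__main__":
    # One hour of unplanned downtime at the average rate.
    print(f"${downtime_cost_usd(60):,} per hour")  # $519,720 per hour
```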

While most organizations implement ITSM solutions to help manage incidents throughout the incident management lifecycle, one area that can be optimized is the time it takes to assemble the IT response team, also known as the response process. In the event of an IT incident, our research has shown that it takes IT organizations an average of 27 minutes, maxing out at 150 minutes in some cases, to assemble the response team. By automating this process, organizations are able to engage the response team in 5 minutes or less, minimizing the negative impacts of an incident and reducing the overall incident management timespan.
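
To put the assembly-time figures in perspective, the sketch below estimates the minutes saved by automation. The cost estimate is a rough illustration only: it assumes the service is down for the whole assembly period and that downtime accrues at the average $8,662 per minute cited earlier, neither of which will hold for every incident.

```python
AVG_MANUAL_MINUTES = 27      # average time to assemble the team manually
MAX_MANUAL_MINUTES = 150     # worst case reported
AUTOMATED_MINUTES = 5        # upper bound with automation ("5 minutes or less")
COST_PER_MINUTE_USD = 8_662  # assumption: average downtime cost applies here


def minutes_saved(manual_minutes: int, automated_minutes: int = AUTOMATED_MINUTES) -> int:
    """Minimum minutes saved by automating team assembly."""
    return max(manual_minutes - automated_minutes, 0)


def estimated_savings_usd(manual_minutes: int) -> int:
    """Rough savings estimate under the assumptions stated above."""
    return minutes_saved(manual_minutes) * COST_PER_MINUTE_USD


if __name__ == "__main__":
    for label, manual in [("average", AVG_MANUAL_MINUTES), ("worst case", MAX_MANUAL_MINUTES)]:
        print(f"{label}: at least {minutes_saved(manual)} min saved, "
              f"roughly ${estimated_savings_usd(manual):,}")
```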

In the midst of a crisis, there is no time to hunt for the right people or write the message language. IT Service Alerting tools, as defined by Gartner, can reduce “mean time to respond” by automating the manual steps typically associated with the response process.

Support staff responsible for managing a critical incident should have the ability to:

  • Contact the appropriate teams for any given incident
  • Contact those who are on call without hunting for the information
  • Immediately start a technical conference bridge with the right people
  • Inform the stakeholders and business management
  • Notify key customers and impacted users
  • Send messaging that is compliant with HIPAA and other regulations
  • Apply remedial actions according to predetermined, automated workflows
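
Several of these capabilities come down to looking up who to contact and sending the right message to each audience. The following is a minimal sketch, not a description of any particular alerting product: the roster, routing table, addresses, and the `send` callback are all hypothetical placeholders.

```python
from typing import Callable, List

# Hypothetical on-call roster: team -> person currently on call.
ON_CALL = {"network": "alice@example.com", "database": "bob@example.com"}

# Hypothetical routing rules: incident category -> teams to engage.
ROUTING = {"network-outage": ["network"], "db-latency": ["database", "network"]}

# Hypothetical stakeholders who are informed but not asked to act.
STAKEHOLDERS = ["it-director@example.com"]


def engage_response_team(
    category: str, summary: str, send: Callable[[str, str], None]
) -> List[str]:
    """Notify on-call responders and stakeholders; return the responders."""
    responders = [ON_CALL[team] for team in ROUTING[category]]
    for person in responders:
        send(person, f"[ACTION] {summary}: join the conference bridge")
    for person in STAKEHOLDERS:
        send(person, f"[INFO] {summary}: response team engaged")
    return responders


if __name__ == "__main__":
    # Print instead of sending real messages.
    engage_response_team("db-latency", "Checkout latency", lambda to, msg: print(to, msg))
```

Passing `send` in as a parameter keeps the lookup logic separate from the delivery channel, so the same routing could drive email, SMS, or a paging service.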

The 2017 State of IT Incident Management

Incident Management Related Costs

Learn what IT professionals have to say about the state of incident management, and what challenges they face, especially when it comes to managing and responding to major IT incidents. This original research explores the tools, processes, and costs associated with incidents, as well as the most likely causes for downtime and outages.

  • Companies are heavily invested in ITSM systems
  • The majority of companies use MTTR to track incident resolution
  • IT incidents cost organizations an average of $8,662 USD per minute
  • IT incidents impact employee productivity and IT teams the most
  • Network outages are the most reported cause of IT incidents
