IT incident management is a process within IT service management (ITSM) that aims to restore service operations after an issue is detected and to minimize its effects on the business and end users.
The incident management process consists of several steps for service restoration, including issue detection, incident creation, prioritization and classification, investigation and analysis, remediation, remediation verification, incident closure, and post-mortem analysis.
An IT incident is typically defined as an unplanned interruption or reduction in the quality of an IT service. Although prioritization of incidents can vary from organization to organization, incident priority is typically determined by two factors: impact and urgency.
Impact and urgency are also influenced by many different factors including:
Incidents are also typically divided into two buckets: normal incidents and major incidents.
Systems should be categorized by importance and have SLAs that define how long they can be unavailable before escalation. Impact and urgency determine whether the normal incident process or the major incident process is followed. When SLAs are exceeded, the organization has run out of time for experimentation and must move on to the IT service continuity / disaster recovery plan.
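The routing logic described above can be illustrated with a minimal sketch. It is not a prescribed method: the three-level scale, the additive scoring, the rule that only priority 1 triggers the major incident process, and the names `Level`, `priority`, and `process_for` are all assumptions made for illustration. Each organization defines its own matrix and SLA limits.

```python
from datetime import timedelta
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (most critical) to 5 (least critical).

    Assumed scoring: the higher the combined impact and urgency,
    the lower (more critical) the priority number.
    """
    return 7 - (impact + urgency)


def process_for(impact: Level, urgency: Level,
                downtime: timedelta, sla_limit: timedelta) -> str:
    """Pick which process applies to an incident.

    Assumption: an exceeded SLA always overrides the priority-based choice,
    and only priority 1 counts as a major incident.
    """
    if downtime > sla_limit:
        return "service continuity / disaster recovery plan"
    if priority(impact, urgency) == 1:
        return "major incident process"
    return "normal incident process"


# Example: a high-impact, high-urgency outage that is still inside its SLA.
print(process_for(Level.HIGH, Level.HIGH,
                  downtime=timedelta(minutes=20),
                  sla_limit=timedelta(hours=1)))
```

Checking the SLA first reflects the point that, once the allowed downtime has passed, the choice between normal and major handling no longer matters.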
Because the different levels of incident management trigger different processes for response, it is critical to define roles and responsibilities for execution. These roles and responsibilities define who will drive process improvement, report key performance indicators (KPIs), and execute and enforce process workflow. They also define lines of communication between the IT team, the rest of the organization, vendors, and third parties.
To the right is an example of the potential roles and responsibilities involved during a major incident. For more information, download our white paper, "Streamlining the Major Incident Resolution Process: Define, Plan, Staff and Communicate."
Minimizing the effects of service interruptions has a direct effect on a business’s bottom line, because the cost of downtime can be astronomical. According to Everbridge’s State of IT Incident Management Report, the average cost of unplanned downtime is US$8,662 per minute, which amounts to more than half a million dollars per hour.
While most organizations implement ITSM solutions to help manage incidents throughout the incident management lifecycle, one area that can be optimized is the time it takes to assemble the IT response team, also known as the response process. In the event of an IT incident, our research has shown that it takes IT organizations an average of 27 minutes, maxing out at 150 minutes in some cases, to assemble the response team. By automating this process, organizations are able to engage the response team in 5 minutes or less, minimizing the negative impacts of an incident and reducing the overall incident management timespan.
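To make these figures concrete, the sketch below works through the arithmetic using the numbers quoted in this article. It assumes, purely for illustration, that every minute spent assembling the response team costs the full average per-minute downtime figure; real incidents will vary. The function and constant names are hypothetical.

```python
# Figures quoted in this article.
COST_PER_MINUTE_USD = 8_662      # average cost of unplanned downtime
AVG_ASSEMBLY_MINUTES = 27        # average time to assemble the response team
WORST_ASSEMBLY_MINUTES = 150     # longest reported assembly time
AUTOMATED_ASSEMBLY_MINUTES = 5   # upper bound with an automated process


def downtime_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Cost of an outage lasting `minutes`, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


def assembly_savings(manual_minutes: int,
                     automated_minutes: int = AUTOMATED_ASSEMBLY_MINUTES) -> int:
    """Estimated savings from shortening team assembly time.

    Assumption: each minute of assembly time is a minute of downtime.
    """
    return downtime_cost(manual_minutes - automated_minutes)


print(f"Cost per hour:       ${downtime_cost(60):,}")                         # $519,720
print(f"Average-case saving: ${assembly_savings(AVG_ASSEMBLY_MINUTES):,}")    # $190,564
print(f"Worst-case saving:   ${assembly_savings(WORST_ASSEMBLY_MINUTES):,}")  # $1,255,990
```

Under that assumption, cutting assembly time from the 27-minute average to 5 minutes saves 22 minutes of downtime per incident.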
In the midst of a crisis, there is no time to hunt for the right people or draft the message. IT Service Alerting tools, as defined by Gartner, can reduce “mean time to respond” by automating the manual steps typically involved in the response process.
Support staff responsible for managing a critical incident should have the ability to:
Learn what IT professionals have to say about the state of incident management, and what challenges they face, especially when it comes to managing and responding to major IT incidents. This original research explores the tools, processes, and costs associated with incidents, as well as the most likely causes for downtime and outages.