I recently had the pleasure of sitting down with Don Tennant of IT Business Edge to discuss some IT Incident Response best practices as part of his recent piece, “Step 1 in IT Incident Resolution: Make Sure the Right People Know About the Problem.” In Don’s words, “we can all agree that when an IT organization experiences a system disruption or outage, the sooner the right people know about it, the sooner the issue will be resolved, and the better the outcome for the business.”
We spoke at length about the different types of IT incidents that organizations need to prepare for, the many challenges they face during the resolution process and, of course, the importance of minimizing downtime and reducing the mean time to know (MTTK) through effective IT Service Alerting solutions. Here is Part 1 of the full conversation:
- Why is effective use of an automated alerting and escalation solution such an important asset for CIOs and IT departments?
Today, for most companies across nearly all industries, IT does more than just support the business. In some cases, IT is the business. IT issues therefore very often translate instantly into business issues such as revenue loss, a drop in employee productivity or damage to the brand image. That makes it more important than ever to resolve IT issues as quickly as possible and limit the negative impact on the business. IT departments consistently have to deal with incidents of all severity levels, from minor service disruptions to total outages.
In these critical situations where every minute counts, the IT department needs to be able to reach the on-call IT specialists, whether they are on site or remote, so they can resolve the issue and restore the service quickly. Automated IT communications, alerting and escalation solutions play an important role in restoring services faster by connecting the right on-call people with the right information.
Having the right critical communications solution in place also enables IT departments to immediately alert the “need to know” people, including the CIO and sometimes customers, via text, phone, email, etc., and to ensure the necessary next steps are taken quickly and efficiently to mitigate the issues.
- What are some of the most prevalent and harmful incidents that IT departments should be prepared to deal with at any point?
While there’s a nearly infinite number of potential incidents IT departments may have to face, the most critical ones are those that impact a company’s core business. E-commerce companies, for example, can’t afford to have their website down even for a couple of minutes. The same is true of citizen-, resident- or student-facing e-services, which are now the preferred way to interact with public administrations. We find similar examples in industry: when the IT supporting the inventory application, supply chain, shipping or invoicing has a problem, employee productivity is at risk. Incidents come in different severities, but they should all be appropriately classified, prioritized and addressed by the right IT teams to improve IT incident response.
- What are some reasons why an IT incident may not be resolved quickly enough?
There can be several reasons for that. Obviously, if the IT department does not have any kind of monitoring in place, it will most likely hear about the IT issue when it’s already too late, i.e., when users start calling the service desk to complain. In these circumstances, you see the IT teams play catch-up, invite all kinds of people to unorganized war rooms and stay in constant firefighting mode. In some cases, IT departments have all the tools they need, but because those tools sit on disparate systems that are not integrated, the resulting workflows are inefficient. Another reason is alert fatigue, which sets in when everyone on the IT teams receives all the alerts generated by all the monitoring tools with the same sense of urgency. Again, this does not contribute to resolving issues quickly. Now, from a communications standpoint, we’ve seen many instances where IT communication methods overlap and collaboration mechanisms are lacking. Inefficient processes like these can result in long wait times while trying to find and connect with the right person for the job. Things can get even more complicated for a global company that needs to reach stakeholders located in many countries and across time zones. Inefficient and obsolete processes and systems cause significant delays and ultimately inhibit the ability to reach employees, vendors and/or customers.
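To make the alert fatigue point concrete, here is a minimal sketch of routing by team and severity instead of broadcasting every alert to everyone. The severity names, the paging threshold and the `recipients` function are assumptions made up for illustration; they do not describe any particular alerting product.

```python
from dataclasses import dataclass

# Hypothetical severity scale, for illustration only.
SEVERITY_ORDER = {"minor": 0, "major": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    source: str    # monitoring tool that raised the alert
    team: str      # team that owns the affected service
    severity: str  # one of the keys in SEVERITY_ORDER


def recipients(alert, on_call, page_threshold="major"):
    """Return (channel, people) for one alert.

    Only the owning team's on-call people are contacted, and only
    alerts at or above the threshold interrupt them with a page.
    Everything else goes to a lower-urgency digest.
    """
    owners = on_call.get(alert.team, [])
    if SEVERITY_ORDER[alert.severity] >= SEVERITY_ORDER[page_threshold]:
        return ("page", owners)
    return ("digest", owners)
```

With an on-call table such as `{"network": ["ana"], "database": ["raj"]}`, a critical network alert pages only `ana`, while a minor database alert lands in a digest for `raj`.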
- What are the usual communications breakdowns involving IT incidents?
There are three typical breakdowns, one for each group you need to reach. First, you want to quickly identify and communicate with your on-call IT specialists across the different IT teams (Network, Applications, Middleware, Server, Databases…), as they need to be involved in the incident resolution process as soon as possible. We call them the “Resolvers”. Second, there are the “Stakeholders”: if the e-commerce application is down, you may want to inform the application owner, the CIO, the VP of Sales and the Marketing team so they are ready to address any bad publicity they find on social networks. And third, there are the “End-users”, whether they are internal employees who can’t use email because it’s down, partners who can’t access your portal, or online customers who can no longer buy services or products from you. You want to make sure you communicate with each of these groups in a timely manner and with the appropriate information.
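The three audiences above can be sketched as a small notification plan in which each group receives a message worded for its role. This is only an illustrative sketch: the group names come from the conversation, while the message templates, contact names and the `build_notifications` function are hypothetical.

```python
# Hypothetical message templates, one per audience named in the interview.
TEMPLATES = {
    "resolvers": "ACTION REQUIRED: {service} incident. {summary}",
    "stakeholders": "FYI: {service} is impacted. {summary}",
    "end_users": "{service} is currently unavailable. We are working on it.",
}


def build_notifications(service, summary, contacts):
    """Return a list of (group, contact, message) tuples.

    `contacts` maps a group name to the list of people or
    distribution lists to notify. Groups without contacts are skipped.
    """
    notifications = []
    for group, template in TEMPLATES.items():
        # Extra keyword arguments are ignored by templates that do not use them.
        message = template.format(service=service, summary=summary)
        for contact in contacts.get(group, []):
            notifications.append((group, contact, message))
    return notifications
```

Keeping the wording per group in one place means Resolvers get the technical detail they need to act, while End-users only see a short status message.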
Check back for part 2 of the conversation next week. Interested in hearing more about IT Incident Response? Check out our whitepaper, “IT Incident and Disaster Alerting: Using Communication to Improve Response.”