What IT outage lessons have we learned in 2015?
A 2014 Gartner study projected that ‘through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.’
In the case of planned downtime, there is no excuse for failing to inform stakeholders in advance, and even unplanned incidents can be communicated rapidly (more on the best ways to do this later). So why are we not better prepared for the inevitable problems? Just as England grinds to a halt when a few centimetres of snow fall overnight, many companies are poorly equipped to achieve the shortest possible mean time to resolution (MTTR) for an IT incident. As we rely ever more on technology to run every aspect of our business, it is crucial that we keep working to increase our resilience.
An outage of just a few hours at a big corporation makes headline news and, depending on the nature of the business, can bring reports of multi-million-pound losses. On 11th March 2015, Apple’s App Store went down for over 11 hours, along with iTunes, iBooks, iCloud and even the Mac App Store. It was the largest outage Apple had faced, and it was caused by an internal issue with its DNS (Domain Name System). The lost revenue was estimated at $25 million: a drop in the ocean for Apple, but a potential death knell for app developers and musicians.
What about the impact of a cyber-attack? The long-term effects of the hack on TalkTalk’s website will not be known for some time, but consumers do not welcome evidence of vulnerability. So how can a company demonstrate its IT resilience? And how can it be sure it will respond to an IT incident quickly enough to limit the damage?
IT outages and downtime cannot be ignored; they need immediate attention. Reducing the mean time to resolution improves both customer satisfaction and a company’s reputation. When an effective Critical IT Alerting program is incorporated into an IT Operations Policy, interruptions are organised and prioritised by incident type, location and severity, and the most skilled available IT expert is identified and supplied with up-to-date information to make the necessary repairs. If that expert cannot be contacted, the IT Service Alerting solution finds and contacts the next available engineer, repeating the process until someone confirms they are en route. If the downtime is likely to affect stakeholders and customers, they too can be contacted and kept updated throughout the incident.
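The escalation process described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the implementation of any particular alerting product: the `Engineer` and `Incident` classes, the numeric severity scale (1 = most severe), the skill scores, and the `contact` and `notify_stakeholders` callbacks are all assumptions standing in for a real paging or messaging integration.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Engineer:
    name: str
    skills: dict[str, int]  # hypothetical: incident type -> skill level (higher is better)
    location: str
    available: bool = True


@dataclass(order=True)
class Incident:
    severity: int  # hypothetical scale: 1 = most severe; only this field is used for ordering
    incident_type: str = field(compare=False)
    location: str = field(compare=False)
    affects_customers: bool = field(default=False, compare=False)


def rank_engineers(incident: Incident, engineers: list[Engineer]) -> list[Engineer]:
    """Return available engineers who can handle this incident type, best first."""
    candidates = [
        e for e in engineers if e.available and incident.incident_type in e.skills
    ]
    # Highest skill first; among equal skill, prefer engineers at the incident's location.
    return sorted(
        candidates,
        key=lambda e: (-e.skills[incident.incident_type], e.location != incident.location),
    )


def escalate(
    incident: Incident,
    engineers: list[Engineer],
    contact: Callable[[Engineer, Incident], bool],
) -> Optional[Engineer]:
    """Contact candidates in rank order until one confirms they are en route."""
    for engineer in rank_engineers(incident, engineers):
        if contact(engineer, incident):  # True = engineer acknowledged the alert
            return engineer
    return None  # nobody confirmed; a real system would retry or widen the search


def handle(
    incidents: list[Incident],
    engineers: list[Engineer],
    contact: Callable[[Engineer, Incident], bool],
    notify_stakeholders: Callable[[Incident], None],
) -> list[tuple[Incident, Optional[Engineer]]]:
    """Work through incidents most-severe first, assigning one responder to each."""
    queue = list(incidents)
    heapq.heapify(queue)  # min-heap on severity, so severity 1 is handled first
    assignments = []
    while queue:
        incident = heapq.heappop(queue)
        if incident.affects_customers:
            notify_stakeholders(incident)
        responder = escalate(incident, engineers, contact)
        if responder is not None:
            responder.available = False  # side effect: responder is now busy
        assignments.append((incident, responder))
    return assignments


if __name__ == "__main__":
    team = [
        Engineer("Alice", {"dns": 5}, "London"),
        Engineer("Bob", {"dns": 3, "network": 4}, "Leeds"),
        Engineer("Carol", {"network": 5}, "London"),
    ]

    def fake_contact(engineer: Engineer, incident: Incident) -> bool:
        # Simulated page: Alice is unreachable, everyone else acknowledges.
        return engineer.name != "Alice"

    def fake_notify(incident: Incident) -> None:
        print(f"Notifying stakeholders: {incident.incident_type} outage in {incident.location}")

    incidents = [
        Incident(2, "network", "London"),
        Incident(1, "dns", "London", affects_customers=True),
    ]
    for incident, responder in handle(incidents, team, fake_contact, fake_notify):
        print(incident.severity, incident.incident_type, "->",
              responder.name if responder else "unassigned")
```

In the example run, the DNS incident is handled first because it is more severe; Alice is the best-ranked engineer but does not respond, so the alert escalates to Bob. Ranking by skill first and location second is only one possible policy; a real system might also weigh on-call rotas or response-time targets.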