Data centre CSI: Investigating facility downtime

Nov 6, 2017 | Articles, Consulting

Future-tech’s forensic engineering services help our data centre customers root out the causes of unplanned downtime and reduce the risk of future incidents. Ensuring availability at almost any cost has been one of the central tenets of the data centre industry since its inception. Yet despite an almost obsessive focus on avoiding unplanned downtime, it still occurs. To date in 2017 there have already been a number of high-profile outages at airlines, co-location operators, financial services companies and IT services providers. These resulted in reputational damage, as well as direct costs running into the tens of millions, for the businesses concerned and affected.

Unsurprisingly, organisations are highly motivated to understand the root causes of outages, both as part of internal due-diligence processes and, ideally, to prevent future incidents. Some companies prefer to conduct their investigations internally, whilst others choose to use specialist third parties such as Future-tech to provide thorough, unbiased and confidential forensic analysis services. Data centre technology companies, such as Schneider Electric, also provide similar services. However, they usually investigate only why the equipment they provided has failed, whereas our investigations are generally more wide-ranging.

We have seen a steady increase in demand for our forensic engineering services and have conducted half a dozen investigations across medium to large sites over the last 12 months. We are usually called into sites where there has been an outage, initially to establish the root cause of the downtime. Most engagements also involve suggesting measures to harden a site, or specific infrastructure equipment, to prevent repeat events.

Root-cause analysis

Our engineers can be on site within hours of an outage, although it is usually a few days before the investigation begins. This is because the client is often more focused on getting critical systems back up and running than on immediately understanding what caused the outage.

Ground zero in data centre forensic investigations is often the IT equipment, even if the ultimate cause is likely to be the mechanical and electrical (M&E) infrastructure. Our in-house team will often begin by examining server components and identifying why particular resistors, capacitors, transistors and integrated circuits have failed. From there, the team can trace back up the power chain until the ultimate cause of the incident is identified.

A faulty UPS, DRUPS, breaker or bypass could be the direct cause; however, human error is also often a contributory factor. A facility operator who hasn’t been trained adequately, or who simply makes a mistake, can turn what could have been an isolated equipment fault into a site-level outage. According to a 2017 survey by Uptime Institute, at least 38 per cent of outages were caused by human error; this includes recent high-profile outages at a major U.S. airline and a public cloud services provider. Human error is often the result of wider management failures rather than isolated problems with a specific individual. Issues such as design compromises, training budget cuts, staffing reductions, a lack of preventative maintenance or life-cycle planning, and selecting the lowest-cost vendor can all directly contribute to a human error incident or exacerbate its aftermath.

Reporting and prevention

Once the root causes of an outage have been identified, we detail our findings in a report. This usually includes suggestions for replacing equipment, altering a facility’s design and ultimately hardening the infrastructure against further incidents. It can also include guidance on how to improve, or even overhaul, management practices. These reports are highly sensitive documents and, as such, strict non-disclosure agreements are in place to prevent them being publicly circulated. That said, some clients have been very open about lessons learned and how the site has been improved to reduce the risk of downtime in the future.

Although some reports may make hard reading for IT or facilities teams, the ultimate aim of our forensic engineering investigations is not to apportion blame but rather to provide detailed analysis of the incident and suggest improvements. Developing and implementing measures to prevent future downtime is not only important to the companies concerned but is also rewarding for our engineers. The ideal scenario is that our suggestions, upgrades and design changes are adopted and that, the next time there is a major issue, those changes protect the data centre from another outage. We have had this happen on several occasions. It is this scenario that our engineers find most rewarding: making changes and improvements, and then seeing them do the job when it really matters.

Future-proofing

Despite continual improvements in facility M&E equipment and design, as well as in IT infrastructure, it seems the time taken to investigate and recover from outages is actually increasing. A survey conducted by Uptime in 2017 revealed that only 8 per cent of organisations were less concerned about IT outages than they were a year ago. Advances in so-called distributed resiliency or geo-redundancy, where networks and IT take over a bigger role in maintaining service levels from redundant M&E equipment, were meant to lessen concerns about downtime. But it seems that the complexity inherent in distributed resiliency is actually increasing the perceived risk of outages.

However, organisations such as Uptime and Future-tech are now developing measures to improve resiliency across highly distributed environments. There are also efforts, such as the Data Center Incident Reporting Network, to track data centre outages more accurately and to enable operators to share knowledge about incidents more easily. Such knowledge sharing has not traditionally happened in the data centre industry.

Looking further ahead, we believe the monitoring and management of data centres is becoming increasingly automated and software-driven. For example, use of data centre infrastructure management (DCIM) tools is increasing, albeit gradually, which in the long term should make it easier for some operators to self-diagnose outages. Data centre technology suppliers are embedding more intelligence into M&E equipment to enable proactive and preventative maintenance that, over time, should help to reduce downtime. Artificial and augmented intelligence is also being developed that can predict equipment failures using data gathered from multiple equipment sensors across multiple data centre asset portfolios. Even with these measures in place, it is unlikely we will ever be able to eliminate unplanned downtime altogether. While this is the case, the need to find out why outages have occurred, and how we can protect the facilities affected from future occurrences, remains vital.

If your data centre has experienced an outage or, perhaps more importantly, if you wish to proactively assess a data centre’s resilience, design and infrastructure before a major outage occurs, contact us on info@future-tech.co.uk for a confidential discussion with one of our Design Engineers or Consultants.