Data Centre Availability and Reliability – An Explanation & Suggested KPIs


The relationship between data centre Availability, Reliability, Uptime and Downtime is often misunderstood, so this basic guide offers some suggestions on how these terms can be understood and used in the context of data centre operations.

Availability provides information about how time is used: the length of time a service is provided. Reliability provides information about the failure-free interval. Both are expressed as percentage values, which can lead to confusion and misunderstanding.

Availability IS NOT equal to Reliability except in the non-existent world of no downtime and no failures.

At its simplest level Availability is defined as:

Availability = Uptime / (Downtime + Uptime)

The time units are generally hours, and the time base is typically one year. There are 8,760 hours in a non-leap year, so the denominator of the availability equation is 8,760 hours (or 525,600 minutes) if the results are being studied annually.
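As a minimal sketch of the formula above (the downtime figure is hypothetical), the calculation can be expressed as:

```python
HOURS_PER_YEAR = 8760  # non-leap year

def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), returned as a percentage."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)

# Hypothetical year with 0.876 hours (about 52.6 minutes) of downtime
downtime = 0.876
uptime = HOURS_PER_YEAR - downtime
print(f"Availability: {availability(uptime, downtime):.4f}%")  # 99.9900%
```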

Availability for data centres is frequently specified in ‘nines’ notation. For example, ‘three-nines’ availability corresponds to 99.9% availability. A ‘five-nines’ availability corresponds to 99.999% availability. It is generally accepted though that the ‘nines’ do not constitute a useful definition of data centre service availability as even a 25 millisecond power outage could take down the entire data centre for several hours or even days.

It is little comfort, therefore, to know that even with a ‘five-nines’ (99.999% availability) data centre there will be an exposure of around five minutes of service outage per year. Many data centres are designed to provide in the order of 99.999% service availability and so do not even attempt to reach the higher availability level of 31 seconds of outage per year that would result from 99.9999% (‘six-nines’) service availability.

 

Inherent Availability

Inherent Availability looks at availability from an overall design perspective, as would be the case for a data centre building. In this case Availability (Ai) is the long-term average fraction of time that a repairable system is in service and is satisfactorily performing its intended function (given an infinite mission time). This is always measured in terms of the percentage of uptime vs. downtime; the closer to 100% the better.

Inherent availability references the overall availability when factoring in the downtime of the system based on Corrective Maintenance (CM) activities. This definition of Availability is what is typically referred to by those responsible for systems maintenance.

Inherent Availability (Ai) = MTBF / (MTBF + MTTR)

In the equation above, Mean Time Between Failure (MTBF), as the name suggests, is the average time between failures of hardware modules or systems. It is the average time a manufacturer estimates before a failure occurs in a hardware module.

Mean Time To Repair (MTTR) is the time taken to repair a failed hardware module or system. In an operational system, repair generally means replacing the hardware module, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. It should be a goal of system designers to allow for a high MTTR value and still achieve the system reliability goals, for example by designing in redundancy so that a slow repair does not itself cause an outage.
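A minimal sketch of the Ai calculation, using purely hypothetical MTBF and MTTR figures:

```python
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Ai = MTBF / (MTBF + MTTR), returned as a percentage."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical module: 100,000-hour MTBF, 4-hour mean time to repair (replace)
print(f"Ai = {inherent_availability(100_000, 4):.4f}%")   # ~99.9960%
# A slower repair (24 hours) still yields a high Ai if the MTBF is long enough
print(f"Ai = {inherent_availability(100_000, 24):.4f}%")  # ~99.9760%
```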

MTTR for a software module or service can be computed as the time taken to reboot after a software fault or service outage is detected. Thus software / service MTTR could be viewed as the mean time to reboot or recover after a software or service fault has been detected. In a data centre context this could mean the recovery of the IT services and applications rather than merely the supporting building power and cooling elements.

This element significantly complicates the creation of data centre SLAs, as operators (particularly colocation operators) may not have control over the IT services and functions. It is also frequently overlooked that restoring IT services may take far longer than restoring building services after, for example, a momentary power delivery issue. Often it is simply easier to concentrate on the building infrastructure itself when discussing availability.

 

Operational Availability

Operational Availability is defined a little differently as follows:

Availability (Operational) (Ao) = MTBM / (MTBM + MDT)

MTBM = Mean Time Between Maintenance.

MDT = Mean Down Time

The operational world tends to reference Operational Availability as defined by the equation above.

The mean time between maintenance (MTBM) includes all corrective and preventive actions associated with an installed system (compared to MTBF which only accounts for failures). 

The Mean Down Time (MDT) includes all time associated with the system being down for Corrective Maintenance (CM), including delays (compared to MTTR, which only addresses repair time), as well as self-imposed downtime for Preventive Maintenance (PM). It is preferable to perform most PM actions while the equipment is operating, hence the dual-path / Concurrent Maintainability designs preferred for most data centres.

Ao is inevitably a smaller availability number than Ai because of naturally occurring operational deficiencies and inappropriate actions, particularly the well-reported issues related to Human Factors and operational errors in data centres.
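To make the difference between Ai and Ao concrete, the following sketch uses hypothetical figures for the same system:

```python
def availability_ratio(interval_hours: float, downtime_hours: float) -> float:
    """Generic form: interval / (interval + downtime), returned as a percentage."""
    return 100 * interval_hours / (interval_hours + downtime_hours)

# Hypothetical figures for one system
mtbf, mttr = 50_000, 6   # failures only; repair time only
mtbm, mdt = 4_000, 10    # all maintenance actions; all downtime including delays

ai = availability_ratio(mtbf, mttr)  # Inherent availability
ao = availability_ratio(mtbm, mdt)   # Operational availability
print(f"Ai = {ai:.4f}%, Ao = {ao:.4f}%")  # Ao is always the lower figure
```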

 

Downtime per Year

Downtime per Year is merely a more intuitive way of expressing Availability. Contractual obligations frequently reference percentage availability or downtime, often without a full understanding of the service impact.

For instance, a contract guaranteeing service availability from the data centre at a level of 99.99% still legitimately allows for 52 minutes of downtime per year. While this might be a single 52-minute period of downtime, it could equally be 52 separate one-minute outages or even 3,120 separate one-second instances. Clearly the latter would be totally unacceptable to most customers but would technically still be within SLA.

Based on the ITIC curve, to which IT equipment power supplies should be designed, any power loss of more than 20 to 30 milliseconds is likely to result in the shutdown of the IT equipment, potentially across the entire data centre. A site with a guarantee of 99.99% availability (or uptime) could therefore legitimately have 3,120 or more very brief outages, each of which powers down every device in the data centre.
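The arithmetic behind that observation can be sketched as follows, assuming a 99.99% annual guarantee:

```python
MINUTES_PER_YEAR = 525_600  # non-leap year

guaranteed_availability = 0.9999  # 'four nines'
allowed_downtime_min = (1 - guaranteed_availability) * MINUTES_PER_YEAR
print(f"Allowed downtime: {allowed_downtime_min:.2f} minutes/year")  # ~52.56

# The same budget could be spent as thousands of very brief outages, each
# longer than the 20-30 ms ITIC ride-through and so enough to drop the IT load.
one_second_outages = int(allowed_downtime_min * 60)
print(f"Equivalent one-second outages: {one_second_outages}")  # ~3150
```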

The table below gives the relationship between percentage availability and downtime. It should be noted that The Uptime Institute™ no longer references percentage availability due to the inherent flaws detailed above. For instance, 99.9999% uptime (‘six-nines’) is greater than might be expected from a Tier IV site and yet still allows for 31 seconds of unplanned downtime per year.


Annual Availability              Annual Downtime
90% (1 nine)                     36.5 days/year
99% (2 nines)                    3.65 days/year
99.9% (3 nines)                  8.76 hours/year
99.99% (4 nines)                 52 minutes/year
99.999% (5 nines)                5 minutes/year
99.9999% (6 nines)               31 seconds/year
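For reference, the table can be reproduced (and extended to other ‘nines’ levels) with a short script; the thresholds used to pick the display units are arbitrary:

```python
HOURS_PER_YEAR = 8760  # non-leap year

levels = ["90%", "99%", "99.9%", "99.99%", "99.999%", "99.9999%"]

for nines, label in enumerate(levels, start=1):
    downtime_h = 10 ** -nines * HOURS_PER_YEAR  # unavailable fraction of the year
    if downtime_h >= 24:
        print(f"{label} ({nines} nines): {downtime_h / 24:.2f} days/year")
    elif downtime_h >= 1:
        print(f"{label} ({nines} nines): {downtime_h:.2f} hours/year")
    elif downtime_h >= 1 / 60:
        print(f"{label} ({nines} nines): {downtime_h * 60:.1f} minutes/year")
    else:
        print(f"{label} ({nines} nines): {downtime_h * 3600:.1f} seconds/year")
```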

Regardless of compliance with KPIs, the service expected by the customer and provided by the supplier will ultimately be based on SLAs. It is often the case that colocation providers will agree to high levels of availability (up to 100% in some cases) and sign a contract to that effect, but very frequently they simply accept the risk associated with the contract and accrue for failure!

An additional key point in relation to availability, particularly percentage availability, is that different elements of the data centre service stack will be affected in dramatically different ways by periods of downtime. Power, for instance, needs to be referenced in the millisecond range and cooling in the minute range when determining likely impact. IT and network services will depend to a large extent on application and service resilience and redundancy, but all need to be considered separately and include references to both partial and total service loss.

The key basic elements of the data centre service stack that need to be considered in terms of availability are listed in the section below. If any one of these elements is compromised or unavailable it may result in loss of service from the data centre and should therefore be considered “Downtime”.

 

Data Centre Availability Reference Points / KPIs

To address some of the issues described above it is recommended that data centre operations reference more than simple Availability / Downtime at the building level. Some key data centre availability reference points are suggested below:

 

Power Availability: Agreed power quality levels for Voltage, Frequency and Harmonics for an agreed level of current. 

  • Power KPI: No total loss of power or reduction in power quality supporting the critical load for longer than the ride-through period defined by the ITIC curve. Note that in the case of redundant power supplies this will be measured on the basis of a single functioning circuit rather than all redundant circuits. For dual supplies this would mean either the A or the B circuit being available rather than both, something that can be overlooked in service contracts. Performance in terms of loss of power, resilience and capacity will ultimately be determined by the SLA with stakeholders or customers. Measured in fractions of seconds.

 

Cooling Availability: Within an agreed SLA referencing the environmental envelope corresponding to ASHRAE TC9.9 2021 Equipment Thermal Guidelines for Data Processing Environments (5th Edition).

  • Cooling KPI: Correct levels of cooling / heat removal are available within the limits of the relevant SLA. Measured in minutes.

 

Network Availability: Both internal and external within agreed latency and packet loss parameters for an agreed level of bandwidth.

  • Network KPI: Internal and external networks available within the limits of the relevant SLA and service agreement. Measured in minutes.

 

Server / Compute Availability: Is the compute platform available, and is there sufficient capacity to process the required workloads?

  • Server / Compute KPI: No impact on workload processing capability based on SLA and service agreement. Measured in minutes.

 

Storage Availability: The data storage medium is available and the stored data can be retrieved and used or modified.

  • Storage KPI: No impact on data storage or data access that would inhibit workload processing capability based on SLA and service agreement. Measured in minutes.

 

Security Availability: Security systems and security personnel are working and available at minimum levels or above during agreed periods.

  • Security KPI: No security breaches or loss of security systems. No impact or degradation in the services being delivered from the data centre. A DDoS attack or loss / theft of data would come under this KPI. Measured on resulting impact.

 

Application / Software Availability: Includes OS, Middleware and Application. The compute platform may be switched on and available but may not be capable of performing the required work.

  • Application / Software KPI: No interruption in services or degradation of service from the data centre as a result of application problems. Measured in minutes.

 

Commonly used KPIs throughout the data centre sector relate to service continuity and availability, and in this context it is common to see percentage availability or uptime quoted in marketing materials or included in SLA documents without being fully qualified or understood.

Although it is inappropriate to measure overall data centre availability as a percentage, it is acceptable to measure each of the above as a percentage of 8,760 hours (annually).
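As a minimal sketch, each element could be tracked separately against the 8,760-hour year, with downtime recorded in the unit appropriate to that element; the figures below are purely illustrative:

```python
HOURS_PER_YEAR = 8760  # non-leap year

def element_availability(downtime_hours: float) -> float:
    """Availability of one service-stack element as a percentage of the year."""
    return 100 * (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR

# Illustrative annual downtime per element, converted to hours
annual_downtime_hours = {
    "Power":   0.5 / 3600,  # 0.5 seconds of power loss beyond the ITIC limits
    "Cooling": 12 / 60,     # 12 minutes outside the agreed environmental envelope
    "Network": 20 / 60,     # 20 minutes outside the SLA
}

for element, hours in annual_downtime_hours.items():
    print(f"{element:8s} availability: {element_availability(hours):.5f}%")
```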

It is also worth noting the potentially large discrepancy between IT and facilities downtime in a data centre, particularly in terms of the restoration of services (MTTR). A very brief power outage of less than a second may result in total loss of IT services that may potentially take hours or days to recover from due to the need to restore large data volumes or re-sync very large databases. 

 

Reliability

Reliability is defined as the ability of an item to perform a required function under given conditions for a given time interval.

 

Reliability (R) is the probability that a system will operate properly for a finite period under design operating conditions without failure.

For a constant failure rate λ, reliability as a function of time is: R(t) = e^(-λt)

 

Reliability (uppercase “R”): a term used in common language that reflects the overall state of a system.

reliability (lowercase “r”): a number, typically expressed as a percentage, that reflects the probability and frequency of failures over a given period of time.

Reliability is expressed as a probability (0 to 1, or 0 to 100%). For example, the reliability of a component may be expressed as a 99% probability that it will work successfully for one year. Reliability is essentially an indication of the probability that an item will not fail in a given time period.
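A minimal sketch of these relationships, assuming a constant failure rate derived from a hypothetical MTBF:

```python
import math

HOURS_PER_YEAR = 8760  # non-leap year

def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """R(t) = exp(-λt) for a constant failure rate λ."""
    return math.exp(-failure_rate_per_hour * t_hours)

# Hypothetical item with an MTBF of 100,000 hours (λ = 1 / MTBF)
print(f"R(1 year) = {reliability(HOURS_PER_YEAR, 1 / 100_000):.3f}")  # ~0.916

# Conversely, a 99% one-year reliability target implies a maximum failure rate
max_rate = -math.log(0.99) / HOURS_PER_YEAR
print(f"Implied minimum MTBF: {1 / max_rate:,.0f} hours")  # ~871,600
```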

A highly generalised curve for the failure rate of components over time is the bathtub curve. After a long operating period, items will fail at an increasing rate due to wear and other time-related effects.

 

Availability vs Reliability Comparison

Availability is a commonly used term in Reliability Modelling. Availability can be defined as the long-term average fraction of time that a repairable component or system is in service and satisfactorily performing its intended function. For example, if the power is off for one hour in a year, but the rest of the year the power is on, the availability of electrical power for that year is 8,759 hours divided by 8,760 hours, which is 0.999886.

 

An availability of 0.99999 (‘five-nines’) could mean that the system was down for 5.3 minutes (or 315 seconds) per year. It would make no difference in the availability calculation if there was one 5.3-minute outage or 315 one-second outages. It could also be one outage of 1.77 hours in 20 years. In all three cases, the availability is 0.99999. Obviously 315 one-second outages would be very destructive to the operation, whereas one outage of 1.77 hours every 20 years might be considered acceptable.

 

Maintainability

Maintainability is the ability of an item, under given conditions of use, to be retained in, or restored to, a state in which it can perform a required function, when maintenance is performed under given conditions and using stated procedures and resources.

When a piece of equipment has failed it is important to get it back into an operating condition as soon as possible; this is what maintainability addresses. MTTR data supplied by manufacturers will be purely repair time, which assumes the fault is correctly identified and the required spares and personnel are available. In practice, the MTTR will also include any logistic delay associated with replacement parts. This demonstrates the value of on-site storage for critical replacement items and of retaining reliable external suppliers. The MTTR will also be affected by factors such as the ready availability (on-site presence) and appropriate skills of the site maintenance engineers.

 

Equipment redundancy and the ability to provide concurrent maintainability are vital to a data centre operation and the reduction of MTTR times. The aim of regular maintenance is to keep critical equipment in a ‘like new’ condition as far as possible to minimise failures. These operational elements must therefore be considered and designed in at the outset along with the elimination of Single Points of Failure (SPoF) and specification of reliable equipment for all business critical data centres.  

 

Finally

In discussions regarding both Availability and Reliability in data centres it is always worth remembering Garrett Harding’s formula:

(Technology Reliability) × (Human Reliability) = System Reliability
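As a purely illustrative calculation with hypothetical figures, even very reliable technology is quickly undermined by the human term:

```python
technology_reliability = 0.9999  # hypothetical
human_reliability = 0.98         # hypothetical
system_reliability = technology_reliability * human_reliability
print(f"System reliability: {system_reliability:.4f}")  # 0.9799, dominated by the human term
```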