The single most important reason for a data centre’s existence is to maintain the availability of the services hosted within it and distributed from it. These services are based on the IT equipment located within the building, which requires four key resources: security, network resilience, and continuity of power and cooling.
There is an imperative on the data centre operator to maintain this service availability in the most energy efficient and cost effective manner possible. The following series of articles presents tried and tested practices focussed on attaining this goal. The intention is to highlight industry leading best practices, best known methods and behaviours that should be applied consistently across any data centre site in order to achieve this aim.
Data Centre Operations and Management
As described above, a data centre is a building, or a segregated area of a building, whose sole purpose is to provide secure, highly available and reliable IT services.
To achieve a highly available service, the site infrastructure and operations need to be closely synchronised, extremely well understood and fully documented, with policies, procedures, staff and training that both identify and mitigate any risk that could cause a service interruption or threaten service continuity. To achieve long-term reliability, it is imperative that data centre facilities are operated and maintained to high standards, and that every potential risk is identified, recorded and covered by a plan that provides an appropriate level of mitigation.
Operating risk is difficult to model and quantify as it relates directly to the competency of operating personnel, the adequacy of the training they receive and the effectiveness of the processes in place. To minimise operating risk, operating and maintenance personnel must be suitably qualified to operate the equipment contained within the facility.
Quality of maintenance is critical to the reliable operation of any data centre. All maintenance activities should be fully documented, and mean time between failures (MTBF) and mean time to repair (MTTR) records should be kept for the purposes of lifecycle analysis. Suppliers should be asked to keep records of such information and make them available on request. This will increase the accuracy of risk analyses and performance trending studies.
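As a rough illustration of how such records can feed lifecycle analysis, the Python sketch below derives MTBF and MTTR for a single asset from a maintenance log. The `FailureRecord` structure, the function name and the sample dates are all hypothetical; a real log would come from whatever maintenance management system is in use on site.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailureRecord:
    """One entry in a hypothetical maintenance log for a single asset."""
    failed_at: datetime
    restored_at: datetime


def mtbf_and_mttr(records, period_start, period_end):
    """Return (MTBF, MTTR) as timedeltas over the observation period.

    MTBF = total operating time / number of failures
    MTTR = total repair time / number of failures
    """
    if not records:
        raise ValueError("at least one failure record is needed")
    total_repair = sum(
        (r.restored_at - r.failed_at for r in records), timedelta()
    )
    total_operating = (period_end - period_start) - total_repair
    failures = len(records)
    return total_operating / failures, total_repair / failures


# Illustrative data only: two failures during 2023.
log = [
    FailureRecord(datetime(2023, 3, 1, 8, 0), datetime(2023, 3, 1, 10, 0)),
    FailureRecord(datetime(2023, 9, 15, 14, 0), datetime(2023, 9, 15, 18, 0)),
]
mtbf, mttr = mtbf_and_mttr(log, datetime(2023, 1, 1), datetime(2024, 1, 1))
print("MTBF:", mtbf, "MTTR:", mttr)
```

Keeping the raw failure and restoration timestamps, rather than only the derived averages, allows the figures to be recalculated for any period of interest.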
The Data Centre Operations Team is typically made up of individuals from multiple organisations or departments, each with its own scope and responsibilities.
It is vital to successful data centre operations that these roles and responsibilities are very clearly defined and documented. Divided or uncertain responsibility increases risk, particularly in complex operational environments. Therefore all areas of responsibility, reporting lines and escalation mechanisms should be very clearly documented at the outset and made clear to all parties.
All responsibilities relating to specific building areas (offices, plant rooms etc.) should also be very clearly defined and detailed. These should extend to all IT, Security and Facilities staff and key suppliers. The clear determination of responsibilities and organisation is an integral part of the overall regime of Operational Best Practice, so defining the responsibilities of the various individuals, groups, departments and organisations with an interest in site operations is vital to operational performance.
All site operations personnel should be well trained and well practised, and confident and competent to operate the site infrastructure at all times, including during highly stressful events such as an equipment failure that could result in loss of service. Specifically, personnel should be trained by the suppliers of the equipment installed on site to ensure the highest level of training and understanding. It is important that this is taken into account within tenders for new equipment.
A well run data centre will always consider Human Factors in relation to site operations and management.
Availability and Reliability
Availability and Reliability are central to any data centre operation but how are these defined and measured?
The section below offers some basic insight in this area.
Reliability: the probability of failure-free system operation over a specified time in a given environment for a given purpose.
Availability: the probability that a system, at a point in time, will be operational and able to deliver the requested services.
Always measured in terms of percentage of uptime vs. downtime; the closer to 100% the better.
Availability is the percentage of time during which a system is operational. In its simplest form, the availability of a component can be obtained from the formula below:
A = Uptime / (Uptime + Downtime)
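The ratio A = Uptime / (Uptime + Downtime) can be sketched in a few lines of Python. The function name and the sample figures are illustrative assumptions, chosen so that the result corresponds to roughly one year of operation with 8.76 hours of downtime.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Illustrative figures only: a 365-day year with 8.76 hours of downtime.
a = availability(uptime_hours=8751.24, downtime_hours=8.76)
print(f"Availability: {a:.3%}")
```

Both inputs must use the same unit of time; the result is a fraction that is usually quoted as a percentage.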
Availability and Reliability – related but distinct
Availability takes into account the time that the system is out of service
Unreliable systems can have high availability if the restart time is short (this is of little use in data centres, particularly in relation to power supply)
Availability gives information about use of time. Reliability gives information about the failure-free interval. Both are described in % values.
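The distinction can be made concrete with a small Python sketch that compares two hypothetical outage logs for one year. The outage figures are invented for illustration, and the mean operating time between outages is used here only as a rough stand-in for the failure-free interval.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def summarise(outage_seconds):
    """Return (availability, outage count, mean operating seconds per outage)."""
    downtime = sum(outage_seconds)
    operating = SECONDS_PER_YEAR - downtime
    availability = operating / SECONDS_PER_YEAR
    return availability, len(outage_seconds), operating / len(outage_seconds)


# Hypothetical logs with identical total downtime (30 seconds).
single = summarise([30.0])         # one 30-second outage
frequent = summarise([0.5] * 60)   # sixty half-second outages

print("single outage:   ", single)
print("frequent outages:", frequent)
```

Both logs give exactly the same availability, yet the second site fails sixty times as often, which is why availability alone says little about the failure-free interval.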
Downtime per year is a more intuitive way of understanding availability. The figures below show the downtime per year that corresponds to each level of availability.
90% (1-nine) – 36.5 days/year
99% (2-nines) – 3.65 days/year
99.9% (3-nines) – 8.76 hours/year
99.99% (4-nines) – 52 minutes/year
99.999% (5-nines) – 5 minutes/year
99.9999% (6-nines) – 31 seconds/year
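The downtime figures above follow from simple arithmetic: the unavailable fraction of the year multiplied by the length of the year. A minimal Python sketch, assuming a 365-day year and using a hypothetical function name, is shown below; the listed values are rounded versions of what it produces.

```python
from datetime import timedelta


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return timedelta(days=365) * unavailable_fraction  # assumes a 365-day year


for percent in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    print(f"{percent}% -> {downtime_per_year(percent)}")
```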
Note that 99.9999% availability (6-nines) allows for 31 seconds of downtime per year. This might equate to a single 31 second outage, or potentially 31 x 1 second outages, or even 62 x 0.5 second outages, and so on. If percentage availability is used to inform an SLA on power supply continuity, it is unlikely to be of any great value in terms of operational availability without a great deal of qualification.
Read our second in the series of Data Centre Best Practices articles here.
Future-tech has been designing, building and managing business critical data centres since 1982. The experience gained from being involved in the data centre sector from the outset has resulted in Future-tech sites achieving 99.999% uptime during 35+ years of operation. Future-tech has a team of experienced, skilled and highly trained in-house Data Centre Engineers capable of properly maintaining and operating business critical data centre sites of all sizes. For more details please contact Richard Stacey on 0845 900 0127 or at email@example.com