Data Centre Best Practices Article 4 – Critical Facilities Operational Management

Feb 11, 2020 | Articles, Best Practice

This article is the fourth in a series of Data Centre Best Practice articles provided by Future-tech Ltd.

In this article we highlight the basic best practices associated with operating a data centre from a Critical Facilities Management perspective. This involves the Monitoring, Management and Maintenance of the building infrastructure which ultimately supports the installed IT load and the services it provides.

The essential aim is to ensure the continuity of the services hosted by the data centre and the proper functioning of the infrastructure on which these services depend.

These practices should be applied in order to achieve the levels of availability or resilience for which the site was designed. It should be noted, however, that no amount of Monitoring, Management or Maintenance can deliver the availability or resilience required by the business if the site has not been designed, constructed or previously maintained to the standard needed to sustain it.

Facilities Operational Management

The site Critical Facilities / Mechanical and Electrical (M&E) Engineering team should be installed on site at least 6 weeks prior to live operation to ensure that all documentation is complete and accurate, that site processes and procedures are robust and tested and that the team has had the opportunity to reconfigure and operate the site infrastructure without live load attached.

With the above in mind, involve the site team in the design process where possible, so that the design supports reliable site operations and maintenance. Additionally, involve the site team in the commissioning process to verify the correct operation of installed infrastructure, and give the site engineering team the opportunity to contribute to site snagging lists and to the final sign off of correctly completed work.

Once Operation and Maintenance (O&M) manuals have been handed over by the contractor, it is imperative that all internal staff, contractors and suppliers keep these documents up to date and accurate. Additionally, “As Built” record drawings should be continually updated to reflect the latest state of the installation, including any changes or additions that are made to the site.

Method statements should be produced and agreed with all parties, including service customers, prior to all maintenance or project activities. For the purposes of their own notification requirements, it is important that customers are made aware of any site maintenance or project activities. This includes the precautions that are to be taken to maintain service agreements, and notice of any potential disruption or risk to service that may result.

Critical site budgets should be ring-fenced and separated from non-critical site budgets to reduce the chances of funding being eroded without an understanding of the inevitable consequences.

All new site staff or visitors should be trained in the specific rules and safety aspects related to the site, in addition to detailed training on any system(s) they will be responsible for operating or maintaining. To make sure this is in place, a fully documented and regularly updated training programme should be available for all site staff.

An organisation chart showing command chain and responsibilities should be available detailing all interfaces between the Facilities Management, IT, and Security groups. This should include a roles and responsibilities matrix covering all activities at the data centre as well as critical facilities job descriptions.

A comprehensive document library should be readily available for site staff to reference and be constantly updated. This should include up to date “As Built” drawings, commissioning records, maintenance records, training records and certifications, accurate O&M manuals, site policies and procedures etc.

A complete, accurate and regularly updated list of critical spares should be maintained, including key supplier details and re-supply order point notifications. The site operator should agree with their suppliers an inventory of the critical spares for the components which are most likely to fail, so that these spares are held on hand against this eventuality.

Such practices can dramatically reduce the Mean Time To Repair (or Recovery) (MTTR) figures used in reliability and availability calculations. Additionally, maintenance contracts should incorporate guaranteed maximum call out times.
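To illustrate why MTTR matters in these calculations, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR), where MTBF is the Mean Time Between Failures. The sketch below applies that standard formula. The MTBF and MTTR figures are hypothetical values chosen purely for illustration and are not measurements from any site.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component: one failure every 50,000 hours on average.
    mtbf = 50_000.0
    # Assumed repair times: spare held on site vs. spare ordered after failure.
    for label, mttr in [("spare on site", 4.0), ("spare ordered", 72.0)]:
        a = availability(mtbf, mttr)
        print(f"{label}: availability {a:.5%}, "
              f"about {annual_downtime_hours(a):.2f} h downtime per year")
```

With the failure rate held constant, any reduction in repair time feeds directly into the availability figure, which is why on-site spares and guaranteed call out times carry so much weight.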

Ensure the adoption and maintenance of an accurate and constantly updated Risk Register. This should include a comprehensive record of risk notifications and risk mitigations as well as the potential risks themselves.

Clear, accurate and comprehensive labelling of all infrastructure components should be in place, readily allowing identification and tracing of distribution paths. This should also include normal position indicators for valves, switches, breakers, indicators etc. Magnetic buttons or cheap coloured stickers can be used to indicate normal operating status; highlighting the expected position provides an immediate view of anything that may have changed during an event.

A visual indicator highlighting current site alert status should be displayed in a visible place for all site staff. Examples would be exposure to increased risk during maintenance, customer site visits, terrorist threat, running on generators etc.

A rigorous planned maintenance and inspection programme should be in place which includes an element of Predictive and Condition based maintenance. This may include Thermal / IR scans, vibration analysis and regular analysis of lubricating fluids, water, fuels, coolant, air quality etc., along with trending analysis of maintenance records and equipment performance and condition. An example of a scheduled maintenance and inspection task would be the regular cleaning and/or replacement of air filters.

The purpose of maintenance is to maintain the data centre site in a “like new” condition. This both reduces operating risk due to equipment failure and ensures that equipment operational efficiency remains high, reducing electrical consumption and therefore energy costs.

Ensure that Emergency Operating Procedures are available, visible and easily accessible in all critical equipment plant rooms, along with appropriate contact numbers and escalation points. To aid with this, place wall-mounted single line drawings and schematics in all plant rooms.

Defined procedures should exist for all switching operations and maintenance activities, and these should be used and followed consistently. In addition, a governance process should be in place that validates both the quality and proper completion of any maintenance activity prior to sign off. The Preventative Maintenance task completion rate should be greater than 95%.
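As a minimal sketch of how the 95% target could be checked, the example below computes a completion rate from a list of scheduled tasks. The `PMTask` record layout is hypothetical, and treating a task as complete only once it has also passed governance sign off is an assumption made for this illustration, not a rule stated above.

```python
from dataclasses import dataclass

TARGET_RATE = 0.95  # threshold from the text: greater than 95%


@dataclass
class PMTask:
    """One scheduled Preventative Maintenance task (hypothetical layout)."""
    task_id: str
    completed: bool
    signed_off: bool  # passed the governance / quality check


def completion_rate(tasks: list[PMTask]) -> float:
    """Fraction of scheduled tasks that are both completed and signed off."""
    if not tasks:
        raise ValueError("no scheduled tasks in the period")
    done = sum(1 for t in tasks if t.completed and t.signed_off)
    return done / len(tasks)


def meets_target(tasks: list[PMTask], target: float = TARGET_RATE) -> bool:
    """True when the completion rate is strictly greater than the target."""
    return completion_rate(tasks) > target


if __name__ == "__main__":
    # Illustrative period: 40 scheduled tasks, one completed but not signed off.
    tasks = [PMTask(f"PM-{i:03d}", True, True) for i in range(39)]
    tasks.append(PMTask("PM-039", True, False))
    print(f"completion rate: {completion_rate(tasks):.1%}, "
          f"target met: {meets_target(tasks)}")
```

The comparison is strict, so a period that lands on exactly 95% does not meet the target as worded.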

Data should be recorded to allow trending and analysis of equipment performance over time, including (but not limited to) energy efficiency, reliability and maintenance overhead. This should feed into a lifecycle planning process for scheduling and funding the replacement of equipment based on performance degradation, increasing maintenance costs or anticipated end of life.
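One simple way to turn recorded data into a trend is an ordinary least-squares line fitted to periodic readings, which can then be extrapolated to estimate when a replacement threshold might be reached. The sketch below assumes monthly readings of a single metric; the metric, the sample values and the threshold are all hypothetical, and a straight-line fit is only one of many possible trending methods.

```python
from typing import Optional, Sequence


def linear_trend(xs: Sequence[float], ys: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until(threshold: float, slope: float, intercept: float,
                 current_month: float) -> Optional[float]:
    """Months until the fitted line reaches the threshold.

    Returns None when the metric is not trending upwards.
    """
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - current_month)


if __name__ == "__main__":
    # Hypothetical monthly maintenance hours logged against one unit.
    months = [0, 1, 2, 3, 4, 5]
    hours = [6.0, 6.5, 7.1, 7.4, 8.2, 8.6]
    slope, intercept = linear_trend(months, hours)
    # Assumed planning threshold of 12 hours per month.
    remaining = months_until(12.0, slope, intercept, current_month=months[-1])
    print(f"trend: {slope:.2f} h/month, months to threshold: {remaining}")
```

An estimate of this kind is a prompt for lifecycle planning discussions rather than a prediction, since real degradation is rarely linear.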

Records of all Saves, Near Misses, Incidents, Critical Events and any service impacting outages should be kept, including dates, times, people, equipment involved, results of root-cause analyses and any lessons learned. This should result in the ability to predict, identify and mitigate risks or take corrective action to prevent recurrence.

Generator starting upon the failure of utility power is potentially a weak point in the continued operation of a data centre site. It is therefore critically important that the generator is well maintained and tested. The generator batteries should be checked and the generator started on a regular and pre-determined schedule according to the manufacturer’s recommendations.

Generators should be run on full building load in a simulation of a total utility power supply failure on a regular basis, at the very least annually. This is often referred to as a “Black Building Test”. Many operators shy away from this test due to the perceived risk. However, not performing this test, and so not confirming the correct operation of these systems, has proved to be a far greater risk for many data centres. There have been several very well publicised outages as a direct result of not performing this test.

Future-tech has been designing, building and managing business critical data centres since 1982. The experience gained in being involved in the data centre sector from the outset has resulted in Future-tech sites achieving 99.999% uptime during 35+ years of operation. Future-tech has a team of experienced, skilled and highly trained in-house Data Centre Engineers capable of properly maintaining and operating business critical data centre sites of all sizes. For more details please contact Richard Stacey on 0845 900 0127 or at rstacey@future-tech.co.uk