Clear, accurate and comprehensive labelling of all infrastructure components should be in place so that distribution paths can be readily identified and traced. This should include normal-position indicators for valves, switches, breakers, indicators etc. Magnetic buttons or cheap coloured stickers can be used to mark normal operating status; the intent is to highlight the expected position so that anything that may have changed during an event is immediately visible.
A visual indicator highlighting the current site alert status should be displayed where all site staff can see it. Examples of conditions that would raise the alert status are exposure to increased risk during maintenance, customer site visits, terrorist threat, running on generators etc.
A rigorous planned maintenance and inspection programme should be in place which includes an element of Predictive and Condition based maintenance. This may include Thermal / IR Scans, Vibration analysis and regular analysis of lubricating fluids, water, fuels, coolant, air quality etc., along with trending analysis of maintenance records and of equipment performance and condition. An example of a scheduled maintenance and inspection task would be the regular cleaning and/or replacement of air filters.
The purpose of maintenance is to keep the data centre site in a “like new” condition. This both reduces operating risk due to equipment failure and ensures that equipment operational efficiency remains high, which reduces electrical consumption and therefore energy costs.
Ensure that Emergency Operating Procedures are available, visible and easily accessible in all critical equipment plant rooms, along with appropriate contact numbers and escalation points. To aid with this, place wall-mounted single line drawings and schematics in all plant rooms.
Defined procedures should exist for all switching operations and maintenance activities, and these should be used and followed consistently. In addition, a governance process should be in place that validates both the quality and the proper completion of any maintenance activity prior to sign off. The Preventative Maintenance task completion rate should be greater than 95%.
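The completion rate target above can be checked with a few lines of code. The following is a minimal sketch in Python, not a prescribed method: the `MaintenanceTask` record, its field names and the choice to count a task only when it has been both completed and signed off are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Target taken from the guidance above: completion rate greater than 95%.
TARGET_RATE = 0.95


@dataclass
class MaintenanceTask:
    """Hypothetical record of one scheduled preventative maintenance task."""
    task_id: str
    completed: bool
    signed_off: bool  # assumption: set by the governance process after validation


def completion_rate(tasks: List[MaintenanceTask]) -> float:
    """Return the fraction of scheduled tasks that were completed and signed off."""
    if not tasks:
        raise ValueError("no scheduled tasks in the reporting period")
    done = sum(1 for t in tasks if t.completed and t.signed_off)
    return done / len(tasks)


def meets_target(tasks: List[MaintenanceTask], target: float = TARGET_RATE) -> bool:
    """True only when the rate is strictly greater than the target."""
    return completion_rate(tasks) > target
```

Counting only signed-off tasks ties the figure to the governance step, so a task that was carried out but never validated does not improve the rate.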
Data should be recorded to allow trending and analysis of equipment performance over time, including (but not limited to) energy efficiency, reliability and maintenance overhead. This should feed into a lifecycle planning process for scheduling and funding the replacement of equipment based on performance degradation, increasing maintenance costs or anticipated end of life.
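As one illustration of trending, a straight line can be fitted to periodic readings to show whether a measure is drifting. The sketch below uses an ordinary least-squares slope; the function name and the example of yearly maintenance cost are assumptions, and real data may need a more robust method.

```python
from typing import Sequence


def linear_trend(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the least-squares slope of ys against xs (change in y per unit x)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means all readings share one x value.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("readings must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical example: yearly maintenance cost of one unit, in arbitrary units.
years = [0, 1, 2, 3]
cost = [1.0, 1.5, 2.0, 2.5]
slope = linear_trend(years, cost)  # cost increase per year
```

A persistently positive slope in maintenance cost, or a negative one in efficiency, is the kind of signal that would feed the lifecycle planning process.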
Records should be kept of all Saves, Near Misses, Incidents, Critical Events and any service impacting outages, including dates, times, people, equipment involved, results of root-cause analyses and any lessons learned. This should result in the ability to predict, identify and mitigate risks or take corrective action to prevent recurrence.
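The fields listed above map naturally onto a simple record structure. The sketch below is one possible layout in Python; the `EventRecord` class, its field names and the `repeat_equipment` helper are hypothetical and only illustrate how such records could be queried for recurring problems.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class EventRecord:
    """Hypothetical record of a save, near miss, incident, critical event or outage."""
    occurred_at: datetime                # date and time of the event
    category: str                        # e.g. "save", "near miss", "incident"
    people: List[str]                    # staff involved
    equipment: List[str]                 # equipment identifiers involved
    root_cause: str = ""                 # result of the root-cause analysis
    lessons_learned: List[str] = field(default_factory=list)


def repeat_equipment(records: List[EventRecord], min_events: int = 2) -> Dict[str, int]:
    """Return equipment that appears in at least min_events records, with counts."""
    # set() ensures an item listed twice in one record is counted once.
    counts = Counter(item for r in records for item in set(r.equipment))
    return {item: n for item, n in counts.items() if n >= min_events}
```

Equipment that keeps reappearing across records is a candidate for corrective action before it causes a service impacting outage.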
Generator starting upon the failure of utility power is potentially a weak point in the continued operation of a data centre site. It is therefore critically important that the generator is well maintained and tested. The generator batteries should be checked and the generator started on a regular, pre-determined schedule according to the manufacturer’s recommendations.
Generators should be run on full building load in a simulation of a total utility power supply failure on a regular basis, at the very least annually. This is often referred to as a “Black Building Test”. Many operators shy away from this test due to the perceived risk. However, not performing the test, and so not verifying the correct operation of these systems, has proved to be a far greater risk for many data centres. There have been several very well publicised outages as a direct result of not performing this test.