Introduction to Data Centre Best Practices
The single most important reason for a data centre’s existence is to maintain the availability of the services hosted within it and distributed from it. These services depend on the IT equipment located within the building, which requires four key resources: security, network resilience, and continuity of power and cooling.
The data centre operator must maintain this service availability in the most energy-efficient and cost-effective manner possible. The following series of articles presents tried and tested practices focussed on attaining this goal. The intention is to highlight industry-leading best practices, best known methods and behaviours that should be applied consistently across any data centre site.
Data Centre Operations and Management
As described above, a data centre is a building, or a segregated area of a building, whose sole purpose is to provide secure, highly available and reliable IT services.
To achieve a highly available service, the site infrastructure and operations need to be closely synchronised, extremely well understood and documented, with policies, procedures, staff and training that both identify and mitigate any risks that could cause a service interruption or threaten service continuity. To achieve long-term reliability, it is imperative that data centre facilities are operated and maintained to high standards, and that every potential risk is identified and recorded and has a plan in place to provide an appropriate level of mitigation.
Operating risk is difficult to model and quantify as it relates directly to the competency of operating personnel, the adequacy of the training they receive and the effectiveness of the processes in place. To minimise operating risk, operating and maintenance personnel must be suitably qualified to operate the equipment contained within the facility.
Quality of maintenance is critical to the reliable operation of any data centre. All maintenance activities should be fully documented, and mean time between failures (MTBF) and mean time to repair (MTTR) records should be kept for the purposes of lifecycle analysis. Suppliers should be asked to keep records of such information and make them available on request. This will increase the accuracy of risk analyses and performance trending studies.
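As an illustration of how such records might feed a lifecycle analysis, the following is a minimal sketch in Python. It assumes each failure is logged with a failure time and a restoration time for a single asset over a fixed observation period. The names Outage and mtbf_and_mttr, and the sample dates, are hypothetical and are not drawn from any particular maintenance system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One recorded failure (hypothetical record layout)."""
    failed_at: datetime    # when the asset stopped providing service
    restored_at: datetime  # when the asset was returned to service


def mtbf_and_mttr(outages, period_start, period_end):
    """Return (MTBF, MTTR) as timedeltas for one asset.

    MTBF = total operating (up) time / number of failures
    MTTR = total repair (down) time / number of failures
    """
    if not outages:
        raise ValueError("at least one recorded failure is needed")
    downtime = sum((o.restored_at - o.failed_at for o in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    failures = len(outages)
    return uptime / failures, downtime / failures


# Illustrative data only: two failures of one asset in a 30-day period.
outages = [
    Outage(datetime(2024, 1, 5, 8, 0), datetime(2024, 1, 5, 12, 0)),    # 4 h repair
    Outage(datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 16, 0)),  # 2 h repair
]
mtbf, mttr = mtbf_and_mttr(outages, datetime(2024, 1, 1), datetime(2024, 1, 31))

one_hour = timedelta(hours=1)
print("MTBF (hours):", mtbf / one_hour)  # 357.0
print("MTTR (hours):", mttr / one_hour)  # 3.0
# Steady-state availability estimate derived from the two figures.
print("Availability:", mtbf / (mtbf + mttr))
```

Keeping the raw failure and restoration timestamps, rather than only the derived averages, allows the figures to be recalculated for any period or group of assets when trending performance.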
The Data Centre Operations Team is typically made up of individuals from multiple organisations or departments, each with its own scope and responsibilities.
It is vital to successful data centre operations that these roles and responsibilities are very clearly defined and documented. Divided or uncertain responsibility increases risk, particularly in complex operational environments. Therefore all areas of responsibility, reporting lines and escalation mechanisms should be very clearly documented at the outset and made clear to all parties.
All responsibilities relating to specific building areas (offices, plant rooms, etc.) should also be very clearly defined and detailed. These should extend to all IT, Security and Facilities staff and key suppliers. The clear determination of responsibilities and organisation remains an integral part of the overall regime of Operational Best Practice, so defining the responsibilities of the various individuals, groups, departments and organisations with an interest in site operations is vital to operational performance.
All site operations personnel should be well trained and well practised, confident and competent to operate the site infrastructure at all times, including during highly stressful events such as an equipment failure that could result in loss of service. Specifically, personnel should be trained by the suppliers of the equipment installed on site to ensure the highest level of training and understanding. It is important that this training requirement is taken into account within tenders for new equipment.
A well-run data centre will always consider Human Factors in relation to site operations and management.