Top Mistakes When It Comes To Data Centre Maintenance Pt 2

by | Jun 16, 2023 | Articles, Maintenance & Management

Following the first set of common mistakes I highlighted in Part 1, here are some more of the top mistakes I have seen in data centre maintenance and management over the last decade, and how they can be addressed with industry best practice:

Asset Sweating

Most data centre equipment has a life cycle of 10 to 15 years from the date of installation. UPS systems are where this differs: components such as fans, capacitors and batteries wear out sooner, with batteries lasting 5 to 7 years depending on how well they are looked after.

With that in mind, why would you continue to run a data centre on equipment that is 12 or more years old?

The simple answer is that, if spare parts are available, units should be able to perform as long as there has been, and remains, a good Planned Preventative Maintenance (PPM) schedule in place. How long they last depends on the individual piece of kit, the operating environment and how well it has been maintained.

Equipment will show you signs that it needs replacing when lots of smaller components begin to fail, or one piece of equipment has more repairs on it than another. This equipment has probably run harder than the others and will ultimately fail again, which is a good indication that the unit is past the end of its reliable operating life.
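As a rough illustration of spotting the unit that "has more repairs on it than another", the sketch below counts repairs per unit from a maintenance log and flags any unit well above the typical figure for its peers. The unit names, the log format and the 2x threshold are all assumptions made for the example, not a Future-tech method; any real threshold would need to be agreed with your PPM provider.

```python
from statistics import median

def flag_repair_outliers(all_units, repair_log, factor=2.0):
    """Return units whose repair count is at least `factor` times the median.

    all_units  -- every unit of the same type, including those never repaired
    repair_log -- (unit_id, description) tuples, one per repair (hypothetical format)
    factor     -- illustrative threshold, not an industry standard
    """
    counts = {unit: 0 for unit in all_units}
    for unit, _description in repair_log:
        counts[unit] += 1
    # Use a floor of 1 so a fleet with almost no repairs does not flag everything.
    typical = max(median(counts.values()), 1)
    return sorted(unit for unit, n in counts.items() if n >= factor * typical)

# Hypothetical cooling units and repair history.
units = ["CRAC-01", "CRAC-02", "CRAC-03", "CRAC-04"]
log = [
    ("CRAC-01", "fan belt"),
    ("CRAC-02", "humidifier bottle"),
    ("CRAC-03", "fan motor"),
    ("CRAC-03", "contactor"),
    ("CRAC-03", "compressor relay"),
    ("CRAC-03", "controller board"),
    ("CRAC-03", "fan belt"),
]
flagged = flag_repair_outliers(units, log)
print(flagged)
```

A flagged unit is a prompt for the replacement conversation, not a verdict: the engineer's view of the kit and its operating environment still decides.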

If equipment lifecycle changes have been budgeted for and a proper PPM schedule is in place, the kit will last 10 years. If you do decide to sweat the assets, understand that lots of small issues on a unit will build up to a larger critical failure, and lots of small repair bills do add up.

Be aware that parts are not always available. Most OEM suppliers do continue to run parts after a unit is discontinued, but this is not indefinite. When the objective is to ensure uptime and availability of a business’s critical data centre, a proper PPM provider like Future-tech will have that difficult conversation with you when it comes to throwing good money after bad – I’ve had it many a time.

The provider will also make sure that appropriate lifecycle budgets are in place so that lack of budget does not become an issue for the replacement of problematic systems.

Price

A customer once said to me ‘Nobody in the business will care about the money I saved if the data centre goes down’ – it is a quote I will never forget. When it comes to data centre failures, a lot of issues arise when something that has been highlighted as a potential problem, and even quoted for repair, has been deemed ‘too expensive’.

I am not being dismissive of people’s spending or budgeting, but you wouldn’t buy a car if you had no money to put petrol in it, would you?

If the data centre supports your business-critical operations – see where I’m going with this analogy now – then you need to ensure you have money for its repairs, operating costs and general maintenance to ensure a smooth, reliable and continuous operation.

When you buy new data centre equipment it will come with a warranty which, in most cases, will cover parts and labour from the OEM. The minute you don’t keep up with the Planned Preventative Maintenance schedule, this warranty is null and void. So, the end user will budget for new equipment and budget for Planned Preventative Maintenance, but then not budget for repair bills?

When you put it in black and white it makes no sense at all but, believe me, it happens on a routine basis and has been the root cause of many very public data centre failures over the years.

How much should I budget, I hear you cry?

Well, that is the easy part: look up, or ask your PPM provider to tell you, how much you spent the previous year on emergency callouts and equipment repairs, and that will give you a good starting point. You are going to renew the PPM contract, so why not allocate a budget to the smooth operation of the data centre as well? It really is as easy as that.
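The arithmetic behind that starting point can be sketched in a few lines. This is only an illustration: the cost categories, the figures and the optional `uplift` parameter are hypothetical, and the PPM contract itself is excluded because it is already budgeted separately.

```python
def repair_budget_baseline(last_year_costs, uplift=0.0):
    """Sum last year's reactive spend as a starting point for a repair budget.

    last_year_costs -- (category, amount) tuples (hypothetical format)
    uplift          -- optional fractional allowance, e.g. 0.1 for 10% (an assumption)
    """
    reactive = {"emergency_callout", "repair"}
    total = sum(amount for category, amount in last_year_costs if category in reactive)
    return round(total * (1 + uplift), 2)

# Made-up figures in pounds, for illustration only.
costs = [
    ("emergency_callout", 1200.0),
    ("repair", 3400.0),
    ("ppm_contract", 15000.0),  # already budgeted, so not counted here
    ("repair", 800.0),
]
baseline = repair_budget_baseline(costs)
with_allowance = repair_budget_baseline(costs, uplift=0.1)
print(baseline, with_allowance)
```

The uplift is there only because ageing kit tends to cost more each year; whether to apply one, and how much, is a judgment for you and your provider.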

Human error

More than 70% of data centre outages are caused by human error and not by fault of infrastructure design. Furthermore, “mistakes” that lead to an outage can often be traced to poor decision-making by senior management and institutional failure, rather than simply errors by individuals… Let that sink in.

I often find myself in meetings with people who do not design, do not build and do not maintain data centres as a full-time job with many years of experience, and yet I am frequently told how it is!

As someone who’s often the expert in the room, who in any given year could have been in 116 data centres (yes, that is how many Future-tech look after), built three or four new data centres, or reviewed a design for a 324MW campus, I say trust the true experts!

So, what has this all taught me?

Robust processes and documentation

Data centres should have established and well-documented site-specific operational processes, control methodologies and procedures.

This creates a standardised approach to operations, and every operational procedure carried out in the data centre should be completed in accordance with the appropriate documented, validated and well-practised procedure, in line with the specific RAMS (Risk Assessments and Method Statements).

Creating, documenting and maintaining these processes properly can consume a lot of time and resources, which is why Future-tech does this as standard for all our data centre clients. The benefit is a dramatic reduction in the risk associated with human error.

Staff training

Mandatory data centre-specific employee training is standard for all Future-tech Data Centre Engineers and goes a long way towards helping to eradicate human error. This ensures our engineers are data centre risk management and maintenance experts, which is imperative when it comes to an emergency callout to a data centre.

In an emergency, it takes a calm and well-trained expert to identify the root cause of a problem and take the correct action. Alternatively, send a plumber and hope for the best.

Inspections

Regular manual inspections are vital for continued operational reliability. Monitoring systems are not always accurate, so Future-tech data centre engineers take time to physically walk through and inspect all M&E systems in each data centre facility.

An AC engineer will only look at the mechanical cooling, and a UPS engineer will only look at the UPS, bypass and batteries. A data centre engineer will conduct a documented inspection of all systems, the data centre environment and operating conditions in one holistic view.

This will help the end user to identify and manage easily correctable issues on a proactive basis and prevent larger-scale problems.

Physical security

It is important to limit data centre access to authorised personnel only. This reduces the contact staff have with the equipment, and with it the opportunities for human error, without impacting the mission-critical functions of the rest of the organisation.

Those without training and experience should never be allowed into a data centre unless escorted and properly supervised. This also reduces the risks of malicious damage or security breaches resulting from the ability to physically connect to systems behind protective firewalls.

Keep it clean and tidy

Data centres have sometimes been referred to as ‘clean rooms’ with the objective of being hermetically sealed and fireproof. So why leave a cardboard box, cable clippings, plastic wrap from server boxes, wooden pallets, polystyrene, and COFFEE CUPS in the data centre?

Even small data centres can cost hundreds of thousands to build and will contain double or treble that amount of IT equipment.

It is good practice not to have desks, chairs and people regularly resident in the data centre spaces. Allowing this invites the storage of flammable materials, a build-up of dirt and rubbish and, the greatest sin of all, liquids being brought into an environment with sensitive electronic equipment and high-power circuits. These are just a few small things that can help to reduce costly issues with your data centre.

I’ll leave you with a final example.

A new build data centre was being populated with new IT equipment, and an unknown member of staff did not put the fire system manual into practice when bringing in a large pallet of IT equipment.

While unpacking the IT equipment, particulates were released into the data centre and the fire system went into alarm. Instead of pressing the fire hold-off button, the staff member chose to leg it either out of ignorance or fear.

The gas inevitably discharged and a £20k bill was presented to the client to replace it. This incident would have been avoided had it not been for human error, rooted in a lack of awareness and suitable training.

….

Richard Stacey is the Director of Operational Infrastructure at Future-tech and an accredited Uptime AOS Specialist who has been working in data centres for over 10 years.