Distributed resiliency part one: cloud availability

Jan 25, 2018 | Articles, Design

Data centres could eventually be designed with less redundant physical infrastructure through increased use of resilient software and networks. Software and networks will ultimately assume an even greater role in ensuring the availability of data centre services.

That’s the view of a growing number of data centre experts, including Future-tech, who believe forms of so-called ‘distributed resiliency’, or ‘geo-redundancy’, will enable more facilities to be built and operated with fewer redundant power and cooling systems in the future.

However, as with most things in the data centre, the reality is more nuanced and complex. Software and networks already play a pivotal role in maintaining service quality in most data centres, and many companies use some kind of back-up or disaster recovery site. The adoption of different types of distributed resiliency is therefore a progression rather than a new phenomenon.

Resiliency defined

Data centre availability specialist Uptime Institute (UI) defines resiliency as: “the extent to which a system, digital infrastructure, or application architecture is able to maintain its intended service levels, with minimal or no impact on the end users or business objectives, in spite of planned or unplanned disruptions. It also describes the ability of a system infrastructure or application to recover full business operation after a disruption or disaster has occurred.”

Uptime has established itself as an authority on the design and operation of highly available data centres with its Tier and Maintenance & Operations (M&O) certification schemes. But given the move towards so-called hybrid IT strategies, where companies combine the use of on-premises IT, colocation and various flavours of cloud services, Uptime has extended its remit to assess the role of software and networking in maintaining resiliency.

According to Uptime, data centre resiliency can be categorised into four main types:

Traditional single-site resilience – One site with resilient facilities and IT. This is the traditional approach to resiliency, with Tier III being the de facto design. However, the downside according to UI is that, no matter how well a single data centre is protected, it remains exposed to risks that are particular to its zone or region.

Linked-site resilience – Two or more sites that are tightly connected within a region or zone using dedicated networks to achieve a higher level of availability than any individual site. The downside of this approach is that it is still potentially vulnerable to regional incidents, and so probably requires additional disaster recovery in another area to achieve greater resiliency.

Distributed site resilience – Two or more independent sites use a shared internet/VPN network to provide resiliency through multiple asynchronously connected instances. The benefits of this approach are that it can reduce or eliminate vulnerability on a local or regional level. It can also enable reduced investment in physical redundancy. The downside is that it introduces additional complexity and expense in the IT layer.

Cloud-based resilience – This very advanced approach is based on the use of distributed, virtualized applications, instances or containers using middleware, orchestration and distributed databases across multiple data centres. According to Uptime, cloud-based resiliency is probably the most effective approach, but it requires heavy IT investment, many sites and a lot of bandwidth, so it is only really an option for a handful of suppliers at the moment.
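To illustrate why several sites can achieve a higher level of availability than any individual site, the arithmetic can be sketched roughly as follows. This is a simplified illustration, not Uptime's methodology: it assumes site failures are fully independent (which the regional-incident caveat above suggests is optimistic for sites in the same area), it ignores the reliability of the failover software and network, and the 99.9% availability figure is purely hypothetical.

```python
from math import prod


def combined_availability(site_availabilities):
    """Probability that at least one site is up.

    Assumes site failures are statistically independent, which is an
    idealisation: sites in the same region can fail together.
    """
    return 1.0 - prod(1.0 - a for a in site_availabilities)


def annual_downtime_hours(availability, hours_per_year=8760):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical figure for illustration only: each site is up 99.9% of the time.
single_site = 0.999
two_sites = combined_availability([single_site, single_site])

print(f"One site:  {annual_downtime_hours(single_site):.2f} hours/year")
print(f"Two sites: {annual_downtime_hours(two_sites):.4f} hours/year")
```

Under these assumptions, the expected downtime falls sharply with each additional independent site, which is the intuition behind accepting less physical redundancy at each one. In practice, correlated failures and the added complexity in the IT layer will erode some of that gain.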

A number of companies are already adopting forms of advanced distributed resiliency, including cloud-based resiliency. For example, AWS ensures the availability of its cloud services platform through the use of Regions and Availability Zones (AZs). A Region is an area of the world where AWS operates a number of data centres. An AZ may be one data centre or a group of data centres in a specific Region. Customers can choose to replicate data either across multiple AZs in a specific Region or, for even more resiliency, across multiple Regions. AWS says it operates 46 AZs within 17 geographic Regions around the world.

Return on investment

During a recent webcast Andy Lawrence, executive director of the recently formed Uptime Institute Research unit, said an operator’s choice of resiliency strategy depends on a range of factors including the type of database and applications used as well as the business needs. “There are certain types of business applications that absolutely cannot be distributed across wide areas effectively and there are certain kinds that could be and that would need major rewrites,” he explained.

Advanced cloud-level resiliency probably won’t be financially viable for most enterprises for at least another decade. And even though spending on physical equipment such as generators and UPSs could be reduced in some scenarios, the extra spending on IT and network resiliency would probably negate the physical infrastructure savings. “You may be able to reduce some of your physical infrastructure but you will be paying for it in terms of redundant IT, software engineering, and staff,” said Lawrence.

Other challenges

There are also concerns that some forms of distributed resiliency could actually make it harder for operators to understand how resilient their IT operations really are. Rather than all workloads being contained in a single ‘fortress’ data centre, they may be distributed across multiple colocation and cloud sites that may be in different regions with varying levels of redundancy. In the event of an outage, identifying the root cause would be considerably more challenging than in a single site (see our article on data centre forensics).

Outlook

Despite these concerns, and doubts about the overall cost savings from advanced cloud resiliency, it seems likely that opportunities to employ advanced approaches to resiliency will only become more widespread as more enterprise workloads are deployed in the cloud and in edge data centres.

Part two of this report series will examine how Future-tech is responding to developments in distributed resiliency with advanced geo-redundant and ‘polymorphic’ data centre designs.


Future-tech has developed a range of innovative designs that are specific to areas such as distributed resiliency, direct component cooling, cryptocurrency and blockchain. If you would like to discuss the benefits of geo-redundant or polymorphic data centres further, please contact us at: info@future-tech.co.uk