Historically, computing clusters that span physical locations have struggled to distinguish a network failure between the two datacenters from an actual service failure in the active datacenter. One conventional solution to this problem is to add a third datacenter. The third datacenter acts as a “witness” and can vote on which of the two datacenters should have services up. While this approach works, it increases networking and facility costs. Moreover, WAN (wide-area network) failures can still cause service outages in the active datacenter. It can be argued that customers should not experience a concurrent outage of both WAN networks if the networks are deployed on independent hardware. An alternative solution is to create a second network connection between the two locations that is failure-independent of the first. This also adds complexity to the deployment, and it becomes difficult to determine what constitutes a failure-independent connection.
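The witness arrangement described above can be illustrated with a minimal sketch. It assumes three voters (two datacenters and the witness), a simple two-of-three majority rule, and a tie-break in favor of the currently active datacenter. The names `site_to_serve`, `link`, and `links_up` are hypothetical and chosen only for illustration; real deployments may decide differently.

```python
WITNESS = "witness"


def link(a, b):
    """Represent an undirected network link between two sites."""
    return frozenset({a, b})


def site_to_serve(active, passive, links_up):
    """Return the datacenter that should run services, or None if neither should.

    Assumption: a datacenter may serve only when it can gather two of the
    three votes (its own plus the witness or the peer). Ties go to the
    currently active datacenter.
    """
    active_sees_witness = link(active, WITNESS) in links_up
    active_sees_passive = link(active, passive) in links_up
    passive_sees_witness = link(passive, WITNESS) in links_up

    if active_sees_witness or active_sees_passive:
        return active  # active still holds a majority
    if passive_sees_witness:
        return passive  # active is unreachable; passive plus witness is a majority
    return None  # no site can reach a majority


ALL_LINKS = {link("dc1", "dc2"), link("dc1", WITNESS), link("dc2", WITNESS)}

# Only the inter-datacenter link fails: the witness keeps dc1 active.
print(site_to_serve("dc1", "dc2", ALL_LINKS - {link("dc1", "dc2")}))  # dc1

# Both WAN links of dc1 fail: dc1 must stop even if it is otherwise healthy.
print(site_to_serve("dc1", "dc2", {link("dc2", WITNESS)}))  # dc2
```

The second call shows the weakness noted above: from the outside, a healthy but isolated active datacenter looks the same as a failed one, so a WAN failure alone takes it out of service.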
In a two-datacenter configuration, customers must add manual operational procedures to the solution. One such procedure is to manually activate the messaging solution in the passive datacenter (initially, datacenter2). However, this does not address the behavior of the messaging solution in the active datacenter (initially, datacenter1). For example, a power failure in the active datacenter (datacenter1) can trigger the activation of the messaging deployment in the passive datacenter (datacenter2), making it the active datacenter. If power is restored to datacenter1 without a connection to datacenter2 (or without manual intervention), datacenter1 will automatically return to service, thereby creating a “split-brain” condition. The condition is called “split-brain” because the messaging solutions in both datacenter1 and datacenter2 are in service at the same time.
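The sequence of events in this example can be traced with a small simulation. This is only a sketch: the `Datacenter` class, its methods, and `run_scenario` are hypothetical, and the automatic restart on power restoration is modeled exactly as the text describes it, with no check against the peer datacenter.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    powered: bool = True
    in_service: bool = False

    def lose_power(self):
        """A power failure stops the messaging solution."""
        self.powered = False
        self.in_service = False

    def restore_power(self):
        """Assumed behavior: services restart automatically, without
        checking whether the other datacenter has been activated."""
        self.powered = True
        self.in_service = True

    def activate_manually(self):
        """An operator brings the messaging solution into service."""
        if self.powered:
            self.in_service = True


def is_split_brain(a, b):
    """Both datacenters are serving at the same time."""
    return a.in_service and b.in_service


def run_scenario():
    dc1 = Datacenter("datacenter1", in_service=True)  # initially active
    dc2 = Datacenter("datacenter2")                   # initially passive
    dc1.lose_power()         # power failure in the active datacenter
    dc2.activate_manually()  # operator activates the passive datacenter
    dc1.restore_power()      # power returns; no link to datacenter2, no operator
    return dc1, dc2


dc1, dc2 = run_scenario()
print(is_split_brain(dc1, dc2))  # True
```

The split-brain arises in the last step of `run_scenario`, where datacenter1 returns to service without any knowledge that datacenter2 was activated in the meantime.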
A second aspect of the problem is managing site resilience for a large-scale service deployment. In a large-scale deployment, the number of systems with which an operations team must interact limits how quickly recovery can be completed. In a service environment, maximizing service uptime is essential. The resource-intensive nature of recovery in a large-scale deployment therefore increases downtime for the service and further degrades the customer experience.