There are many techniques in the computer industry for regulating access to common items. For instance, only one computer at a time can transmit data on a multi-drop communication line. To preserve the integrity of that communication line, some form of access ownership protocol must be run to uniquely select a single master or owner. Depending on the item being regulated, the controlling techniques may include, for example, collision detection, quorums, tokens, lock managers, distributed lock managers, central arbiters, back off timers, round robin scheduling or fixed arbitration.
Similar techniques are used to preserve system integrity in a fault-tolerant system. Fault-tolerance is the ability of a system to achieve desired results in spite of a failure in the system producing the result. To achieve fault-tolerance, either replication-in-time or replication-in-space must be implemented. Replication-in-time refers to reproducing the result at a later time because the original attempt did not succeed due to a failure in the system producing the result. Replication-in-space refers to having duplicate resources available at the time of the failure such that those duplicate resources are able to continue the intended operation and produce the desired result in spite of a failure.
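The two forms of replication can be contrasted in a minimal sketch. The function names `replicate_in_time` and `replicate_in_space` and the `attempts` parameter are hypothetical, introduced only for illustration. Sequential failover is also a simplifying assumption: a real replication-in-space system would typically keep its duplicate resources running concurrently.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def replicate_in_time(operation: Callable[[], T], attempts: int = 3) -> T:
    """Reproduce the result later by retrying the same resource."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a failure in the system producing the result
            last_error = exc
    raise RuntimeError("all attempts failed") from last_error


def replicate_in_space(replicas: Sequence[Callable[[], T]]) -> T:
    """Let a duplicate resource produce the result when another one fails."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # this replica failed; fall through to the next
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error
```

In the first function a single resource is asked repeatedly; in the second, several resources are each asked once. The second form is the one that raises the coordination problem, because more than one resource is able to produce the result.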
When dealing with a fault-tolerant system that uses replication-in-space techniques, care should be taken to ensure that those duplicate resources do not accidentally operate independently. For example, a fault-tolerant system can be made disaster-tolerant by geographically separating the redundant components such that no single failure event will disable the entire system. Two computers appropriately linked in side-by-side computer racks can be considered disaster-tolerant to one rack tipping over or losing power, but will not be considered disaster-tolerant to a fire in that room. The farther apart the machines are placed, the larger the area of disaster they can tolerate.
With separation comes the problem of deciding which machine should continue to operate in the event of a loss of communications between them. Both machines continuing to operate without coordination is a condition known as split-brain. Two computers operating on the same problem with the same preconditions but in uncoordinated environments may produce different but nonetheless valid results. An example is scheduling airline seats. Given the same map of assigned seats but with reservation requests arriving in different orders due to geographic separation of the computers, the choice of future seat assignments may differ between computers. Each computer's result is valid given its viewpoint of the problem space. As a result, the computers could create valid, but different, local databases that will be impossible to reconcile when communications are restored between the computers. For this reason, split-brain operation is to be avoided.
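The airline seat example can be sketched as follows. The `assign_seats` function, the passenger names, and the policy of giving each request the lowest-numbered free seat are all assumptions made for illustration; the text does not specify an assignment policy.

```python
from typing import Dict, List


def assign_seats(assigned: Dict[int, str], requests: List[str],
                 total_seats: int) -> Dict[int, str]:
    """Give each request the lowest-numbered free seat, in arrival order."""
    seats = dict(assigned)  # copy: seat number -> passenger
    for passenger in requests:
        free = next((s for s in range(1, total_seats + 1) if s not in seats), None)
        if free is None:
            raise ValueError("no free seats remain")
        seats[free] = passenger
    return seats


# Both sites start from the same map of assigned seats.
initial = {1: "Avery"}

# The same two requests arrive in a different order at each site.
site_a = assign_seats(initial, ["Blake", "Casey"], total_seats=4)
site_b = assign_seats(initial, ["Casey", "Blake"], total_seats=4)

# Each site holds a valid map (no seat is double-booked locally),
# yet the two maps disagree about who holds seats 2 and 3.
diverged = site_a != site_b
```

Each local result is internally consistent, so neither site can be called wrong. Once both maps have been acted on, for example by issuing boarding passes, there is no merge that honors both.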