FIG. 1 shows a typical Shared Nothing computer system architecture in the form of a database. In a database having such an architecture, the database information is partitioned over multiple loosely coupled processors 20, typically connected by a local area network 10. Each of the multiple processors 20a-20n typically has its own private non-volatile storage 30a-30n and its own private memory 40a-40n. One problem with a Shared Nothing architecture in which information is distributed over multiple nodes is that it typically cannot operate well if any of the nodes fails, because some of the distributed information is then no longer available. Transactions which need to access data at a failed node cannot proceed. If database relations are partitioned across all nodes, almost no transaction can proceed when a node has failed.
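The effect of a failed node on a partitioned database can be illustrated with a minimal sketch. The names `Node`, `owner`, and `read`, and the byte-sum hash used for partitioning, are hypothetical and chosen only for illustration; no particular partitioning scheme is implied by the architecture described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One processor with its own private storage (hypothetical model)."""
    name: str
    up: bool = True
    storage: dict = field(default_factory=dict)


def owner(key: str, nodes: list) -> Node:
    """Pick the single node holding `key` (assumed byte-sum hash partitioning)."""
    return nodes[sum(key.encode()) % len(nodes)]


def read(key: str, nodes: list):
    """Read `key`; fails when its owner is down, since no other node holds the data."""
    node = owner(key, nodes)
    if not node.up:
        raise RuntimeError(f"{node.name} is down; {key!r} is unavailable")
    return node.storage[key]


# Partition a few rows over three nodes.
nodes = [Node("node_a"), Node("node_b"), Node("node_c")]
for k in ("alice", "bob", "carol", "dave"):
    owner(k, nodes).storage[k] = f"row for {k}"
```

Marking the owner of a key as down (`owner("alice", nodes).up = False`) makes every subsequent `read("alice", nodes)` raise, which corresponds to a transaction that cannot proceed.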
The likelihood of a node failure increases with the number of nodes. Furthermore, there are a number of different types of failures which can result in failure of a single node. For example:
(a) A processor could fail at a node;
(b) A non-volatile storage device or controller for such a device could fail at a node;
(c) A software crash could occur at a node; or
(d) A communication failure could occur, resulting in all other nodes losing communication with a node.
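The observation that the likelihood of a node failure increases with the number of nodes can be made concrete under a simplifying assumption that is not part of the description above: each of the n nodes fails independently with the same probability p during some interval of interest. Under that assumption, the probability that at least one node has failed is:

```latex
P(\text{at least one of } n \text{ nodes fails}) = 1 - (1 - p)^{n}
```

For example, with p = 0.01 and n = 100, this gives 1 - 0.99^100, or approximately 0.63, compared with 0.01 for a single node.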
In order to provide high availability (i.e., continued operation) even in the presence of a node failure, information is commonly replicated at more than one node, so that in the event of a failure of a node, the information stored at that failed node can be obtained instead at another node which has not failed. The multiple copies of information are usually called replicas, one of which is usually considered the primary replica, with the one or more other copies considered the secondary replica(s).
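Replication with a primary and a secondary replica can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the class `ReplicatedStore`, the choice of the next node in sequence as the secondary, and the byte-sum hash are all hypothetical. Resynchronizing a node after it recovers is deliberately left out.

```python
class ReplicatedStore:
    """Keeps each key on a primary node and one secondary node (hypothetical model)."""

    def __init__(self, node_names):
        self.names = list(node_names)              # needs at least two nodes
        self.data = {name: {} for name in self.names}
        self.up = {name: True for name in self.names}

    def replicas(self, key):
        """Return [primary, secondary]; the secondary is assumed to be the next node."""
        i = sum(key.encode()) % len(self.names)
        return [self.names[i], self.names[(i + 1) % len(self.names)]]

    def write(self, key, value):
        # Every available replica is updated; this is the added workload of replication.
        for name in self.replicas(key):
            if self.up[name]:
                self.data[name][key] = value

    def read(self, key):
        # Try the primary first, then fall back to the secondary.
        for name in self.replicas(key):
            if self.up[name]:
                return self.data[name][key]
        raise RuntimeError(f"all replicas of {key!r} are unavailable")


store = ReplicatedStore(["node_a", "node_b", "node_c"])
store.write("account-17", 100)
primary, secondary = store.replicas("account-17")
store.up[primary] = False            # simulate failure of the primary's node
print(store.read("account-17"))      # served by the secondary; prints 100
```

The read succeeds despite the failure because the write was applied at both replicas, which also shows where the extra cost of maintaining replicas arises.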
The maintenance of replicas always involves an added workload for the computer system. This invention specifically relates to the problem of maintaining replicas in a more efficient manner.