This invention relates to distributed computing systems, and more particularly, to the dynamic reconfiguration of a quorum group of processors within a distributed computing system, and to a recovery procedure for one or more processors of the group which were unavailable during the dynamic reconfiguration.
Distributed computing systems employ a plurality of processing elements. These processing elements might be individual processors linked together in a network or a plurality of software instances operating concurrently in a coordinated environment. In the former case, the processors communicate with each other through a network which supports a network protocol. The protocol might be implemented by using a combination of hardware and software components. Processing elements typically communicate with each other by sending and receiving messages or packets through a common interface. One type of distributed computing system is a shared nothing distributed system wherein the processing elements do not share storage. Within such a system, the elements must exchange messages in order to agree on the state of the distributed system.
Thus, within a shared nothing distributed processing system, a message exchange protocol is needed. For example, the message exchange protocol will seek to determine the current state of a database in the distributed processing system. Specifically, the protocol needs to define which processing element has the latest version of the database, since processing elements can create different database versions. As is well known, a high availability system allows one or more processing elements to become unavailable while the system continues to perform processing. Therefore, the database can be modified within a high availability distributed processing system while one or more processing elements are unavailable (e.g., off line). When a previously unavailable processing element becomes available, an updated version of the database must be provided to that processing element.
Conventional shared nothing distributed processing systems have the restriction that a group of processing elements participating in a quorum-driven recovery must be static. That is, once a server group is defined, members cannot be added or removed dynamically, i.e., while the database is running and one or more members are potentially unavailable. The only way to make a reconfiguration change in a conventional shared nothing distributed processing system is to use a redefine operation, which requires a change to a configuration file in all servers of the system and therefore requires that all servers be currently available for the reconfiguration change.
Notwithstanding the above, in the case of highly available distributed processing systems, such as database servers, it is deemed desirable to allow the addition or deletion of servers without requiring that all servers of a group of servers be available. The distributed server recovery procedure (DSRP) provided herein allows for this modification of the configuration of the server group requiring only that a majority (quorum) of the currently defined servers be available for the modification to proceed. For example, some servers may be unconfigured (excluded from the group) while they are down, and other servers may be added. The process of adding or deleting servers while one or more servers may be unavailable is referred to herein as “dynamically reconfiguring” the quorum group of processors. Again, the traditional procedures for recovery of distributed servers require a static configuration environment.
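The majority requirement described above can be illustrated with a minimal sketch. The function names `has_quorum` and `reconfigure`, the use of server-name sets, and the choice of a strict majority (more than half) are assumptions made for illustration only; they are not taken from the procedure described herein.

```python
def has_quorum(defined_members, available_members):
    """Return True if a strict majority of the currently defined servers is available.

    Assumption: "majority" means more than half of the defined group.
    """
    defined = set(defined_members)
    # Only servers that belong to the currently defined group count toward quorum.
    available = defined & set(available_members)
    return len(available) > len(defined) // 2


def reconfigure(defined_members, available_members, add=(), remove=()):
    """Hypothetical helper: return the new group if a quorum permits the change.

    Servers in `remove` may be down at the time of the change; only a quorum
    of the currently defined group needs to be available.
    """
    if not has_quorum(defined_members, available_members):
        raise RuntimeError("no quorum of the currently defined servers is available")
    return (set(defined_members) - set(remove)) | set(add)
```

In this sketch, a group of three servers with two available can drop the unavailable server and add a new one, whereas a group of four servers with only two available cannot be changed.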
To summarize, provided herein is a method for recovering a current state of a quorum group of processors dynamically reconfigured while at least one processor of the quorum group of processors was unavailable. The method includes obtaining the current state for the dynamically reconfigured quorum group of processors from one or more processors of the quorum group of processors. Each processor of the quorum group of processors includes an incarnation number and a member list of processors which participated in a commit process resulting in its incarnation number, and the obtaining comprises checking the one or more processors of the quorum group of processors for incarnation numbers and member lists of processors which participated in commit processes resulting in the incarnation numbers.
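One way the checking of incarnation numbers and member lists might proceed is sketched below. The acceptance rule used here is an assumption made for illustration: the highest incarnation number seen among the reachable processors is accepted as current only if a strict majority of the member list recorded with that incarnation has responded. The names `ProcessorState` and `recover_current_state` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ProcessorState:
    name: str                 # identifier of the reporting processor
    incarnation: int          # incarnation number held by this processor
    members: FrozenSet[str]   # processors that took part in the commit producing it


def recover_current_state(reachable: Iterable[ProcessorState]) -> Optional[ProcessorState]:
    """Return the state judged current, or None if no decision can be made yet.

    Assumed rule (not confirmed by the text above): accept the highest
    incarnation seen only when a strict majority of its recorded member
    list is among the responders.
    """
    states = list(reachable)
    if not states:
        return None
    # Candidate: the responder holding the highest incarnation number.
    latest = max(states, key=lambda s: s.incarnation)
    # Responders that belong to the candidate's recorded member list.
    responders = {s.name for s in states} & latest.members
    if len(responders) > len(latest.members) // 2:
        return latest
    return None
```

For example, if processor `a` reports incarnation 2 with members {a, b, c} and processor `b` reports incarnation 3 with members {b, c, d}, the sketch declines to decide, because only one of the three members recorded for incarnation 3 has responded. Once `c` also reports incarnation 3, the state at incarnation 3 is accepted.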
Systems and computer program products corresponding to the above-summarized method are also described and claimed herein.
To restate, provided herein is a reconfiguration capability for dynamically reconfiguring a quorum group of processors notwithstanding that one or more processors of the group may be unavailable, as well as a recovery procedure for implementation by the processors of the group when the one or more previously unavailable processors become available. By being able to dynamically reconfigure a group of processors while one or more of the processors are unavailable, a system administrator can ensure that critical systems are maintained even if one or more processors become unavailable, provided that a quorum of processors remains. The dynamic reconfiguration capabilities and recovery procedures described herein thus provide greater flexibility in a high availability, distributed computing environment. A relaxed quorum calculation is also presented for use with a quorum-based operation, such as the recovery procedure described herein.