The present invention relates, in general, to the field of systems and methods for dynamic information storage or retrieval. More particularly, the present invention relates to a system and method for effectuating distributed consensus utilizing shared storage resources and state coordination among members of a processor set in a multiprocessor environment.
A computer system generally includes at least one processor to perform computations and control components coupled to the computer system. Some computer systems include multiple processors that access shared resources within the computer system. For example, multiple processors may individually access a shared hard disk, a shared input/output (“I/O”) channel, a shared peripheral or a shared memory space to perform a particular function. Furthermore, such multiprocessor systems may allow a processor to communicate with other processors within the computer system through access to shared resources. For example, it is common for a processor to store data intended for another processor in a shared memory location. Thereafter, the other processor can read the data from the shared memory location.
It is also common for multiple processors in a computer system to share a storage location, for example, in a database stored on a hard disk. Preferably, access to the shared storage location is coordinated to provide exclusive access to the shared storage location by any single processor. Otherwise, one processor may independently modify the contents of the shared storage location without notice to another processor accessing the shared storage location at approximately the same time. Such processors are termed “competing processors”, in that they are competing for access to a shared storage location. A possible result of non-exclusive access to a shared storage location by competing processors is that corrupted or unintended data may be read or stored in the shared storage location by one of the processors.
The aforementioned co-pending patent application discloses a particularly efficacious system and method for providing exclusive access to shared storage that does not rely on advance knowledge of the set of processors potentially accessing the shared storage. Furthermore, it advantageously affords an exclusive access solution that accommodates competing processors without deadlock, accommodates the unpredictable timing properties of a shared storage subsystem and does not rely on the particular properties of any particular shared storage subsystem.
One technique for synchronizing distributed state among a set of processors is known as “distributed consensus”. Functionally, each processor is viewed as a state machine and all processors initially start in the same state. An input to the state machine (i.e., a command) produces an output and a new state. If all the processors agree on the inputs to their state machines (i.e., a consensus), then all of the processors will have the same state. Certain distributed consensus techniques also allow for processors to fail and then catch up with the current state when they restart. One such published algorithm (a.k.a. the “Paxos” algorithm) suitable for a variety of distributed systems is described by Leslie Lamport in “The Part-Time Parliament”, ACM Transactions on Computer Systems, Vol. 16, No. 2, May 1998, pages 133-169, the disclosure of which is herein specifically incorporated by this reference.
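The state machine view described above can be illustrated with a minimal sketch. The `CounterMachine` class, its `add` and `mul` commands and the `run` helper are hypothetical names chosen for illustration only; they do not appear in the disclosure. The sketch merely shows that two replicas starting in the same state and applying the same agreed sequence of commands finish in the same state, whereas a different ordering generally does not.

```python
from dataclasses import dataclass


@dataclass
class CounterMachine:
    """Toy deterministic state machine (an assumption for illustration)."""
    state: int = 0

    def apply(self, command: str) -> int:
        # Each input command produces an output and a new state.
        op, arg = command.split()
        if op == "add":
            self.state += int(arg)
        elif op == "mul":
            self.state *= int(arg)
        else:
            raise ValueError(f"unknown command: {command}")
        return self.state


def run(commands):
    """Start from the common initial state and apply the commands in order."""
    machine = CounterMachine()
    outputs = [machine.apply(c) for c in commands]
    return machine.state, outputs


agreed = ["add 2", "mul 5", "add 1"]
replica_a = run(agreed)
replica_b = run(agreed)
reordered = run(["mul 5", "add 2", "add 1"])
```

Here `replica_a` and `replica_b` are identical because both replicas consumed the same inputs in the same order, while `reordered` ends in a different state.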
To date, however, all such distributed consensus processes have utilized communication among the processors in the set in order to obtain consensus. An inherent deficiency of such techniques is that they require a majority of a known set of processors to participate in the consensus. If a majority of the processors is not available, the process fails to make forward progress. This is undesirable in those instances where processors are relatively expensive in terms of overall system cost and a single surviving processor must be able to continue to provide service.
The system and method of the present invention achieve distributed consensus among members of a processor set even when only a single processor is operating. This is achieved by having a collection of processors jointly implement a virtual state machine that utilizes a sequence of numbered input commands. System synchronization is achieved by having all of the processors agree on the sequence of input commands so that they execute the same virtual state machine. Input commands are numbered consecutively and the processors use a set of shared stores (i.e., disk drives) to communicate amongst themselves requests (i.e., ballots) for new state machine inputs (or commands) and state machine inputs that have already been chosen (i.e., committed commands). A consensus process is used to decide upon (or commit) each command. Furthermore, this consensus is achieved using a majority of known stores rather than a majority of known processors. Therefore, when consensus is achieved, it exists on the system stores (e.g., the disk drives) and not in the processors.
In a particular embodiment of the present invention disclosed herein, the process is implemented utilizing a known set of “consensus disks” comprising the shared stores. Each processor participating in distributed consensus has one disk block reserved to that processor on each consensus disk. An exemplary disk block may contain the following information: a) a list of the most recently committed commands; b) a ballot number; c) the command a processor is trying to commit; d) the processor's unique identification (“ID”); and e) any additional information needed to determine the current state of the virtual machine. Each processor also maintains a copy of its current state, and this state may be in the same form as that of the disk blocks.
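The exemplary disk block contents listed above might be modeled as follows. This is a sketch under stated assumptions: the `DiskBlock` name, its field names, the JSON encoding and the 512-byte `BLOCK_SIZE` are all hypothetical and are not specified by the disclosure, which leaves the on-disk layout open.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

BLOCK_SIZE = 512  # assumed block size in bytes; not given in the text


@dataclass
class DiskBlock:
    processor_id: int                      # d) the processor's unique ID
    ballot_number: int = 0                 # b) the ballot number
    pending_command: Optional[str] = None  # c) command being committed
    committed: List[str] = field(default_factory=list)  # a) recent commits
    extra: dict = field(default_factory=dict)            # e) additional state

    def to_bytes(self) -> bytes:
        """Encode the record and pad it to exactly one block."""
        raw = json.dumps(asdict(self)).encode("utf-8")
        if len(raw) > BLOCK_SIZE:
            raise ValueError("record does not fit in one block")
        return raw.ljust(BLOCK_SIZE, b"\x00")

    @classmethod
    def from_bytes(cls, data: bytes) -> "DiskBlock":
        """Strip the padding and decode the record."""
        return cls(**json.loads(data.rstrip(b"\x00").decode("utf-8")))


block = DiskBlock(processor_id=3, ballot_number=42,
                  pending_command="set x=1", committed=["set x=0"])
encoded = block.to_bytes()
decoded = DiskBlock.from_bytes(encoded)
```

Because the in-memory copy of a processor's state may take the same form as the disk block, the same record type can serve both purposes in this sketch.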
The procedure for reserving one disk block for each processor on each consensus disk necessitates some means for reserving exclusive access to the disk long enough for a processor to reserve a block. This reservation is recorded in a “directory block” that assigns processor identifications (“IDs”) to disk blocks. To this end, known mutual exclusion algorithms may be utilized, and the system and method for exclusive access to shared storage disclosed and claimed in the aforementioned patent application incorporated by reference herein is one particularly efficacious technique.
As disclosed in greater detail herein, an exemplary distributed consensus process in accordance with the present invention requires each processor participating in the consensus algorithm to have a unique ID not shared by any other participating processor. This ID may, in some instances, be conveniently considered to be the low-order digit of all of its ballot numbers in order that ballot numbers issued by different processors are unique and totally ordered. Furthermore, since a processor must be able to read and write a majority of the known set of “consensus disks” in order to make forward progress with this process, each processor that desires to submit a numbered state machine input for consensus agreement (i.e., commit a numbered command) will implement the process.
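The low-order-digit convention for ballot numbers can be sketched as below. The function name `next_ballot`, the `seen` parameter and the assumption that IDs are single decimal digits (0 through 9) are illustrative choices, not part of the disclosure. The function returns a ballot number strictly greater than both the processor's current ballot and the highest ballot it has observed, with the processor ID in the low-order digit.

```python
MAX_PROCESSORS = 10  # assumption: IDs 0..9, so an ID fits one decimal digit


def next_ballot(current_ballot: int, processor_id: int, seen: int = 0) -> int:
    """Return a ballot above max(current_ballot, seen) ending in processor_id."""
    if not 0 <= processor_id < MAX_PROCESSORS:
        raise ValueError("processor ID must be a single decimal digit")
    floor = max(current_ballot, seen)
    # Advance the high-order part past the floor, then append the ID digit.
    round_number = floor // MAX_PROCESSORS + 1
    return round_number * MAX_PROCESSORS + processor_id


first_by_3 = next_ballot(0, 3)
first_by_7 = next_ballot(0, 7)
retry_by_3 = next_ballot(first_by_3, 3, seen=27)
```

Two processors can never issue the same ballot number under this scheme because their low-order digits differ, and ordinary integer comparison gives the total order.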
A representative process for distributed consensus utilizing shared storage resources and state coordination among members of a processor set in a multiprocessor environment as disclosed herein may conveniently operate in two separate rounds. In a first round, a processor is allowed to set its ballot number to a value greater than or equal to its current ballot number. (Generally, the ballot number is chosen to be greater than the numbers of any other ballots in progress). At this point, the processor reads its own disk block on each consensus disk in order to obtain current knowledge of the virtual state machine execution and the ballots it has already issued. If the processor already has knowledge of this information, this step can be omitted.
The processor then writes its current information to its own disk block on each consensus disk and it then reads a directory of processor-to-disk-block assignments for the other processors participating in the process. This directory is also kept on disk and a mutual exclusion algorithm protects writes to this directory as previously described. The processor then reads the disk block for each other processor on each consensus disk in order to detect if another processor is attempting to commit a command. If the processor reads that another processor has already committed a command with this number, then the reading processor aborts the process and adopts the already committed command. If it reads that another processor has issued a higher-numbered ballot, or it reads that the processor itself has issued an equal or higher numbered ballot for the same command, it aborts its own ballot.
This first round completes when the processor has read the disk block of every processor in the directory from a majority of the consensus disks. When round 1 is complete, the processor chooses the command from the highest numbered ballot that was found while reading the disk blocks. If no command was found, the processor can attempt to submit its own command for balloting.
The second round begins with the processor writing its current information to its own disk block on each consensus disk. At this point, the processor then reads a directory of processor-to-disk-block assignments for the other processors participating in the algorithm. This directory is also kept on disk and the same mutual exclusion process may be used to protect writes to this directory. The processor then reads the disk block for each other processor on each consensus disk. If it reads that another processor has already committed a command with this number, then the reading processor aborts the process and adopts the already committed command. If it reads that another processor has issued a higher-numbered ballot for the same command, it aborts its own ballot.
This second round completes when the processor has read the disk block of every processor in the directory from a majority of the consensus disks. At this point, the processor moves the committed command to the list of most recently committed commands and then writes its current state to its disk block on each consensus disk.
The final read operation in each round essentially detects if another processor is attempting to commit a command with the same command number. If that were the case, the highest-numbered ballot would take precedence. However, a processor can begin the first round again with a higher ballot number.
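The two rounds described above can be sketched for a single command number as follows. This is a heavily simplified illustration and not the disclosed implementation: the consensus disks are modeled as in-memory dictionaries, an unavailable disk is modeled as `None`, the directory block and mutual exclusion are omitted, and the names `Block`, `Aborted`, `write_then_read`, `check` and `propose` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Block:
    ballot: int = 0                  # ballot number last written
    proposal: Optional[str] = None   # command the processor is trying to commit
    committed: Optional[str] = None  # command already committed, if known


Disk = Dict[int, Block]  # processor ID -> that processor's reserved block


class Aborted(Exception):
    """The ballot was abandoned; the caller may retry with a higher one."""


def write_then_read(disks: List[Optional[Disk]], me: int, block: Block) -> List[Block]:
    """Write our block to each reachable disk, then read the other blocks."""
    seen: List[Block] = []
    reached = 0
    for disk in disks:
        if disk is None:  # simulated unavailable disk
            continue
        disk[me] = block
        seen.extend(b for pid, b in disk.items() if pid != me)
        reached += 1
    if reached <= len(disks) // 2:
        raise Aborted("fewer than a majority of consensus disks reachable")
    return seen


def check(seen: List[Block], ballot: int) -> Optional[str]:
    """Return an already committed command, or abort on a higher ballot."""
    for other in seen:
        if other.committed is not None:
            return other.committed
    if any(other.ballot > ballot for other in seen):
        raise Aborted("a higher-numbered ballot is in progress")
    return None


def propose(disks: List[Optional[Disk]], me: int, ballot: int, command: str) -> str:
    # Round 1: announce the ballot and look for competing or committed commands.
    seen = write_then_read(disks, me, Block(ballot=ballot))
    chosen = check(seen, ballot)
    if chosen is None:
        pending = [b for b in seen if b.proposal is not None]
        if pending:  # adopt the command of the highest-numbered ballot found
            command = max(pending, key=lambda b: b.ballot).proposal
        # Round 2: write the chosen command and re-check the other blocks.
        seen = write_then_read(disks, me, Block(ballot=ballot, proposal=command))
        chosen = check(seen, ballot)
        if chosen is None:
            chosen = command
    # Record the committed command in our block on each reachable disk.
    final = Block(ballot=ballot, proposal=chosen, committed=chosen)
    for disk in disks:
        if disk is not None:
            disk[me] = final
    return chosen


disks: List[Optional[Disk]] = [{}, {}, {}]
first = propose(disks, me=1, ballot=11, command="A")
second = propose(disks, me=2, ballot=12, command="B")
```

In this run a single processor commits "A" on its own, and the second processor, on reading the first processor's block, adopts "A" rather than committing "B". Progress depends on reaching a majority of the disks, not a majority of the processors.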
Competing processors can thus prevent each other from committing a command. Utilization of a secondary process that causes one of the processors to back off and allow the other to commit would solve this problem. An exemplary process is referred to as the “weak leader election” in the aforementioned Paxos algorithm.
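One simple way to make a competing processor back off is a randomized, exponentially growing delay before it retries with a higher ballot number. The sketch below shows that substitute technique only; it is not the weak leader election of the Paxos algorithm, and the `backoff_delay` name and its `base` and `cap` parameters are assumptions introduced for illustration.

```python
import random


def backoff_delay(attempt: int, rng: random.Random,
                  base: float = 0.01, cap: float = 1.0) -> float:
    """Return a random delay in seconds before retry number `attempt`."""
    # The upper bound doubles with each failed attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


rng = random.Random(0)  # seeded only to make the example repeatable
delays = [backoff_delay(attempt, rng) for attempt in range(10)]
```

Because each processor draws its delay independently, two processors that have aborted each other's ballots are unlikely to retry at the same moment, which gives one of them time to complete both rounds.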
Particularly disclosed herein is a multiprocessor computer system for effectuating distributed consensus among two or more processors. The system comprises at least one shared storage device accessible by each of the processors, a directory block designated on the storage device indicative of each of the processors participating in said consensus and a reserved portion on the storage device corresponding to each of the processors designated in the directory block. The reserved portion includes a listing of the most recently committed commands, a number assigned by the processor to a requested command and an identification of the requested command. Each of the processors is operative to read the directory block and the reserved portion of the storage device for all of the processors participating in the consensus.
Also particularly disclosed herein is a method and computer program storage medium readable and executable by a computer for effectuating distributed consensus among two or more uniquely identifiable processors in a multiprocessor computing system incorporating at least one shared storage device. The method comprises the steps of incrementally assigning numbers to requested commands input by each of the processors, utilizing the shared storage device to communicate requested commands and previously committed commands among the processors and determining among the processors which of the requested and previously committed commands are to be executed by each of the processors based upon the assigned numbers.