1. Field of the Invention
The invention relates in general to fault-tolerant computer systems and, more particularly, to mechanisms for upgrading the systems to include additional central processing units (“CPUs”) while the system is operative.
2. Background Information
The fault-tolerant systems of interest operate redundant CPUs in lock-step, that is, in cycle-to-cycle synchronism. Accordingly, before an off-line CPU is brought on-line, to upgrade the system from single to double modular redundancy, or from double to triple modular redundancy, and so forth, the off-line CPU must first be synchronized to the state of an on-line CPU. Similarly, an off-line CPU must be synchronized to the on-line CPU when, for example, a faulty CPU is replaced.
In prior known lock-step systems, the on-line CPU communicates directly with the off-line CPU in accordance with a special synchronization protocol. The CPU boards in the prior systems include dedicated synchronization hardware that allows the CPUs to communicate using the synchronization protocol. Accordingly, the CPU boards are both time-consuming and expensive to design and manufacture.
Using the synchronization protocol, the on-line CPU directs the off-line CPU to set various components, such as certain registers and memory locations, to states that correspond to the states of the associated registers and memory locations of the on-line CPU. The on-line CPU thus controls a series of back and forth communications between the two CPUs, to provide the state information to the off-line CPU and to instruct the off-line CPU to use the information to set the registers and memory locations to the appropriate states. Accordingly, the other processing operations performed by the on-line CPU may be disrupted during the synchronization process.
The inventive system includes an I/O subsystem that controls the synchronization of an off-line CPU to an on-line CPU, such that much of the synchronization operation takes place essentially as a background task for the on-line CPU. The I/O subsystem requests that the on-line CPU provide certain register and memory state information to general purpose registers on an I/O board. The I/O subsystem then copies the register contents to general purpose registers on the off-line CPU board, and the off-line CPU uses the information to set the states of certain of its registers and memory. The I/O subsystem further includes a DMA engine that, at a time set by the I/O subsystem, copies pages of memory from the on-line CPU to the off-line CPU.
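By way of illustration only, the flow of state information described above may be modeled in software as follows. This is a minimal sketch and not the implementation of the inventive system. The names `CpuBoard`, `IoSubsystem`, `mailbox` and `synchronize` are hypothetical, the register names are invented for the example, and the sketch assumes that all of the state is transferred in a single pass.

```python
class CpuBoard:
    """Hypothetical model of a CPU board (all names are assumptions)."""

    def __init__(self, registers=None, memory=None):
        self.registers = dict(registers or {})  # CPU register state
        self.memory = dict(memory or {})        # page number -> page contents
        self.mailbox = {}                       # general purpose registers on the board


class IoSubsystem:
    """Hypothetical model of the I/O subsystem that drives synchronization."""

    def __init__(self):
        self.gp_registers = {}  # general purpose registers on the I/O board

    def synchronize(self, online, offline):
        # 1. At the request of the I/O subsystem, the on-line CPU provides
        #    its state information to registers on the I/O board.
        self.gp_registers = dict(online.registers)
        # 2. The I/O subsystem copies the register contents to general
        #    purpose registers on the off-line CPU board.
        offline.mailbox = dict(self.gp_registers)
        # 3. The off-line CPU uses the information to set its own state.
        offline.registers.update(offline.mailbox)
        # 4. The DMA engine copies pages of memory from the on-line CPU
        #    to the off-line CPU.
        for page, data in online.memory.items():
            offline.memory[page] = bytes(data)


online = CpuBoard({"pc": 0x100, "sp": 0x8000}, {0: b"\x01\x02", 1: b"\x03"})
offline = CpuBoard()
IoSubsystem().synchronize(online, offline)
```

In the sketch the two CPU boards never reference one another directly; every transfer passes through the I/O subsystem model, which mirrors the arrangement described above.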
At the end of the synchronization operation, the off-line CPU is directed to write to a predetermined register on the I/O board. The write operation, when performed by the off-line CPU, indicates that the off-line CPU is in a known state and ready to go on-line. The I/O subsystem then holds the off-line CPU in the known state by stalling the return of an acknowledgement of the write operation. When the on-line CPU later performs the same write operation, the on-line and the off-line CPUs are then in essentially the same state, and the I/O processor resets the CPUs to ensure that the off-line CPU goes on-line and starts a next operating cycle in lock-step with the on-line CPU.
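The stalled-acknowledgement handshake described above may be sketched, for purposes of illustration, as a small state machine. The class and method names are hypothetical. The behavior shown when the on-line CPU writes before the off-line CPU is ready is an assumption of the sketch, since that case is not addressed above.

```python
class SyncRendezvous:
    """Hypothetical model of the predetermined register on the I/O board."""

    def __init__(self):
        self.offline_waiting = False  # True while the off-line CPU's ack is stalled
        self.log = []

    def offline_write(self):
        # The off-line CPU signals that it is in a known state. The
        # acknowledgement is withheld, which holds the CPU in that state.
        self.offline_waiting = True
        self.log.append("off-line write: acknowledgement stalled")

    def online_write(self):
        # Assumption: a write by the on-line CPU while the off-line CPU is
        # not yet waiting simply does not trigger a reset.
        if not self.offline_waiting:
            self.log.append("on-line write: off-line CPU not ready")
            return False
        # Both CPUs are now in essentially the same state, so the CPUs are
        # reset and start the next operating cycle in lock-step.
        self.offline_waiting = False
        self.log.append("on-line write: resetting CPUs")
        return True


rendezvous = SyncRendezvous()
early = rendezvous.online_write()   # off-line CPU has not yet written
rendezvous.offline_write()
joined = rendezvous.online_write()  # both CPUs have now written
```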
The I/O subsystem includes comparison logic that is updated when the off-line CPU changes its status to on-line as part of the reset operation. The comparison logic then compares the output streams from the previously on-line CPUs and the newly added on-line CPU. Accordingly, after the CPUs reset, the comparison logic compares two output streams if the system went from single to double modular redundancy, or three output streams if the system went from double to triple modular redundancy, and so forth. As discussed in more detail below, when the output streams do not agree the comparison logic also properly handles voting based on the number of on-line CPUs. The system thus dynamically changes its comparison method, as CPUs are added to or removed from the system.
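One way in which comparison logic might adapt to the number of on-line CPUs is sketched below for purposes of illustration. The sketch assumes a simple majority vote, which is an assumption and is not stated above to be the voting method of the inventive system. The names `ComparisonLogic`, `set_online` and `compare` are hypothetical.

```python
from collections import Counter


class ComparisonLogic:
    """Hypothetical comparison logic that tracks the set of on-line CPUs."""

    def __init__(self, online_ids):
        self.online = set(online_ids)

    def set_online(self, cpu_id, online=True):
        # Called when a CPU changes status, for example as part of a reset.
        if online:
            self.online.add(cpu_id)
        else:
            self.online.discard(cpu_id)

    def compare(self, outputs):
        """Compare one cycle of outputs, given as a dict of cpu_id -> value.

        Returns (value, suspects). The value is None when the on-line
        CPUs disagree and no strict majority exists.
        """
        live = {cpu: outputs[cpu] for cpu in self.online}
        if not live:
            raise ValueError("no on-line CPUs to compare")
        value, votes = Counter(live.values()).most_common(1)[0]
        if votes == len(live):
            return value, set()  # all output streams agree
        if votes * 2 > len(live):
            # Assumed majority vote: the CPUs that were outvoted are suspect.
            return value, {cpu for cpu, v in live.items() if v != value}
        return None, set(live)   # disagreement with no majority


logic = ComparisonLogic({"cpu0"})
single = logic.compare({"cpu0": 5})
logic.set_online("cpu1")                               # single -> double
double = logic.compare({"cpu0": 5, "cpu1": 6})
logic.set_online("cpu2")                               # double -> triple
triple = logic.compare({"cpu0": 5, "cpu1": 6, "cpu2": 5})
```

With two on-line CPUs a disagreement can be detected but, under the majority-vote assumption, cannot be attributed to a particular CPU. With three on-line CPUs the outvoted CPU can be identified.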
The communications between the on-line CPU and the I/O subsystem, and between the I/O subsystem and the off-line CPU, do not require a special synchronization communication protocol. Accordingly, the synchronization operation is less complex than the synchronization operations of the prior lock-step systems. Further, the components involved in the synchronization operation, namely, the general purpose registers and the DMA engine, are used for more than just the synchronization operation, and are thus not dedicated synchronization hardware. Also, the synchronization operation is controlled by the I/O subsystem, and thus, the processing operations of the on-line CPU are only minimally interrupted or disrupted. Finally, the comparison logic used to ensure valid output streams dynamically changes based on the number of on-line CPUs, and the system can thus be upgraded in the field.