1. Technical Field of the Invention
The present invention relates to general processing systems and, in particular, to a system and method for synchronizing processors in a fault tolerant multi-processor system.
2. Description of Related Art
Fault tolerant systems, such as fault tolerant computer systems, are used in real-time applications that must be up and running 24 hours a day. These systems are normally implemented using two or more redundant processing units executing the same programs in synchronization. Special methods are used to keep the processing units in synchronization, to detect and localize faults, and to reintegrate replaced units. Because specially designed hardware is often required to accomplish these functions, fault tolerant systems are usually relatively complicated to design.
It would be advantageous if the design of fault tolerant systems could be simplified by using, as often as possible, commercially available components, such as state of the art microprocessors and memory, that are not specially designed for fault tolerance. This would make fault tolerant systems less expensive, easier to design, and easier to upgrade when faster compatible components become available.
One way to simplify the design of fault tolerant systems is to lessen the requirement for run-time synchronization between the processing units. This approach simplifies the interaction between the processing units and makes it easier to use commercially available components in their design. At the same time, however, it becomes more difficult to synchronize a processing unit that has been out of synchronization with the other processing units, such as, for example, when a processing unit is replaced. Each working processing unit has a local memory in which information on all executing programs is stored. This state related information includes data describing the state of each executing program (each executing program consisting of a number of executing processes), data variables used by each executing program, etc. The replaced processing unit has to have its local memory updated with this information from one of the working processing units before the replaced processing unit can be brought into parallel operation again.
A simple method to update a replaced processing unit's local memory is to temporarily stop the normal program execution in the working processing units while copying all state related information from the working processing units to the replaced processing unit. However, this approach delays the normal program execution by an amount that is proportional to the amount of information that has to be updated and inversely proportional to the available bandwidth of the communication channel between the processing units that is used to copy that information. In most cases, this would require a very high bandwidth communication channel in order not to cause a longer operational delay than can be accepted in a fault tolerant system.
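By way of illustration only, the relationship between the amount of information, the channel bandwidth, and the resulting delay can be sketched as follows. The function names and the numeric values are hypothetical and are not taken from any particular system; protocol overhead and latency are assumed to be negligible.

```python
def stop_and_copy_delay(state_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time (seconds) that normal execution is halted while all state is copied.

    Assumes the delay is simply the amount of state divided by the channel
    bandwidth (no protocol overhead, no latency).
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return state_bytes / bandwidth_bytes_per_s


def required_bandwidth(state_bytes: int, max_delay_s: float) -> float:
    """Minimum channel bandwidth (bytes/s) needed to stay within max_delay_s."""
    if max_delay_s <= 0:
        raise ValueError("maximum delay must be positive")
    return state_bytes / max_delay_s


# Hypothetical figures: 100 MB of state over a 10 MB/s channel.
delay = stop_and_copy_delay(100_000_000, 10_000_000)   # 10.0 seconds
# Bandwidth needed to keep the same copy within half a second.
needed = required_bandwidth(100_000_000, 0.5)          # 200_000_000.0 bytes/s
```

Under these assumed figures, shortening the acceptable delay by a factor of twenty raises the required bandwidth by the same factor, which is why this approach tends to demand a very high bandwidth channel.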
Another method used to update a replaced processing unit is to keep executing the normal programs while a background process copies all state related information from the working processing units to the replaced processing unit. Any changes to state related information in the working processing units' local memory made during the background copying process are transferred to the replaced processing unit in real time over a communication channel between the working processing units and the replaced processing unit. This approach requires the bandwidth of the communication channel between the processing units to keep up with the maximum frequency of the information changes in the executing programs, which again complicates the interaction between the processing units.
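A minimal sketch of this background copying approach is given below, purely for illustration. It models local memory as a dictionary and assumes, hypothetically, that every write made by the executing programs during the copy is forwarded to the replaced unit immediately. The names `background_copy` and `writes_during_copy` are illustrative and do not describe any specific prior art implementation.

```python
def background_copy(source: dict, writes_during_copy: dict) -> tuple[dict, int]:
    """Copy `source` to a new replica while normal execution keeps writing.

    source:             local memory of the working unit (address -> value)
    writes_during_copy: step index -> list of (address, value) writes that the
                        executing programs perform right after that copy step
    Returns the replica and the number of writes forwarded in real time.
    """
    replica = {}
    forwarded = 0
    # Snapshot the address list once; later writes are handled by forwarding.
    for step, addr in enumerate(sorted(source)):
        # Background process copies one memory location.
        replica[addr] = source[addr]
        # Normal execution continues and may change any location.
        for waddr, value in writes_during_copy.get(step, []):
            source[waddr] = value
            # The change is forwarded to the replaced unit immediately, so
            # the channel must keep up with the write rate of the programs.
            replica[waddr] = value
            forwarded += 1
    return replica, forwarded


# Hypothetical example: two writes occur while three locations are copied.
memory = {0: "a", 1: "b", 2: "c"}
writes = {0: [(2, "x")], 1: [(0, "y")]}
replica, n_forwarded = background_copy(memory, writes)
```

In this sketch the replica matches the working unit's memory when the copy finishes, but only because every write was forwarded as it happened. The number of forwarded writes grows with the write rate of the executing programs, which is the bandwidth requirement noted above.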
It is, therefore, one object of the present invention to provide a simplified, yet highly reliable design for a fault tolerant system. Another object of the present invention is to use commercially available components in the design of fault tolerant systems as often as possible. A further object of the present invention is to simplify the interaction between processing units in a fault tolerant system. Still another object is to provide an improved method of re-integrating replaced units into a fault tolerant system.