In the design of a computer system, it is important to provide the degree of system availability required by the applications for which the system is intended. System availability ranges from the minimization of a system's down-time in the event of a failure, to the ability of a system to remain functional in spite of occurrences of failures in portions thereof.
One common method of attaining high system availability is to have a multiprocessor system wherein the workload of a failed processor can be transferred to a backup processor. One of the major concerns in this method, however, is to minimize the tradeoff cost, which, besides the cost of extra hardware, often includes a performance degradation due to additional processing cycles spent on implementing the processing backup.
In an in-flight control system of a spacecraft, for example, the application requires that failures in any one portion of the system not cause any interruption or delay of its functioning. Such availability is achieved by having N identical processing units executing in a redundant manner, so that operation of the system can continue even in the presence of failures in one or more units. While it achieves high availability, the redundancy so required is usually too costly for most commercial applications.
The Tandem computer architecture, as disclosed in U.S. Pat. No. 4,228,496, represents an alternative approach to system redundancy and availability. During normal operation, each processor in the Tandem multiprocessor system processes different transactions. When one processor fails, its workload is transferred to a backup processor. To enable backup processing, each processor periodically communicates checkpoint information to the other processors. When a backup processor takes over, it reconstructs the interrupted processes from checkpoints before continuing their processing. The need to reconstruct processes from checkpoints, however, means that the interrupted transactions must experience some real-time delay.
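The checkpoint-and-takeover scheme described above can be illustrated with a minimal sketch. It is not the implementation disclosed in U.S. Pat. No. 4,228,496; the `Processor` class, its method names, and the use of "steps" as a unit of work are all hypothetical and chosen only for illustration. The sketch shows why reconstruction from a checkpoint implies a delay: any work done after the last checkpoint must be redone by the backup.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical processor that runs transactions and holds peers' checkpoints."""
    name: str
    # Latest checkpoint received from each peer: {peer name: {txn id: steps done}}
    peer_checkpoints: dict = field(default_factory=dict)
    # Live progress of the transactions this processor runs: {txn id: steps done}
    progress: dict = field(default_factory=dict)

    def step(self, txn_id):
        """Advance one transaction by one unit of work."""
        self.progress[txn_id] = self.progress.get(txn_id, 0) + 1

    def send_checkpoint(self, backup):
        """Copy current progress to the backup (the periodic checkpoint message)."""
        backup.peer_checkpoints[self.name] = dict(self.progress)

    def take_over(self, failed_name, target_steps):
        """Rebuild the failed peer's transactions from its last checkpoint.

        Returns the number of steps that had to be redone, which stands in
        for the real-time delay seen by the interrupted transactions.
        """
        restored = self.peer_checkpoints.get(failed_name, {})
        redone = 0
        for txn_id, target in target_steps.items():
            # Start from the checkpointed state, not from the lost live state.
            self.progress[txn_id] = restored.get(txn_id, 0)
            while self.progress[txn_id] < target:
                self.step(txn_id)
                redone += 1
        return redone


primary = Processor("A")
backup = Processor("B")

for _ in range(3):
    primary.step("txn-1")
primary.send_checkpoint(backup)   # checkpoint taken after 3 steps
for _ in range(2):
    primary.step("txn-1")         # 2 further steps, never checkpointed

# Processor A fails here; B must redo the work done since the last checkpoint.
lost_progress = dict(primary.progress)
redone_steps = backup.take_over("A", lost_progress)
print(redone_steps)
```

In this sketch the redone work grows with the interval between checkpoints, while more frequent checkpoints cost additional communication during normal operation.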
An object of this invention is to provide a fault-tolerant multiprocessor system in which each processor operates independently during normal operation, but in which processing of a failed processor can be continued by backup processors. Furthermore, it is a related object of this invention to ensure that backup processing can be continued immediately, without the need for transaction reconstruction and without performance degradation.
In order to facilitate backup processing, system control information must be communicated between processors. Therefore, a further related object of this invention is to provide a method and apparatus for communicating information between multiple processors.
A prior art method for communicating common system information is described by Luiz et al. in U.S. Pat. No. 4,207,609, assigned to the assignee of the present invention. Therein, common system information required by a storage path (i.e., the map of network topology and necessary context information) is stored in a common control node (the dynamic pathing memory, DPM) in the network. Access to the common information involves communication between a processor and the DPM. Moreover, if access to the DPM becomes unavailable because of failures occurring in the DPM, the storage path would become disconnected. In other methods for communicating common information between processors, the communication is not performed transparently to the operation of the processors, resulting in system performance degradation.
Therefore, a related object of this invention is to enable the transparent communication of common system information, as well as to ensure that system availability would not be degraded because of inaccessibility of this system information.