This invention relates to a multi-processor system in which first and second processing sets (each of which may comprise one or more processors) communicate with an I/O device bus.
The invention finds particular application in fault tolerant computer systems where two or more processing sets need to communicate with an I/O device bus in lockstep, with provision for identifying lockstep errors in order to detect faulty operation of the system as a whole.
In such a fault tolerant computer system, an aim is not only to be able to identify faults, but also to provide a structure which is able to provide a high degree of system availability. In order to provide high levels of system availability, it would be desirable for such systems automatically to attempt recovery from a fault, or error condition.
Automatic recovery from an error provides significant technical challenges in that the system has to provide an environment where it can continue to operate following a fault in a manner which does not further corrupt the system while permitting diagnostic operations to be performed.
Accordingly, an aim of the present invention is to address these technical problems.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a bridge for a multi-processor system. The bridge comprises a first processor bus interface for connection to an I/O bus of a first processing set, a second processor bus interface for connection to an I/O bus of a second processing set and a device bus interface for connection to a device bus. It also comprises a bridge control mechanism configured to be operable, in an operational mode, to permit access by at least one of the first and second processing sets to bridge resources and to the device bus and, in an error mode, to prevent access by the processing sets to the device bus and to permit restricted access by at least one of the processing sets to at least predetermined bridge resources.
By providing restricted access to selected parameters held in the bridge during an error mode, the bridge can act as a secure repository for information which can be used by the processing sets to investigate and diagnose the error and hopefully to recover therefrom. By preventing the processing sets from having access to the device bus, a faulty processing set can be prevented from corrupting devices connected to the device bus.
It should be noted that the bus interfaces referenced above need not be separate components of the bridge, but may be incorporated in other components of the bridge, and may indeed be simply connections for the lines of the buses concerned.
The bridge control mechanism can be operable, in response to detection of an error state, to cause the bridge to cease operation in the operational mode and instead to operate in the error mode.
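By way of illustration only, the mode-dependent access check described above can be sketched as follows. This is a minimal software model under stated assumptions, not the bridge logic itself: the names Mode, Target, BridgeControl, report_error and access_allowed are hypothetical, and the "restricted access" permitted in the error mode is simplified here to a plain yes/no answer per target.

```python
from enum import Enum, auto


class Mode(Enum):
    OPERATIONAL = auto()
    ERROR = auto()


class Target(Enum):
    BRIDGE_RESOURCE = auto()  # a resource held inside the bridge
    DEVICE_BUS = auto()       # a device on the device bus


class BridgeControl:
    """Hypothetical model of the bridge control mechanism's access check."""

    def __init__(self):
        self.mode = Mode.OPERATIONAL

    def report_error(self):
        # Detection of an error state ends the operational mode
        # and places the bridge in the error mode.
        self.mode = Mode.ERROR

    def access_allowed(self, target):
        if self.mode is Mode.OPERATIONAL:
            # Operational mode: bridge resources and device bus are reachable.
            return True
        # Error mode: the device bus is closed to the processing sets,
        # while bridge resources remain reachable for diagnosis.
        return target is Target.BRIDGE_RESOURCE
```

In this model, a call to report_error() is the only event that changes what access_allowed() returns, which mirrors the switch from the operational mode to the error mode on detection of an error state.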
Storage can be provided in the bridge for buffering data pending resolution of the error. For example, error state registers can be provided for saving operating parameters on entry to the error mode, read only access to the error state registers being permitted by at least one processing set during the error mode. A posted write buffer can be provided for the storage of writes already posted by at least one processing set on entry to the error mode, read only access to the posted write buffer being permitted by at least one processing set during the error mode.
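The storage described above can be modelled, purely as a sketch, by a small class that snapshots operating parameters on entry to the error mode and exposes only read-only views to the processing sets. The class name ErrorStorage, its method names and the use of a dictionary for the operating parameters are assumptions made for illustration; the actual register layout is not specified here.

```python
from types import MappingProxyType


class ErrorStorage:
    """Hypothetical model of error state registers and a posted write buffer."""

    def __init__(self):
        self._error_state = {}    # stands in for the error state registers
        self._posted_writes = []  # stands in for the posted write buffer

    def post_write(self, address, data):
        # Bridge-internal operation: record a write posted by a processing set.
        self._posted_writes.append((address, data))

    def enter_error_mode(self, operating_parameters):
        # Save a copy of the operating parameters on entry to the error mode.
        self._error_state = dict(operating_parameters)

    def read_error_state(self):
        # Read-only view: attempts to assign through it raise TypeError.
        return MappingProxyType(self._error_state)

    def read_posted_writes(self):
        # Immutable copy of the buffered writes, in posting order.
        return tuple(self._posted_writes)
```

The read-only views stand in for the read only access granted to a processing set during the error mode; the saved contents themselves are changed only through the bridge-side methods.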
The bridge control mechanism can be operable in an initial error mode to store in the posted write buffer any internal bridge write accesses initiated by the processing sets and to allow and to arbitrate any internal bridge read accesses initiated by the processing sets. It can also be operable in the initial error mode to store in a posted write buffer any device bus write accesses initiated by the processing sets and to abort any device bus read accesses initiated by the processing sets.
In a primary error mode in which a processing set asserts itself as a primary processing set, the bridge control mechanism can be operable to allow and to arbitrate any internal bridge write accesses initiated by the primary processing set, to discard any internal bridge write accesses initiated by any other processing set, and to allow and to arbitrate any internal bridge read accesses initiated by the processing sets. It can also be operable in this mode to discard any device bus write accesses initiated by the processing sets and to abort any device bus read accesses initiated by the processing sets.
The primary processing set is a processing set which determines that it is operational, and not faulty, as a result of a fault analysis process. This staged handling allows any write accesses for the bridge or for the device bus which have already been posted by the processing sets to be saved during the initial error mode. Later write accesses to the device bus can be discarded as being erroneous. Read accesses to the device bus can safely be aborted as they can be resent on exit from the error mode. Read access by the processing sets to the bridge is possible for diagnostic purposes. When a processing set asserts itself as a primary processing set, this processing set is then able to have write access to the bridge as well.
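The handling of accesses in the initial error mode and in the primary error mode can be summarised as a decision function. The following is a sketch only; the enumeration names and the function handle_access are hypothetical and are introduced solely to tabulate the behaviour described above.

```python
from enum import Enum, auto


class Mode(Enum):
    INITIAL_ERROR = auto()
    PRIMARY_ERROR = auto()


class Target(Enum):
    BRIDGE = auto()      # internal bridge access
    DEVICE_BUS = auto()  # access to the device bus


class Op(Enum):
    READ = auto()
    WRITE = auto()


class Action(Enum):
    STORE_IN_POSTED_WRITE_BUFFER = auto()
    ALLOW_AND_ARBITRATE = auto()
    ABORT = auto()
    DISCARD = auto()


def handle_access(mode, target, op, from_primary=False):
    """Return the action taken for one access in an error mode."""
    if op is Op.READ:
        # Both error modes: bridge reads proceed, device bus reads are aborted.
        if target is Target.BRIDGE:
            return Action.ALLOW_AND_ARBITRATE
        return Action.ABORT
    if mode is Mode.INITIAL_ERROR:
        # Initial error mode: every write is saved, whatever its target.
        return Action.STORE_IN_POSTED_WRITE_BUFFER
    # Primary error mode: only the primary processing set may write,
    # and only to the bridge; all other writes are discarded.
    if target is Target.BRIDGE and from_primary:
        return Action.ALLOW_AND_ARBITRATE
    return Action.DISCARD
```

The from_primary flag has no effect in the initial error mode, since no processing set has yet asserted itself as the primary processing set.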
The bridge control mechanism can be further operable, in a split operational mode, to arbitrate between the first and the second processing sets for access to each other's I/O bus and to the device bus and, in a combined operational mode, to monitor lockstep operation of the first and second processing sets.
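One way to picture the monitoring of lockstep operation in the combined operational mode is as a comparison of the accesses issued by the two processing sets. The sketch below assumes, for illustration only, that each processing set's output can be represented as a sequence of comparable transaction records; the function name and the record format are hypothetical, and the comparison actually performed by the bridge is not specified here.

```python
def first_lockstep_mismatch(trace_a, trace_b):
    """Return the index of the first differing transaction, or None.

    trace_a and trace_b are sequences of transactions (for example
    (address, data) tuples) issued by the first and second processing sets.
    """
    for index, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return index
    if len(trace_a) != len(trace_b):
        # One processing set issued an access the other did not.
        return min(len(trace_a), len(trace_b))
    return None
```

A return value other than None corresponds to a lockstep error, which in the arrangement described would be an error state causing the bridge to leave the operational mode.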
The bridge control mechanism can be operable on power up of the bridge to operate in an initial error mode until a processing set asserts itself as a primary processing set, and then to operate in the split operational mode to enable all processing sets to be set to a corresponding state before transferring to the combined operational mode.
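The power-up sequence can be sketched as a small state machine. The event names PRIMARY_ASSERTED and SETS_SYNCHRONIZED, and the functions power_up and step, are assumptions introduced for illustration; in particular, how the bridge learns that the processing sets have reached a corresponding state is not specified here.

```python
from enum import Enum, auto


class Mode(Enum):
    INITIAL_ERROR = auto()
    SPLIT = auto()
    COMBINED = auto()


class Event(Enum):
    PRIMARY_ASSERTED = auto()    # a processing set asserts itself as primary
    SETS_SYNCHRONIZED = auto()   # all processing sets are in a corresponding state


# Only the transitions of the power-up sequence are modelled.
TRANSITIONS = {
    (Mode.INITIAL_ERROR, Event.PRIMARY_ASSERTED): Mode.SPLIT,
    (Mode.SPLIT, Event.SETS_SYNCHRONIZED): Mode.COMBINED,
}


def power_up():
    # The bridge starts in the initial error mode.
    return Mode.INITIAL_ERROR


def step(mode, event):
    # Events that do not apply in the current mode leave it unchanged.
    return TRANSITIONS.get((mode, event), mode)
```

Starting from power_up(), applying PRIMARY_ASSERTED and then SETS_SYNCHRONIZED leads through the split operational mode to the combined operational mode; applying SETS_SYNCHRONIZED first has no effect in this model.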
The bridge can include a storage sub-system and a controllable routing matrix connected between the first processor bus interface, the second processor bus interface, the device bus interface and the storage sub-system, the bridge control mechanism being operable to control the routing matrix selectively to interconnect the first processor bus interface, the second processor bus interface, the device bus interface and the storage sub-system according to a current mode of operation.
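A controllable routing matrix of this kind can be modelled, as a sketch only, by a set of permitted interconnections that is replaced whenever the mode changes. The port names, the ROUTES table and the RoutingMatrix class are hypothetical. The routes listed are assumptions derived from the operational mode and error mode behaviour described above (device bus reachable only in the operational mode); they are not a statement of the actual interconnections.

```python
from enum import Enum, auto


class Port(Enum):
    PROC_A = auto()   # first processor bus interface
    PROC_B = auto()   # second processor bus interface
    DEVICE = auto()   # device bus interface
    STORAGE = auto()  # storage sub-system


# Assumed interconnections per mode (illustrative only).
ROUTES = {
    "operational": {
        (Port.PROC_A, Port.DEVICE), (Port.PROC_B, Port.DEVICE),
        (Port.PROC_A, Port.STORAGE), (Port.PROC_B, Port.STORAGE),
    },
    "error": {
        (Port.PROC_A, Port.STORAGE), (Port.PROC_B, Port.STORAGE),
    },
}


class RoutingMatrix:
    """Hypothetical model: holds the links enabled for the current mode."""

    def __init__(self):
        self._links = set()

    def set_mode(self, mode):
        # Replace all links with those assumed for the given mode.
        self._links = {frozenset(pair) for pair in ROUTES[mode]}

    def connected(self, a, b):
        # Links are undirected in this model.
        return frozenset((a, b)) in self._links
```

In this model the bridge control mechanism would call set_mode() on each change of mode, so that no path to the device bus interface exists while the bridge is in the error mode.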
The bridge can include at least one further processor bus interface for connection to an I/O bus of a further processing set.
In accordance with another aspect of the invention, there is provided a computer system comprising a first processing set having an I/O bus, a second processing set having an I/O bus, a device bus and a bridge, the bridge comprising a first processor bus interface connected to the I/O bus of the first processing set, a second processor bus interface connected to the I/O bus of the second processing set, a device bus interface connected to the device bus and a bridge control mechanism as described above.
In accordance with a further aspect of the invention, there is provided a method of operating a multi-processor system comprising a first processing set having an I/O bus, a second processing set having an I/O bus, a device bus and a bridge, the bridge comprising a first processor bus interface connected to the I/O bus of the first processing set, a second processor bus interface connected to the I/O bus of the second processing set and a device bus interface connected to the device bus, the method comprising selectively operating the bridge:
in an operational mode to permit access by at least one of the first and second processing sets to bridge resources and to the device bus; and
in an error mode to prevent access by the processing sets to the device bus and to permit restricted access by at least one of the processing sets to at least predetermined bridge resources.