The invention relates to data processing systems and, more particularly, to recovery from errors occurring in data processing systems employing multiple busses.
In computers and data processing systems, a bus is commonly employed to interconnect the various elements of the system. For example, a central processing unit is typically connected to memory components, input/output (I/O) devices, etc. via a bus capable of carrying the signals associated with the operation of each element. These signals include, for example, data signals, clock signals, and other control signals. The bus must be capable of carrying such signals to all components coupled to the bus so that the desired operation can be carried out by the computer system.
As computer systems achieve increasingly higher levels of performance, it is sometimes desirable to provide more than one bus in the computer system. For example, it may be desired to provide a high speed main system bus interconnecting processors and high speed memory components, and to provide a separate bus interconnecting I/O devices such as disc drives and tape drives to an I/O controller.
The separate busses in a multibus computer system must be interconnected, which introduces complexities into the system. One method for interconnecting busses is to provide a bus interconnect adapter consisting of first and second adapter modules each connected to one of the busses, and an interconnect bus connecting the two adapter modules. When data is to be transferred from one bus to the other, a transaction is initiated on the one bus, according to a predetermined set of rules, commonly called a protocol. The adapter module connected to the bus on which the transaction is initiated obtains control of the interconnect bus and transmits data to the other adapter module over the interconnect bus. The other adapter module then initiates a transaction on the second bus.
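The transfer path described above can be illustrated with a minimal sketch. It is not taken from any particular implementation: the class names `Transaction`, `Bus`, and `AdapterModule` and the helper `connect` are hypothetical, and arbitration for the interconnect bus is modeled simply as a direct call from one adapter module to its peer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transaction:
    kind: str                   # "READ" or "WRITE"
    address: int
    data: Optional[int] = None  # payload for a WRITE, None for a READ


@dataclass
class Bus:
    name: str
    log: List[Transaction] = field(default_factory=list)

    def initiate(self, txn: Transaction) -> None:
        # Record a transaction placed on this bus under its protocol.
        self.log.append(txn)


@dataclass
class AdapterModule:
    bus: Bus
    peer: Optional["AdapterModule"] = None

    def forward(self, txn: Transaction) -> None:
        # Assumption: obtaining control of the interconnect bus always
        # succeeds, so the transfer is modeled as a direct call to the peer.
        if self.peer is None:
            raise RuntimeError("adapter module has no peer")
        self.peer.receive(txn)

    def receive(self, txn: Transaction) -> None:
        # The receiving adapter module initiates a transaction on its own bus.
        self.bus.initiate(txn)


def connect(first: Bus, second: Bus):
    # Build the two adapter modules and join them over the interconnect bus.
    a, b = AdapterModule(first), AdapterModule(second)
    a.peer, b.peer = b, a
    return a, b
```

In this sketch, a transaction initiated on the first bus and passed to `forward` on the first adapter module appears in the log of the second bus.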
A non-pended bus is often employed in multibus computer systems. On such a bus, control of the bus remains with the device initiating a transaction until the transaction is complete. Thus, a READ transaction on a non-pended bus results in control of the bus remaining with the initiating device until the responding device has returned the requested data, tying up the bus for the duration of the transaction. A WRITE transaction can be completed more quickly, since data has to travel in only one direction on the bus.
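The difference in bus occupancy between READ and WRITE transactions on a non-pended bus can be expressed as a rough timing model. The function `bus_hold_cycles` and its parameters are assumptions introduced for illustration only; actual cycle counts depend on the particular bus.

```python
def bus_hold_cycles(kind: str, transfer_cycles: int,
                    responder_latency_cycles: int) -> int:
    """Cycles for which the initiating device holds a non-pended bus.

    Hypothetical model: a WRITE occupies the bus for one one-way transfer,
    while a READ occupies it for the request, the responder's latency,
    and the return of the requested data.
    """
    if kind == "WRITE":
        return transfer_cycles
    if kind == "READ":
        return transfer_cycles + responder_latency_cycles + transfer_cycles
    raise ValueError(f"unknown transaction kind: {kind}")
```

Under this model a READ always holds the bus at least as long as a WRITE of the same transfer size, and the gap grows with the latency of the responding device.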
In order to attain higher bus performance, transactions known as "disconnected WRITE" transactions are often employed on a non-pended bus. A WRITE transaction is initiated from a device on a first bus to a device on a second bus. Immediately upon successful reception of the transaction by the bus adapter on the first bus, an acknowledge (ACK) confirmation signal is returned on the first bus to the device which initiated the transaction. As far as the initiating device knows, the transaction has been successfully completed, and additional transactions can occur on the first bus. At this point, however, the WRITE data has not yet reached its final destination on the second bus. System integrity or reliability can be reduced if an error occurs after an ACK confirmation is returned to the initiating device. For example, if a parity error occurs as a result of the transmission of data from the first bus to the second bus, completion of the WRITE transaction would result in the storage of invalid data. Therefore, the erroneous data is not stored. However, the initiating device has already been informed, via the ACK signal, that the transaction was successfully completed. The initiating device thus has no means of knowing that the data was not stored and has no reason to initiate a repeat WRITE transaction. The system thus loses necessary data and must identify the error as a non-recoverable error by generating a signal to the operating system software of the computer system to initiate a system shut-down.
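The failure mode of a disconnected WRITE can be sketched as follows. The class `DisconnectedWriteAdapter`, its method names, and the use of a single even-parity bit are assumptions made for illustration; the `corrupt` argument is a hypothetical hook that simulates a transmission fault between the busses.

```python
def parity_bit(value: int) -> int:
    # Parity of the number of set bits (0 if even, 1 if odd).
    return bin(value).count("1") % 2


class DisconnectedWriteAdapter:
    def __init__(self):
        self.pending = []       # writes ACKed but not yet delivered
        self.memory = {}        # destination storage on the second bus
        self.fatal_errors = []  # addresses whose data was lost

    def accept_write(self, address: int, data: int) -> str:
        # The ACK is returned as soon as the adapter has received the
        # transaction, before the data reaches the second bus.
        self.pending.append((address, data, parity_bit(data)))
        return "ACK"

    def drain(self, corrupt=lambda d: d) -> None:
        # Deliver queued writes to the second bus, checking parity on arrival.
        while self.pending:
            address, data, parity = self.pending.pop(0)
            received = corrupt(data)
            if parity_bit(received) != parity:
                # Invalid data is not stored. The initiator already holds an
                # ACK and will not retry, so the error is non-recoverable.
                self.fatal_errors.append(address)
                continue
            self.memory[address] = received
```

A write that is acknowledged and then corrupted in transit ends up in `fatal_errors` rather than in `memory`, while the initiating device has seen only the ACK.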
This error handling technique maintains system integrity by preventing non-recoverable errors from generating invalid data or permitting lost data in the system, but it also causes system shut-downs in cases where the error would not result in invalid or lost data. For example, an error occurring during a READ transaction would not result in the loss of data or the storage of invalid data, since the requesting node may be signalled to repeat the READ transaction request. However, known prior art multibus computer systems do not recognize that an error under such conditions is recoverable, and initiate a system shut-down for every transaction in which a parity error occurs.
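The distinction drawn above can be summarized in a short sketch contrasting the prior art response with a classification by transaction type. The function names and the string labels are hypothetical, and only the two cases discussed in the text are covered.

```python
def prior_art_response(kind: str) -> str:
    # Known prior art systems shut down on every parity error,
    # regardless of the type of transaction involved.
    return "SHUTDOWN"


def classify_parity_error(kind: str) -> str:
    """Classify a parity error by whether data would actually be lost.

    Only the two cases discussed in the text are handled; other
    transaction types are rejected rather than guessed at.
    """
    if kind == "READ":
        # The requesting node can be signalled to repeat the READ.
        return "RETRY"
    if kind == "DISCONNECTED_WRITE":
        # The initiator already holds an ACK and will not resend the data.
        return "SHUTDOWN"
    raise ValueError(f"case not covered by this sketch: {kind}")
```

The two functions agree for a disconnected WRITE and differ for a READ, which is the case in which the prior art shut-down is unnecessary.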