1. Field of the Invention
The present invention relates to error identification in areas of a computer system that are used in common by multiple concurrent operations or by multiple independent processors, or by both. More particularly, the invention relates to an apparatus and method for minimizing the impact of a hardware error that occurs in an area in which operations are extensively interleaved or pipelined, or one that is detected in such an area after having been propagated into the area from some other part of the computer system.
2. Discussion of Related Art
Where the reliability and availability of the computer system are vitally important, the system's ability to recover from a hardware error is an issue of primary importance. To achieve this, it is necessary to be able to identify what needs to be recovered. However, the increased complexity of computer hardware that permits high-speed execution of multiple operations simultaneously is making such error identification extremely difficult when errors are detected in common areas of the hardware.
Various types of error flags that identify hardware devices in which an error has been detected are well known in the art: parity check flags associated with common data buses and instruction units, flags associated with multiplier and ALU saturation and overflow conditions, and other flags for particular failure modes or individual pieces of hardware. However, in the more complex systems, more than one operation is likely to be affected by a hardware error and more than one error flag is likely to be set before the affected operations are halted.
A secondary error discrimination method and apparatus is described in co-pending U.S. patent application Ser. No. 07/211,584 filed June 27, 1988 by P. D. Harney et al. and commonly assigned, which is incorporated herein by reference. This secondary error lock-out system records which error was the first error that occurred within a given area in the computer system, by latching all error flags that are set within the single clock cycle in which the first error is reported. These errors are the "primary errors." However, the processing is then halted and only the device in which the error occurred is identified.
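By way of illustration only, the latching behavior described above may be modeled in software as follows. This is a minimal sketch and not the circuitry of the referenced application: the class name PrimaryErrorLatch, the flag names, and the cycle-by-cycle trace are hypothetical and serve only to show that flags raised in the first reporting cycle are retained while later flags are locked out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass
class PrimaryErrorLatch:
    """Hypothetical model: keeps the flags set in the first cycle that reports an error."""
    locked: bool = False
    first_cycle: Optional[int] = None
    primary: FrozenSet[str] = frozenset()

    def clock(self, cycle: int, flags: Set[str]) -> None:
        # Once locked, later ("secondary") flags are ignored.
        if self.locked or not flags:
            return
        # Latch every flag raised in this same clock cycle as a primary error.
        self.primary = frozenset(flags)
        self.first_cycle = cycle
        self.locked = True


def run_trace() -> PrimaryErrorLatch:
    latch = PrimaryErrorLatch()
    # Illustrative trace of (cycle, flags raised in that cycle); names are assumptions.
    trace = [
        (0, set()),
        (1, {"bus_parity", "alu_overflow"}),
        (2, {"cache_parity"}),  # raised after the first error, so locked out
    ]
    for cycle, flags in trace:
        latch.clock(cycle, flags)
    return latch
```

In this model the latch records which devices reported the first error, but nothing about which operation was in progress in those devices, which is the limitation noted above.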
In systems that do not involve pipelining, multiprogramming or multiprocessing, one known method for pinpointing the particular operation affected by an error uses the processor's Instruction Length Register (ILR). The ILR normally contains the address of the one instruction that can be executed at a given time in such systems. When an error is detected in the processor, the ILR is promptly locked. This permits the contents of the ILR to be used as a pointer to the instruction that caused the error, as disclosed in IBM Technical Disclosure Bulletin, Vol. 28, No. 2, July 1985. However, this abruptly halts the processor's operation.
In complex systems, the foremost concern is to identify the error with a specific operation, not just a particular processor. Identification of the specific operation in which a hardware error occurred permits other operations that were already being executed in that retry domain to attempt to complete normally, that is, to "quiesce," which avoids retrying concurrent, unaffected operations. Retrying all those operations would produce unnecessary disruption of computer processing. Quiescing also reduces the need for operator intervention and narrows the scope of the retry operations that are required.
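The benefit of identifying the affected operation can be sketched as follows. This is an illustrative model only: the function plan_recovery and the operation identifiers are hypothetical, and the sketch assumes that the set of affected operations within a retry domain is already known.

```python
from typing import Iterable, List, Optional, Tuple


def plan_recovery(
    in_flight: Iterable[str],
    affected: Optional[Iterable[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (operations allowed to quiesce, operations to retry).

    If the affected operations cannot be identified (affected is None),
    every in-flight operation in the retry domain must be retried.
    """
    ops = list(in_flight)
    if affected is None:
        return [], ops
    affected_set = set(affected)
    quiesce = [op for op in ops if op not in affected_set]
    retry = [op for op in ops if op in affected_set]
    return quiesce, retry
```

For example, with three operations in flight and only "op2" identified as affected, plan_recovery(["op1", "op2", "op3"], {"op2"}) lets "op1" and "op3" complete and retries only "op2", whereas omitting the second argument retries all three.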
Software identification of a particular instruction giving rise to a software interrupt can be implemented in a multiprocessing environment by means of a uniquely assigned "instruction number", as disclosed in the copending U.S. patent application Ser. No. 200,688, filed May 31, 1988, and commonly assigned. However, the occurrence of a software interrupt in a particular operation does not require, nor does the disclosed invention provide, a method or means for tracing the subsequent history of that operation, because the affected operation has been halted by the interrupt at the affected point. No quiescing occurs in the event of such interrupts.
In areas where hardware is highly specialized and also highly interconnected, such as a cache storage area or an I/O channel controller, error propagation is inevitable. The high degree of specialization in such areas makes a complete picture of an error hard to obtain, and the pipelining used to assure more efficient use of such areas compounds the problem. Moreover, although the redundancy provided by multiprocessing computer systems increases a computer's ability to recover from errors, the task of tracing a hardware error through multiple concurrent operations, to locate data that may have been affected by the error and to identify the operations that must be retried, is more complex in these systems and much more disruptive and time consuming.
Error tracing in pipelined computer operations is complicated by the fact that an error there is not generally detected in the same machine clock cycle in which it occurs. Furthermore, it is generally desirable to allow all operations that are unaffected and can complete to do so before processing is halted in areas where there is extensive pipelining. This is also particularly true in data storage areas and areas where block transfers are made, as is explained below. Thus the subsequent effects of an error, not just its location and present extent, must be identified in such computer operations.