1. Field of the Invention
The invention relates generally to error detection and handling in computer systems, and more particularly to a method and apparatus for identifying data which is unusable and for initiating or inhibiting diagnosis of the hardware faults that caused the data to be unusable.
2. Description of Related Art
Computer systems typically include a number of active devices or components such as processors, I/O bridges and graphics devices, as well as a memory system. Any of these devices, or the interconnections between them, can experience hardware faults which cause errors in data or difficulty reaching data through the faulty hardware.
Many error management techniques have been developed to aid diagnosis and limit the effect of these errors. One simple technique is parity checking. Parity checking utilizes a single bit (the parity bit) associated with a piece of data (typically a byte) to determine whether there is a single-bit error in the data. Parity checking cannot reliably detect multiple-bit errors, however (an even number of flipped bits leaves the parity unchanged), and provides no means for correcting even single-bit errors. A more sophisticated system uses error correction codes (ECC) to detect and even to correct some errors. ("Error detection/correction" will be used generally herein to identify the systems and bits, or codes, which are used in both error detection and error correction, including ECC codes and parity.) An ECC code consists of a group of bits associated with a piece of data. A typical ECC system may use eight ECC bits (an ECC code) to detect and correct errors in a 64-bit piece of data. The ECC bits provide enough information for an ECC algorithm to detect and correct a single-bit error, or to detect errors in two bits. If this particular system detects errors in two bits, the errors cannot be corrected.
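The limits of parity checking described above can be illustrated with a minimal sketch. It assumes even parity over a single byte; the function names `parity_bit` and `parity_ok` are hypothetical and chosen only for this illustration.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 when the byte holds an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True when the byte still agrees with the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

single_flip = original ^ 0b00000100   # one bit corrupted
double_flip = original ^ 0b00010100   # two bits corrupted

print(parity_ok(original, stored))     # True: no error
print(parity_ok(single_flip, stored))  # False: single-bit error detected
print(parity_ok(double_flip, stored))  # True: double-bit error goes unnoticed
```

Even when the single-bit error is detected, the parity bit carries no information about which bit flipped, so the byte cannot be repaired.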
If a data error can be corrected (e.g., if the memory system uses a single-bit correcting ECC code and there is a single-bit error), the error is simply corrected and operation of the system continues normally. If an error cannot be corrected, it may propagate through the system, causing additional errors and prompting diagnoses of hardware faults which may not exist.
For example, with reference to the system illustrated in FIG. 1, a fault in cache 12 may cause an error in a data value stored in the cache. If the data in cache 12 is in a modified or "dirty" state, it may have to be copied out of the cache so that other processors can use it. If an error in the modified cache line cannot be corrected, hardware diagnosis may be initiated to determine the source of the error. The error may then be propagated when it is written by processor 11 to the bus 19 and/or main memory 18. If another processor (e.g. processor 13) reads the data value and stores the data value in cache 14, it will see the uncorrectable error and may initiate a second hardware diagnosis. This second diagnosis may indicate a hardware fault in processor 13, cache 14 or main memory 18, when the error actually arose in cache 12. The error may propagate throughout the system, including processors and caches (e.g. 15, 16) which are interfaced remotely to the system (e.g. by interface 17).
If a data error cannot be corrected (e.g., if the memory system only uses parity checking, or if there are too many bit errors for an ECC system to correct), the data may be referred to as unusable data. When a prior art system attempts to access a piece of unusable data, the device from which the data is requested may respond in one of several ways. In one instance, a processor accessing the unusable data may retrieve the data, determine that it is unusable, and append a new ECC code based on the unusable data to allow subsequent errors to be detected. This would allow the original error to spread through the system without detection.
In another instance, the memory storing the unusable data might simply not return any data at all. In contrast to the previously described implementation, this would prevent the spread of the errors therein. In response to the failure to return any data, the device attempting to access the data would time out and initiate diagnosis to determine the source of the error. As indicated above, however, the error may have arisen prior to this access, so it is likely that the diagnosis will provide no useful information. In fact, the initiation of the diagnosis may actually confuse the issue of where the error arose, since the hardware involved in the access may not have caused the error. This implementation can also suffer substantial performance losses, since each device that attempts to access the data can time out and initiate a diagnosis, both of which waste otherwise useful processing power.
In another instance, the memory containing the faulty data may return the data and the associated ECC code as they were stored (i.e., with errors). In this situation, the processor accessing the data would initiate hardware diagnosis, which would likely turn out to be futile and confusing.
In another instance, the memory may return a predetermined ECC code which indicates a multiple-bit error. It could be difficult, however, for some devices accessing the data to distinguish between this predetermined ECC code (which indicates generally that the data is corrupted) and an ECC code which represents an actual, multiple-bit error. This difficulty could be increased if a subsequent single-bit error occurred in the transmission path between the memory and the accessing device.
Whatever the response of the device from which the data was requested, an access to unusable data usually results in one of two standard responses by the computer itself. The first of the standard responses is for the computer to interrupt its operation and reboot itself. This response, of course, results in the termination of all applications executing on the computer and the loss of all work performed by the applications up to that point. The applications have to be re-started and any lost work must be performed again. One of the significant problems with this response is that even those applications that did not access the unusable data, and would not have accessed this data, are nevertheless terminated.
The second of the standard responses is to provide an indication of the unusable data whenever the data is accessed. This may be accomplished by simply failing to provide the data (which typically causes the device requesting the data to time out), or by providing the data along with an ECC code which indicates that the data is unusable. This second response resolves the problem of indiscriminately terminating all applications, as only those applications that access the data are aware of the error and have to handle the error (e.g. by terminating themselves). The device that receives the data, however, is aware only that it has not received error-free data. It may therefore be difficult for this device to determine how the error arose. Consequently, error reports may be generated each time the data is accessed, which may lead to unnecessary diagnoses or diagnoses of hardware failures which may not actually have occurred. In a system which uses a specific ECC code to indicate that unusable data has been detected earlier, the occurrence of an additional, later, single-bit error may further confuse the situation. Additionally, as indicated above, waiting for the requesting device to time out increases the average latency of memory accesses and degrades the performance of the system.
The problems outlined above may in large part be solved by a system and method for improving the isolation and diagnosis of hardware faults in a computing system. Generally speaking, the system provides a mechanism for indicating whether unusable data has previously triggered diagnosis of the hardware fault that caused the data to be unusable. The mechanism employs a flag associated with the data that indicates whether diagnosis has been performed. If diagnosis has not been performed, the flag is not set (i.e., the flag has a "false" value). If diagnosis has already been performed, the flag is set (i.e., the flag has a "true" value).
Data may be unusable because it contains errors which are not correctable by available error correction mechanisms, or because it is missing (i.e., it is never received.) In one embodiment, whenever data is requested but not received, and whenever received data contains uncorrectable errors, status information is captured and diagnosis of the hardware fault that caused the error is initiated. The unusable data is passed on, along with a flag which is set to indicate that the data is unusable. If the received, unusable data had a flag which was not set, the flag is set. If the received, unusable data did not have a flag, the flag is generated and passed on with the unusable data. If the passed-on data is covered by error detection/correction, a new error detection/correction code is generated to cover the unusable data as well as possibly the flag so that any further errors can be detected. Thus, any uncorrectable errors or missing data will be indicated by the flag, but the fault that caused the data to be unusable will not be re-diagnosed each time the unusable data is passed on.
One embodiment comprises an interface which is used to convey data from one component or subsystem to another. When the interface receives data from the first subsystem, the data is examined to determine whether it contains an uncorrectable error (including missing data.) If the data contains an uncorrectable error, the interface examines the flag corresponding to the data to determine whether hardware fault diagnosis has already been initiated. If diagnosis has already been initiated, the data is passed to the second subsystem without initiating further diagnosis. If diagnosis has not been initiated, or if the flag itself is missing or unusable so that it is not clear whether diagnosis has or has not been performed, the interface initiates diagnosis and sets the flag to indicate that diagnosis has already been performed. The data (and corresponding flag) are then passed to the second subsystem. If the data contains an uncorrectable error, data error handling procedures will be performed, regardless of the value of the corresponding flag.
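The behavior of such an interface can be sketched as follows. This is an illustrative model only, not the claimed implementation: the `Transfer` record, the `convey` function and the `initiate_diagnosis` callback are hypothetical names, missing data is modeled as `None`, and a missing or unusable flag is also modeled as `None`.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Transfer:
    value: Optional[int]       # None models data that was never received
    uncorrectable: bool        # outcome of the error detection/correction check
    diagnosed: Optional[bool]  # the flag; None models a missing or unusable flag


def convey(t: Transfer, initiate_diagnosis: Callable[[], None]) -> Transfer:
    """Pass data from the first subsystem to the second, diagnosing at most once."""
    unusable = t.value is None or t.uncorrectable
    if not unusable:
        return t  # usable data passes through unchanged
    if t.diagnosed is not True:
        # Flag is false, missing, or unusable: diagnose here, then set the flag.
        initiate_diagnosis()
    return replace(t, diagnosed=True)


diagnoses = []
hop1 = convey(Transfer(0xAB, True, None), lambda: diagnoses.append("hop1"))
hop2 = convey(hop1, lambda: diagnoses.append("hop2"))
print(diagnoses)       # ['hop1']
print(hop2.diagnosed)  # True
```

In this sketch the second hop sees the set flag and forwards the unusable data without starting another diagnosis. Data error handling by the eventual consumer of the data is outside the sketch and would proceed regardless of the flag value.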
In one embodiment, the interface comprises a circuit that accepts data which does not include a flag and produces a corresponding flag. In this embodiment, data consisting of a value and an error detection/correction code is input to a detector/corrector. The detector/corrector checks the value against the error detection/correction code to determine whether the value is correct. ("Correct" is used herein to describe data for which the corresponding error detection/correction code indicates no errors.) If the value is correct, the value is output on a data line while a "false" signal is asserted on a flag line. If the value is incorrect but correctable, the value is corrected and then output on the data line while a "false" signal is asserted on the flag line. If the value is incorrect and cannot be corrected using the error detection/correction code, hardware diagnosis is initiated and a "true" signal is asserted on the flag line. When the "true" signal is asserted, the data line may output either the uncorrectable value or a predetermined value (e.g. some value describing the detected error). The circuit may include an ECC generator that produces an error detection/correction code corresponding to the value and the flag output by the circuit.
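The three cases handled by the detector/corrector can be modeled in software as a rough sketch. The example assumes a small single-error-correcting, double-error-detecting code (an extended Hamming code over a 4-bit value) rather than the eight-bit code over 64-bit data mentioned earlier; the names `encode`, `detect_correct` and `circuit` are hypothetical, and the optional ECC generator for the output is omitted.

```python
from functools import reduce
from operator import xor
from typing import Callable, List, Tuple

DATA_POS = (3, 5, 6, 7)  # codeword positions holding data; 1, 2, 4 hold check bits


def syndrome_of(word: List[int]) -> int:
    """XOR of the positions (1..7) whose bit is set; 0 for a valid codeword."""
    return reduce(xor, (pos for pos in range(1, 8) if word[pos]), 0)


def encode(nibble: int) -> List[int]:
    """Build an 8-bit codeword: word[0] is overall parity, word[1..7] is Hamming(7,4)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POS):
        word[pos] = (nibble >> i) & 1
    s = syndrome_of(word)
    for p in (1, 2, 4):
        word[p] = 1 if s & p else 0
    word[0] = sum(word[1:]) % 2
    return word


def detect_correct(word: List[int]) -> Tuple[int, str]:
    """Return (value, status) with status 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    s = syndrome_of(word)
    if sum(word) % 2 == 1:
        word[s] ^= 1  # single-bit error at position s (0 is the overall parity bit)
        status = "corrected"
    elif s != 0:
        status = "uncorrectable"  # parity is even but syndrome is nonzero: two errors
    else:
        status = "ok"
    value = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return value, status


def circuit(word: List[int], initiate_diagnosis: Callable[[], None]) -> Tuple[int, bool]:
    """Return (data line, flag line); the flag is True only for uncorrectable data."""
    value, status = detect_correct(word)
    if status == "uncorrectable":
        initiate_diagnosis()
        return value, True  # here the uncorrectable value itself is output
    return value, False


log = []
good = encode(0b1011)
one_error = list(good); one_error[5] ^= 1
two_errors = list(good); two_errors[3] ^= 1; two_errors[6] ^= 1

print(circuit(good, lambda: log.append("diag")))        # (11, False)
print(circuit(one_error, lambda: log.append("diag")))   # (11, False)
print(circuit(two_errors, lambda: log.append("diag")))  # (14, True)
print(log)                                              # ['diag']
```

Only the double-bit error raises the flag and triggers diagnosis; the correct and the corrected values leave the flag false, matching the three cases described above.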