Today's e-business environment places great demands on the reliability and availability of the computer systems that drive its infrastructure. The rising density of circuits and interconnects, coupled with the need to run computer systems continuously, leads to an increased potential for hardware errors. Historically, computer systems have employed a variety of methods to deal with errors that occur within the data being transferred throughout the system. Low end systems tend to protect the data with parity bits, which usually results in the need to reboot the system whenever a parity error occurs.
An improvement on this approach is the use of error correction codes (ECC) to detect correctable and uncorrectable errors. Correctable errors are situations where a small number of bits flip (typically one), and the ECC code is able to calculate exactly which bits flipped and revert them to their original state. The data correction is typically done “on the fly” while the data is being transferred. Uncorrectable errors (UEs) are situations where too many bits flip (typically two or more), such that the ECC code cannot determine exactly which bits flipped. In low end and midrange systems, these UEs generally result in the need to reboot the computer system. However, in high end systems such as the S/390® Enterprise Servers, uncorrectable errors are further classified depending on the originating location of the error. This, in turn, determines the type of recovery action taken. Recovery actions range from operating system machine checks, to isolated CPs undergoing recovery, to the entire system hard stopping and requiring IML (reboot).
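The distinction between correctable and uncorrectable errors can be illustrated with a minimal sketch. It assumes a textbook extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors; this is not the code used in any of the systems discussed here, whose ECC codes cover much wider data words. The function names `secded_encode` and `secded_decode` are hypothetical.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code


def secded_decode(code):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 1 if an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treated as a single flip. The syndrome names the flipped
        # position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, and the code
        # cannot tell which ones.
        return "uncorrectable", None

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

Flipping any single bit of an encoded word yields `"corrected"` together with the original data, whereas flipping any two bits yields `"uncorrectable"`, which is the case that forces the recovery actions described above.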
Since the originating location of the error determines the type of recovery, recent high end systems such as the IBM® S/390® G5 and G6 Enterprise Servers have relied on a technique of storing special ECC code points (known as Special UEs) whenever an error is encountered in a data stream residing in main memory. During subsequent attempts to fetch that data, the data flow detects the Special UE and indicates to the processor that the line of storage is unusable. This technique allows the processor to differentiate between main memory storage errors, hardware errors in the shared Level 2 cache and surrounding subsystem, and errors occurring within the processor's own data flow. Although this scheme affords the advantage of invoking more granular types of recovery, recent implementations have focused mostly on recovery from main memory storage errors and errors in the processor data flow, and have paid little attention to the remainder of the system.
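The Special UE idea can be sketched in simplified form as follows. Everything in this sketch is an assumption made for illustration: the reserved pattern `SPECIAL_UE_WORD`, the toy checksum standing in for real ECC check bits, and the names `store`, `fetch`, and `Origin` are hypothetical and do not reflect the actual G5 or G6 encodings.

```python
from enum import Enum


class Origin(Enum):
    NONE = "no error"
    TAGGED_STORAGE = "line previously marked unusable in main memory"
    UNTAGGED = "error arose elsewhere (e.g. cache or data flow)"


# Hypothetical reserved code point: data forced to zero plus check bits that
# are never valid for zero data (valid check bits for 0 are 0 in this toy code).
SPECIAL_UE_WORD = (0, 0xA5)


def toy_check_bits(data):
    """Toy stand-in for ECC check bits: XOR of the 8 bytes of a 64-bit word."""
    check = 0
    for shift in range(0, 64, 8):
        check ^= (data >> shift) & 0xFF
    return check


def store(memory, addr, data, error_seen):
    """Write a word; if its data stream was already bad, write the Special UE."""
    if error_seen:
        memory[addr] = SPECIAL_UE_WORD
    else:
        memory[addr] = (data, toy_check_bits(data))


def fetch(memory, addr):
    """Read a word and classify where any error originated."""
    data, check = memory[addr]
    if (data, check) == SPECIAL_UE_WORD:
        return Origin.TAGGED_STORAGE, None
    if toy_check_bits(data) != check:
        return Origin.UNTAGGED, None
    return Origin.NONE, data
```

Because the tag travels with the data, a later fetch can report that the line was already known to be bad rather than attributing the failure to whatever hardware happened to read it, which is what permits the more granular recovery described above.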
Error and hardware defect detection circuitry is prevalent in large computer systems. Many of today's systems are adept at identifying these errors and properly recovering from correctable error situations. What is lacking, however, is a satisfactory way of managing uncorrectable errors for a variety of interfaces and storage elements, and of conveying information about said errors to permit appropriate recovery actions.
U.S. Pat. No. 6,163,857, entitled Computer System UE Recovery Logic, issued to Meaney et al., provides for a UE recovery system with a cache, memory and central processors.
U.S. Pat. No. 5,953,351, entitled Method and Apparatus for Indicating Uncorrectable Data Errors, issued to Hicks et al., provides a means of generating Error Correction Code (ECC) check bits for the purposes of indicating Uncorrectable Errors (UEs) to the processor.
U.S. Pat. No. 4,761,783, entitled Apparatus and Method for Reporting Occurrences of Errors in Signals Stored in a Data Processor, issued to Christensen et al., provides an apparatus for reporting errors that occur in either main storage or a storage element such as a cache.
U.S. Pat. No. 5,111,464, entitled Interrupt Reporting for Single Bit Errors, issued to Farmwald et al., focuses on an apparatus to detect memory errors and isolate the failing memory circuit. Furthermore, it teaches an intelligent method of reporting such that the processor is notified only when an occurrence of an error specifies a different circuit from that of the previous error occurrence. This method avoids unwanted processor interrupts resulting from repetitive accesses to the same memory locale in the event of a hardware failure.
U.S. Pat. No. 5,361,267, entitled Scheme for Error Handling in a Computer System, issued to Godiwala et al., focuses on a shared system bus design whereby data errors on memory read transactions can result in inefficiencies in the throughput of the system bus.
U.S. Pat. No. 5,535,226, entitled On-Chip ECC Status, issued to Drake et al., teaches a means of detecting errors within a memory storage element such as a dynamic random access memory (DRAM) and storing the status of said error detection.
U.S. Pat. No. 5,604,755, entitled Memory System Reset Circuit, issued to Bertin et al., provides a reset circuit for resetting a memory system following a radiation event.
These references include the use of a predetermined pattern of ECC Check Bits (sometimes referred to as Special UE or SPUE codes) to signal the occurrence of an uncorrectable error. Additionally, they all employ Error Correction Code (ECC) circuitry to detect and signal the presence of correctable and uncorrectable errors. Furthermore, most of them include a way of communicating a memory or storage UE to the processor. While all these aspects are important, they fail to provide a fully satisfactory solution.