1. Field of the Invention
The present invention relates to a method and system for handling errors in a system managed by a processor and, in particular, a system for handling errors in a bridge system interfacing the processor with an external device, such as a computer system.
2. Description of the Related Art
The Peripheral Component Interconnect (PCI) bus is a high-performance expansion bus architecture that was designed to replace the traditional ISA (Industry Standard Architecture) bus. A processor bus master communicates with the PCI local bus and devices connected thereto via a PCI bridge. This bridge provides a low latency path through which the processor may directly access PCI devices mapped anywhere in the memory or I/O address space. The bridge may optionally include such functions as data buffering/posting and PCI central functions such as arbitration. The architecture and operation of the PCI local bus are described in "PCI Local Bus Specification," Revision 2.0 (April, 1993) and Revision 2.1s, published by the PCI Special Interest Group, 5200 Elam Young Parkway, Hillsboro, Oreg., which publication is incorporated herein by reference in its entirety.
A PCI to PCI bridge provides a connection path between two independent PCI local busses. The primary function of the bridge is to allow transactions between a master on one PCI bus and a target device on another PCI bus. The PCI Special Interest Group has published a specification on the architecture of a PCI to PCI bridge in "PCI to PCI Bridge Architecture Specification," Revision 1.0 (Apr. 10, 1994), which publication is incorporated herein by reference in its entirety. This specification defines the following terms and definitions:
initiating bus--the master of a transaction that crosses a PCI to PCI bridge is said to reside on the initiating bus.
target bus--the target of a transaction that crosses a PCI to PCI bridge is said to reside on the target bus.
primary interface--the PCI interface of the PCI to PCI bridge that is connected to the PCI bus closest to the CPU is referred to as the primary PCI interface.
secondary interface--the PCI interface of the PCI to PCI bridge that is connected to the PCI bus farthest from the CPU is referred to as the secondary PCI interface.
downstream--transactions that are forwarded from the primary interface to the secondary interface of a PCI to PCI bridge are said to be flowing downstream.
upstream--transactions forwarded from the secondary interface to the primary interface of a PCI to PCI bridge are said to be flowing upstream.
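The directional terms defined above can be illustrated with a minimal C sketch. The type and function names (pci_iface_t, pci_flow_t, pci_flow_direction, pci_target_side) are hypothetical and chosen only for illustration; they do not come from the PCI specifications. The sketch assumes a transaction is characterized solely by the bridge interface that faces its initiating bus.

```c
/* Hypothetical names for illustration; not taken from any PCI specification. */

/* The two interfaces of a PCI to PCI bridge. */
typedef enum { PCI_IFACE_PRIMARY, PCI_IFACE_SECONDARY } pci_iface_t;

/* The two directions in which a transaction can cross the bridge. */
typedef enum { PCI_FLOW_DOWNSTREAM, PCI_FLOW_UPSTREAM } pci_flow_t;

/*
 * A transaction whose initiating bus is attached to the primary interface is
 * forwarded to the secondary interface and therefore flows downstream;
 * a transaction entering on the secondary interface flows upstream.
 */
pci_flow_t pci_flow_direction(pci_iface_t initiating_side)
{
    return (initiating_side == PCI_IFACE_PRIMARY) ? PCI_FLOW_DOWNSTREAM
                                                  : PCI_FLOW_UPSTREAM;
}

/* The target bus is attached to the interface opposite the initiating bus. */
pci_iface_t pci_target_side(pci_iface_t initiating_side)
{
    return (initiating_side == PCI_IFACE_PRIMARY) ? PCI_IFACE_SECONDARY
                                                  : PCI_IFACE_PRIMARY;
}
```

For example, under these assumptions a transaction mastered by the CPU-side bus yields pci_flow_direction(PCI_IFACE_PRIMARY) == PCI_FLOW_DOWNSTREAM, with its target on the secondary interface.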
The PCI architecture provides for the detection and signaling of both parity errors and other system errors. The error reporting chain from target to bus master to device driver and eventually to the operating system is intended to allow error recovery operations to be implemented at any level. Assertion of the SERR signal may generate a non-maskable interrupt (NMI), which is a high priority interrupt signal. The SERR signal is generally used to signal address parity errors and/or other non-parity errors. Any PCI agent can signal an SERR error, recording it by setting a bit in a configuration space register, such as the Status register.
The PCI bridge must detect address parity errors for all transactions on either the primary or the secondary interface. The PCI bridge reports the error by asserting the SERR signal and propagating the SERR signal upstream. For instance, if the bridge detects an address parity error on the primary or secondary interface, the bridge asserts the SERR signal on the primary interface, sets the SERR bit in the Status register, sets a Detected Parity Error bit in either the Status register or Secondary Status register and may signal a target abort by setting a target abort signal register. Another error signal is PERR (parity error), which the PCI bridge uses to signal a data parity error.
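The register updates described above can be sketched in C as a small software model. This is an illustrative sketch only, not bridge hardware logic: the struct serr_bridge_model_t and the function bridge_report_address_parity_error are hypothetical names. The bit positions follow the Status register layout of the PCI Local Bus Specification (bit 11 Signaled Target Abort, bit 14 Signaled System Error, bit 15 Detected Parity Error). Recording the optional target abort in the register of the detecting interface is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Status register bits per the PCI Local Bus Specification layout. */
#define PCI_STATUS_SIG_TARGET_ABORT ((uint16_t)(1u << 11))
#define PCI_STATUS_SIG_SYSTEM_ERROR ((uint16_t)(1u << 14))
#define PCI_STATUS_DETECTED_PARITY  ((uint16_t)(1u << 15))

/* Hypothetical model: which bridge interface observed the bad address. */
typedef enum { SERR_IFACE_PRIMARY, SERR_IFACE_SECONDARY } serr_iface_t;

/* Hypothetical model of the bridge state touched by error reporting. */
typedef struct {
    uint16_t status;            /* Status register (primary interface)       */
    uint16_t secondary_status;  /* Secondary Status register                 */
    bool     serr_on_primary;   /* true while SERR is driven on primary bus  */
} serr_bridge_model_t;

/* Apply the reporting steps for an address parity error seen on `where`. */
void bridge_report_address_parity_error(serr_bridge_model_t *b,
                                        serr_iface_t where,
                                        bool signal_target_abort)
{
    assert(b != NULL);

    /* Record the detection in the register of the detecting interface. */
    uint16_t *reg = (where == SERR_IFACE_PRIMARY) ? &b->status
                                                  : &b->secondary_status;
    *reg |= PCI_STATUS_DETECTED_PARITY;

    /* Optional target abort (assumption: logged on the detecting side). */
    if (signal_target_abort) {
        *reg |= PCI_STATUS_SIG_TARGET_ABORT;
    }

    /* SERR is always propagated upstream, i.e. asserted on the primary side. */
    b->serr_on_primary = true;
    b->status |= PCI_STATUS_SIG_SYSTEM_ERROR;
}
```

In this model, an error detected on the secondary interface sets Detected Parity Error only in the Secondary Status register, while the SERR indication always appears in the primary Status register, reflecting the upstream propagation described above.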
The agent detecting an error may also terminate the transaction with a master abort by setting a master abort bit. When a read transaction with an address parity error crosses a PCI to PCI bridge and is terminated by a master abort, the bridge will return FFFF FFFFh to the initiator and terminate the read transaction on the initiating bus. When a write transaction is terminated with a master abort, the bridge will complete the write transaction on the initiating bus and discard the write data.
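The master abort behavior described above can be modeled with a short C sketch. The names ma_txn_t, ma_bridge_t and bridge_finish_master_abort are hypothetical, and the discarded-write counter exists only to make the discard observable in the model. The Received Master Abort bit position (bit 13 of the Status register) follows the PCI Local Bus Specification; treating it as the "master abort bit" referred to above is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Received Master Abort bit of the Status register (PCI Local Bus Spec). */
#define PCI_STATUS_RECEIVED_MASTER_ABORT ((uint16_t)(1u << 13))

/* Value returned to the initiator for a master-aborted read. */
#define PCI_MASTER_ABORT_READ_DATA 0xFFFFFFFFu

typedef enum { MA_TXN_READ, MA_TXN_WRITE } ma_txn_kind_t;

/* Hypothetical transaction record: payload for writes, result for reads. */
typedef struct {
    ma_txn_kind_t kind;
    uint32_t      data;
} ma_txn_t;

/* Hypothetical bridge model for master abort handling. */
typedef struct {
    uint16_t status;            /* Status register of the aborting side     */
    unsigned discarded_writes;  /* model-only counter of dropped write data */
} ma_bridge_t;

/*
 * Finish a transaction that ended in a master abort on the target bus.
 * Returns true because, in both cases, the transaction is brought to an
 * end on the initiating bus rather than being left pending.
 */
bool bridge_finish_master_abort(ma_bridge_t *b, ma_txn_t *t)
{
    assert(b != NULL && t != NULL);

    b->status |= PCI_STATUS_RECEIVED_MASTER_ABORT;

    if (t->kind == MA_TXN_READ) {
        /* Read: the initiator receives all ones (FFFF FFFFh). */
        t->data = PCI_MASTER_ABORT_READ_DATA;
    } else {
        /* Write: completes on the initiating bus; the data is dropped. */
        b->discarded_writes++;
    }
    return true;
}
```

Note that in this model the initiator of a write sees no difference between a delivered write and a discarded one; only the status bit records that the abort occurred.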
In current systems, a processor functions as the master that controls the PCI to PCI bridge system. One problem with current systems is that when the master processor attached to the PCI system receives an SERR, PERR or other error signal, the operating system of the processor enters a machine check handling mode to diagnose and check the error. However, upon entering the machine check handling mode, the processor hangs because the machine check handling logic is designed to handle errors in the processor and is typically not capable of diagnosing errors generated from an external system, such as a PCI to PCI bridge network. Because the machine check handling mode for the processor cannot process an error from the external PCI bridge system, the processor system will hang and crash. As a result of this crash, data may be lost and the system will be down while the processor is rebooting. In large scale systems, such as the IBM 3990 storage controller, rebooting can take up to twenty minutes. The loss of data and down time resulting from having to reboot the system can be especially costly for such storage controllers, which manage critical data. Machine check handling for storage controllers is described in IBM publication "ESA/390 Principles of Operation," document no. SA22-7201-04 (Copyright IBM Corp. 1990, 1991, 1993, 1994, 1996, 1997), which publication is incorporated herein by reference in its entirety.
Moreover, there is typically a delay time from when an error is generated to when the processor interprets the error interrupt to perform error diagnosis and correction operations. During this delay, the processor may be processing numerous input/output (I/O) requests. Such I/O processing could cause further errors and problems to propagate through the PCI to PCI bridge system before the processor proceeds to address the error.