1. Technical Field
The present invention relates generally to error handling in data processing systems, and in particular, to a system and method for recovering from an error occurring in association with a bus or interconnect that supports hot plugging.
2. Description of the Related Art
Computer systems generally comprise multiple devices interconnected by one or more buses or interconnects. For example, a typical computer system architecture includes a processor connected to main memory through a host or memory bus. The host or memory bus is connected to an expansion bus. In many computer architectures, the expansion bus is a local bus such as a peripheral component interconnect (PCI) bus. PCI generally designates a physical and logical interconnection between a host entity, such as a processor or bus bridge, and devices attached via so-called expansion slots designed for high-speed operation.
PCI bus architecture development has been instrumental to developments in overall system architecture. The PCI-X standard is one such development directed toward improving performance of high-bandwidth devices such as gigabit Ethernet cards, Fibre Channel, Ultra3 Small Computer System Interface (SCSI), processors interconnected as a cluster, etc.
A problem associated with bus architectures such as PCI relates to error handling. When an error is detected on the bus, such as in a PCI card slot, the address at which the failure occurred is logged. This is typically accomplished by error logging circuitry in the host bridge (or PCI-to-PCI bridge). When, for example, a PCI bus parity error (PERR) or system error (SERR) occurs on any PCI-X bus on a server system, the conventional error handling technique proceeds as follows. First, a system management interrupt (SMI) is generated in response to the PERR/SERR condition. An SMI handler is then invoked and scans all PCI-X devices in the system to identify the problematic device. The SMI handler typically records the error in an error log maintained on a bus bridge or adapter device and asserts a user indicator, such as an LED indicator, corresponding to the identified device. Finally, the SMI handler requests a service processor to generate a non-maskable interrupt (NMI) that results in a system reboot procedure.
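By way of illustration only, the conventional flow described above may be modeled as in the following sketch. The sketch is a simplified software simulation and not firmware; the names PciDevice, System, and conventional_smi_handler are hypothetical and introduced solely for this example. The two bit positions are those of the Detected Parity Error and Signaled System Error bits of the PCI configuration-space Status register; treating them as the only indicators the handler examines is an assumption made for brevity.

```python
from dataclasses import dataclass, field

# Bits of the PCI configuration-space Status register.
DETECTED_PARITY_ERROR = 1 << 15   # set by a device that detects a parity error
SIGNALED_SYSTEM_ERROR = 1 << 14   # set by a device that asserts SERR
ERROR_MASK = DETECTED_PARITY_ERROR | SIGNALED_SYSTEM_ERROR


@dataclass
class PciDevice:
    """Hypothetical model of one PCI-X device (not a real driver API)."""
    slot: int
    status: int = 0
    led_on: bool = False


@dataclass
class System:
    """Hypothetical model of the server: devices, error log, reboot flag."""
    devices: list
    error_log: list = field(default_factory=list)
    rebooted: bool = False


def conventional_smi_handler(system):
    """Simulate the conventional flow: scan, log, light the LED, reboot."""
    faulty = None
    for dev in system.devices:            # scan all PCI-X devices
        if dev.status & ERROR_MASK:
            faulty = dev                  # first device reporting an error
            break
    if faulty is not None:
        kind = "PERR" if faulty.status & DETECTED_PARITY_ERROR else "SERR"
        system.error_log.append((faulty.slot, kind))  # record in the error log
        faulty.led_on = True                          # assert the user indicator
    # The handler asks the service processor for an NMI, which in the
    # conventional technique always ends in a system reboot.
    system.rebooted = True
    return faulty
```

In this model, calling conventional_smi_handler on a System in which one device has DETECTED_PARITY_ERROR set logs that device's slot, turns on its indicator, and sets the rebooted flag. The unconditional reboot at the end of the handler is the behavior that the need stated below is directed to avoiding.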
For many computer systems, such as high-end servers, any unscheduled system shutdown is costly. Accordingly, there is a need for a technique for handling PCI bus errors that enables the system to continue reliable operation without rebooting.