1. Technical Field
The present invention relates generally to an improved data processing system, and in particular to a method, system, and product for handling errors in a data processing system. Still more particularly, the present invention provides a method, system, and product for providing extended error handling (EEH) in host bridges.
2. Description of Related Art
Logical partitioning (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform-allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the OS image.
Each distinct OS, or image of an OS, running within the platform is protected from the others such that software errors in one logical partition cannot affect the correct operation of any other partition. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that an image cannot control any resources that have not been allocated to it. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
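The disjoint allocation described above may be illustrated by the following minimal sketch. It is not taken from any actual platform firmware; the class name `Platform`, its methods, and the resource names are hypothetical and are chosen only to show that a resource has at most one owning partition and that a partition may control only the resources allocated to it.

```python
class Platform:
    """Hypothetical model of a platform that tracks one owner per resource."""

    def __init__(self, resources):
        # Every allocable resource starts out unowned.
        self.owner = {resource: None for resource in resources}

    def allocate(self, partition, resource):
        """Assign a resource to a partition, refusing overlapping allocations."""
        current = self.owner[resource]
        if current is not None and current != partition:
            raise ValueError(f"{resource} is already allocated to {current}")
        self.owner[resource] = partition

    def may_control(self, partition, resource):
        """A partition may control only resources allocated to it."""
        return self.owner.get(resource) == partition


platform = Platform(["cpu0", "cpu1", "slot3"])   # illustrative resource names
platform.allocate("partition_a", "cpu0")
platform.allocate("partition_b", "cpu1")
```

In this sketch, an attempt to allocate `cpu0` to `partition_b` raises an error, and `may_control("partition_b", "cpu0")` is false, which mirrors the requirement that the sets of resources be non-overlapping.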
Hardware resources in an LPAR system are disjointly shared among the various partitions, which are themselves disjoint, and each partition appears to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, memory DIMMs, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR system may be booted and shut down repeatedly without having to power-cycle the whole system.
In reality, some of the I/O devices that are disjointly shared among the partitions are themselves controlled by a common piece of hardware, such as a Peripheral Component Interconnect (PCI) host bridge, also referred to herein as a PHB, which may have many I/O adapters controlled by or below the bridge. Devices are coupled to the PHB utilizing these I/O adapters. This bridge may be thought of as being shared by all of the partitions to which its slots are assigned. Hence, if the bridge becomes inoperable, it affects all of the partitions that share the devices that are below the bridge. Indeed, the problem may be so severe that the entire LPAR system will crash if any partition attempts to further use the bridge. The normal course of action is to terminate the running partitions that share the bridge, which keeps the system from crashing due to this failure.
When a device error occurs, the PCI Host Bridge (PHB) to which the device is coupled assumes a non-usable, or error, state. This PHB then generates a machine check, which in turn invokes a machine check interrupt (MCI) handler. The MCI handler reports the error and terminates all of the partitions to which the PHB is assigned. This process is a “normal” solution that prevents the whole LPAR system from crashing due to a device error.
A single PHB typically supports multiple slots, each of which may be assigned to a different partition. When a device error occurs in a slot that is coupled to a PHB, and the adapter that generated the error does not support extended error handling, the device error will cause the termination of the partition to which the faulty device is assigned. The error will also cause the termination of all other partitions to which the other slots of the PHB are assigned.
When a partition is terminated, it must be rebooted before it can be utilized again. Terminating and then rebooting a partition may result in the loss of critical data that was being processed when the error occurred and the partition was terminated.
The problem described above occurs when the faulty adapter does not support extended error handling (EEH). When a faulty adapter does support EEH, the EEH features prevent the device error from propagating from the slot to the PHB which supports the slot. When a faulty adapter does not support EEH, the device error propagates, as described above, from the slot to the PHB which supports the slot, resulting in the termination of all partitions that share the PHB.
When a device supports EEH, the device itself processes and reports errors on its own without requiring the generation of a machine check or a termination of its associated partition and the other partitions.
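The difference between the two cases described above may be illustrated by the following minimal sketch. It is only a model of the behavior described in this section and not an implementation of any PHB or firmware; the names `Slot`, `HostBridge`, and `partitions_terminated`, and the partition labels, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    partition: str       # partition to which this slot is assigned
    supports_eeh: bool   # whether the adapter in this slot supports EEH


@dataclass
class HostBridge:
    # Maps a slot number to the slot below this bridge.
    slots: dict = field(default_factory=dict)

    def partitions_terminated(self, faulty_slot):
        """Return the partitions terminated when a device error occurs in a slot."""
        slot = self.slots[faulty_slot]
        if slot.supports_eeh:
            # The error is contained in the slot; no partition is terminated.
            return set()
        # The error propagates to the bridge, which enters an error state,
        # so every partition sharing the bridge is terminated.
        return {s.partition for s in self.slots.values()}


phb = HostBridge({
    1: Slot("partition_a", supports_eeh=False),
    2: Slot("partition_b", supports_eeh=True),
    3: Slot("partition_c", supports_eeh=True),
})
```

In this model, an error in slot 1 terminates all three partitions because the adapter in that slot lacks EEH support, whereas an error in slot 2 terminates none.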
Therefore, a need exists for a method, system, and product for providing a PHB which supports EEH when coupled to devices that support EEH so that an error that occurs in one device will not cause all of the partitions that share the PHB to be terminated.