1. Technical Field
The present invention relates generally to an improved data processing system, and in particular to a method, system, and product for handling errors in a data processing system. Still more particularly, the present invention provides a method, system, and product for improving isolation of I/O errors in logically partitioned data processing systems.
2. Description of Related Art
Logical partitioned (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The platform's firmware represents the partition's resources to the OS image.
Each distinct OS or image of an OS running within the platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
With respect to hardware resources in an LPAR system, these resources are disjointly shared among various partitions, themselves disjoint, each one appearing to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, memory DIMMs, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR system may be booted and shut down repeatedly without having to power-cycle the whole system.
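The disjoint allocation described above can be illustrated with a minimal sketch. The partition names, resource labels, and the function find_overlaps are hypothetical and are used only for illustration; they are not part of any platform firmware interface.

```python
from itertools import combinations


def find_overlaps(allocations):
    """Return resources that appear in more than one partition's allocation.

    `allocations` maps a (hypothetical) partition name to the set of
    resource labels assigned to it. An empty result means the allocation
    is disjoint, as an LPAR platform requires.
    """
    overlaps = {}
    for (name_a, res_a), (name_b, res_b) in combinations(allocations.items(), 2):
        shared = set(res_a) & set(res_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical example: two partitions with non-overlapping resources.
disjoint = {
    "lpar1": {"cpu0", "mem-region-0", "slot1"},
    "lpar2": {"cpu1", "mem-region-1", "slot2"},
}

# Hypothetical example: slot1 is mistakenly assigned to both partitions.
conflicting = {
    "lpar1": {"cpu0", "slot1"},
    "lpar2": {"cpu1", "slot1"},
}
```

Under these assumptions, find_overlaps(disjoint) is empty, while find_overlaps(conflicting) reports slot1 as shared between the two partitions.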
In reality, some of the I/O devices that are disjointly shared among the partitions are themselves controlled by a common piece of hardware, such as a host Peripheral Component Interconnect (PCI) bridge, which may have many I/O adapters controlled by or located below the bridge. This bridge may be thought of as being shared by all of the partitions that are assigned to its slots. Hence, if the bridge becomes inoperable, it affects all of the partitions that share the devices that are below the bridge. Indeed, the problem itself may be so severe that the whole LPAR system will crash if any partition attempts to further use the bridge. The normal course of action is to terminate the running partitions that share the bridge, which keeps the system from crashing due to this failure.
When an I/O adapter error occurs, the PCI Host Bridge (PHB) to which the I/O adapter is coupled assumes a non-usable, or error, state. This PHB then generates a machine check which in turn invokes a machine check interrupt (MCI) handler. The MCI handler reports the error and terminates the partitions to which the PHB is assigned. This process is a “normal” solution that prevents the whole LPAR system from crashing due to an I/O adapter error.
A single PHB typically supports multiple slots, each of which may be assigned to a different partition. When an I/O adapter error occurs in a slot supported by a PHB that also supports other slots assigned to different partitions, and the adapter that generated the error does not support extended error handling (EEH), the error causes the termination of the partition to which the faulty I/O adapter is assigned and also the termination of the other partitions to which the other slots of the PHB are assigned. When a faulty adapter does support EEH, the EEH features prevent the I/O adapter error from propagating from the slot to the PHB which supports the slot. When a faulty adapter does not support EEH, the I/O adapter error propagates, as described above, from the slot to the PHB which supports the slot.
When an error caused by a faulty adapter that does not support EEH is allowed to propagate to the PHB, the PHB enters a “freeze” mode that causes all further accesses to any slot supported by the PHB to fail. Thus, a single error that should have affected only one partition ends up propagating to one or more other partitions which should have been independent of each other.
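The propagation behavior described above can be modeled with a minimal sketch. The classes Slot and HostBridge, the function partitions_affected, and the slot and partition labels are hypothetical names chosen for illustration; the model captures only the behavior stated in the text, not any actual firmware or hardware interface.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    slot_id: str
    partition: str       # partition to which the slot is assigned
    supports_eeh: bool   # whether the installed adapter supports EEH


@dataclass
class HostBridge:
    slots: list
    frozen: bool = False  # True once the PHB has entered "freeze" mode


def partitions_affected(phb, faulty_slot_id):
    """Return the set of partitions affected by an adapter error.

    With EEH, the error stays in the slot, so only the owning partition
    is affected. Without EEH, the error propagates to the PHB, which
    freezes, so every partition with a slot under the PHB is affected.
    """
    faulty = next(s for s in phb.slots if s.slot_id == faulty_slot_id)
    if faulty.supports_eeh:
        return {faulty.partition}
    phb.frozen = True
    return {s.partition for s in phb.slots}


# Hypothetical PHB with three slots assigned to three partitions.
phb = HostBridge(slots=[
    Slot("slot1", "lpar1", supports_eeh=True),
    Slot("slot2", "lpar2", supports_eeh=False),
    Slot("slot3", "lpar3", supports_eeh=True),
])
```

In this model, an error in slot1 affects only lpar1, whereas an error in slot2, whose adapter lacks EEH, freezes the bridge and affects all three partitions.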
When an error occurs, a service call is made which indicates each field replaceable unit (FRU) that must be replaced in order to clear the error. When the PHB enters the freeze mode as a result of an adapter that does not support EEH, the service call calls out each slot, any device coupled to each slot, and the system planar. Thus, every slot, every device, and the system planar are called out even though only one I/O slot may have generated the error. Replacing all of these hardware components is much more expensive than replacing only the faulty I/O adapter.
Therefore, a need exists for a method, system, and product for improving isolation of I/O errors in logically partitioned data processing systems by identifying only occupied slots that have adapters that do not support EEH.
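The difference between the broad callout described above and the narrower identification that is needed can be illustrated with a minimal sketch. The names Slot, broad_callout, and narrowed_candidates, along with the slot and adapter labels, are hypothetical and are assumptions made for illustration; the sketch is one possible reading of the stated need, not the claimed method.

```python
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    slot_id: str
    adapter: Optional[str]  # None when the slot is empty
    supports_eeh: bool      # meaningful only when an adapter is present


def broad_callout(slots):
    """List every slot, every attached device, and the system planar."""
    frus = []
    for s in slots:
        frus.append(s.slot_id)
        if s.adapter is not None:
            frus.append(s.adapter)
    frus.append("system planar")
    return frus


def narrowed_candidates(slots):
    """List only occupied slots whose adapters do not support EEH."""
    return [s.slot_id for s in slots
            if s.adapter is not None and not s.supports_eeh]


# Hypothetical slots under one PHB.
slots = [
    Slot("slot1", "ethernet-adapter", supports_eeh=True),
    Slot("slot2", None, supports_eeh=False),
    Slot("slot3", "scsi-adapter", supports_eeh=False),
]
```

For these hypothetical slots, the broad callout names six items, while the narrowed list names only slot3, the one occupied slot whose adapter lacks EEH support.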