1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for error analysis in a data processing system. Still more particularly, the present invention provides a method and apparatus for enhancing input/output error analysis in a hierarchical hardware sub-system in a logical partitioned data processing system.
2. Description of Related Art
Logical partitioned (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform-allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the OS image.
Each distinct OS or image of an OS running within the platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
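The disjoint allocation described above can be illustrated with a minimal sketch. The `Platform` class, its method names, and the resource and partition names are hypothetical and serve only as an illustration; they do not describe any particular firmware interface.

```python
class Platform:
    """Hypothetical model of a platform that hands out disjoint resources."""

    def __init__(self, resources):
        self.free = set(resources)  # resources not yet allocated
        self.owner = {}             # resource name -> owning partition

    def allocate(self, partition, wanted):
        """Give `wanted` to `partition`, refusing any overlap with prior grants."""
        wanted = set(wanted)
        if not wanted <= self.free:
            raise ValueError("unavailable: %s" % sorted(wanted - self.free))
        self.free -= wanted
        for resource in wanted:
            self.owner[resource] = partition

    def control(self, partition, resource):
        """Allow control only of a resource allocated to the calling partition."""
        if self.owner.get(resource) != partition:
            raise PermissionError("%s does not own %s" % (partition, resource))
        return "%s controls %s" % (partition, resource)


# Two partitions receive non-overlapping subsets of the platform's resources.
platform = Platform(["cpu0", "cpu1", "mem0", "mem1", "slot0", "slot1"])
platform.allocate("lpar1", ["cpu0", "mem0", "slot0"])
platform.allocate("lpar2", ["cpu1", "mem1", "slot1"])
```

In this sketch, a request by "lpar2" to control "slot0" raises an error, as does any attempt to allocate "slot0" a second time, which mirrors the requirement that an image cannot control resources not allocated to it.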
With respect to hardware resources in an LPAR system, these resources are disjointly shared among various partitions, themselves disjoint, each one seeming to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, memory DIMMs, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR system may be booted and shut down repeatedly without having to power-cycle the whole system.
In reality, some of the I/O devices that are disjointly shared among the partitions are themselves controlled by a common piece of hardware, such as a host Peripheral Component Interconnect (PCI) bridge, which may have many I/O adapters controlled by or located below the bridge. The host bridge and the I/O adapters connected to the bridge form a hierarchical hardware sub-system within the LPAR system. Further, this bridge may be thought of as being shared by all of the partitions to which its slots are assigned.
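The sharing relationship described above can be sketched as follows. The `HostBridge` class and the names "phb0", "slot0", and "lpar1" are hypothetical, chosen only to show how a single bridge ends up shared by every partition that owns a slot below it.

```python
from dataclasses import dataclass, field


@dataclass
class HostBridge:
    """Hypothetical host bridge with I/O adapter slots below it."""
    name: str
    slots: dict = field(default_factory=dict)  # slot name -> owning partition

    def sharing_partitions(self):
        """Return the partitions that own at least one slot below this bridge."""
        return sorted(set(self.slots.values()))


# Each slot belongs to exactly one partition, yet the bridge above is common.
shared_bridge = HostBridge(
    "phb0", {"slot0": "lpar1", "slot1": "lpar2", "slot2": "lpar1"}
)
partitions = shared_bridge.sharing_partitions()
```

Here `partitions` contains both "lpar1" and "lpar2", so a fault handled at the bridge level can touch more than one partition even though the slots themselves are disjoint.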
A host bridge contains mapping, control, and status registers. The mapping registers allow a partition or other process to see or view I/O adapters, which may be located in slots below the host bridge. When an error occurs in a device, such as an I/O adapter connected to the host bridge, the mapping registers are typically frozen for use by an analysis routine. In this type of system, the mapping registers are disabled when an error occurs in any device mapped under the host bridge. This disabling or freezing of the mapping registers often makes the diagnosis of the original problem difficult and sometimes impossible. Additionally, any type of run-time correction of transient errors is impossible with this type of system. At a minimum, this problem causes several devices to be identified as bad, a result that may be undesirable to a customer.
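The freezing behavior described above may be modeled, under simplifying assumptions, as follows. The `FreezingHostBridge` class, its methods, and the adapter names are hypothetical; in particular, the `suspects` method is only one possible reading of how an analysis routine that cannot reach the adapters might end up flagging several devices.

```python
class BridgeFrozenError(RuntimeError):
    """Raised when the mapping registers are disabled."""


class FreezingHostBridge:
    """Hypothetical bridge whose mapping registers freeze on any device error."""

    def __init__(self, slot_map):
        self.mapping = dict(slot_map)  # mapping registers: slot -> adapter
        self.frozen = False

    def report_error(self, slot):
        """An error in any one mapped device disables the whole mapping."""
        if slot not in self.mapping:
            raise KeyError(slot)
        self.frozen = True

    def read_adapter(self, slot):
        """Look up the adapter in a slot; fails for every slot once frozen."""
        if self.frozen:
            raise BridgeFrozenError("mapping registers are frozen")
        return self.mapping[slot]

    def suspects(self):
        """With no view of the adapters, every mapped slot remains a suspect."""
        return sorted(self.mapping) if self.frozen else []


io_bridge = FreezingHostBridge({"slot0": "ethernet", "slot1": "scsi"})
before = io_bridge.read_adapter("slot1")  # visible while the bridge is healthy
io_bridge.report_error("slot0")           # fault in one adapter only
flagged = io_bridge.suspects()
```

After the error in "slot0", a read of the unaffected "slot1" also fails and `flagged` lists both slots, which corresponds to the situation in which several devices are identified as bad.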
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for enhancing I/O error analysis of hierarchical hardware sub-systems.