1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for processing errors. Still more particularly, the present invention provides a method and apparatus for processing input/output errors in a logical partitioned data processing system.
2. Description of Related Art
Logical partitioned (LPAR) functionality within a data processing system or platform allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the OS image.
Each distinct OS or image of an OS running within the platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
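By way of illustration only, the disjoint allocation described above may be sketched as follows. The names `Platform`, `Partition`, `allocate`, and `may_control`, and the resource labels, are hypothetical and are not taken from any actual platform firmware interface; the sketch merely assumes that firmware tracks a single owner per resource.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical model of one logical partition and the resources it owns."""
    name: str
    resources: set = field(default_factory=set)


class Platform:
    """Toy stand-in for platform firmware that hands out resources (assumed model)."""

    def __init__(self, resources):
        self.free = set(resources)  # resources not yet owned by any partition
        self.owner = {}             # resource -> name of the owning partition

    def allocate(self, partition, wanted):
        """Give `wanted` to `partition`, refusing anything already owned."""
        wanted = set(wanted)
        unavailable = wanted - self.free
        if unavailable:
            raise ValueError(f"not available: {sorted(unavailable)}")
        self.free -= wanted
        partition.resources |= wanted
        for resource in wanted:
            self.owner[resource] = partition.name

    def may_control(self, partition, resource):
        """A partition may control only a resource allocated to it."""
        return self.owner.get(resource) == partition.name


platform = Platform({"cpu0", "cpu1", "mem0", "slot0"})
lpar1, lpar2 = Partition("LPAR1"), Partition("LPAR2")
platform.allocate(lpar1, {"cpu0", "slot0"})
platform.allocate(lpar2, {"cpu1", "mem0"})
```

Because a resource leaves the free pool as soon as it is allocated, a second request for `slot0` raises an error, so the two resource sets in this sketch stay disjoint.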
With respect to hardware resources in an LPAR system, these resources are disjointly shared among various partitions, themselves disjoint, each one seeming to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, memory DIMMs, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR system may be booted and shut down repeatedly without having to power-cycle the whole system.
In reality, some of the I/O devices that are disjointly shared among the partitions are themselves controlled by a common piece of hardware, such as a host Peripheral Component Interconnect (PCI) bridge, which may have many I/O adapters controlled by or located below the bridge. The host bridge and the I/O adapters connected to the bridge form a hierarchical hardware sub-system within the LPAR system. Further, this bridge may be thought of as being shared by all of the partitions to which its slots are assigned. One or more of these host bridges are in turn connected to an I/O bridge, which is used by the processors to access the different I/O sub-systems.
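The sharing relationship described above may be sketched, under stated assumptions, as a simple mapping from slots to owning partitions. The bridge names, slot names, and partition names below are hypothetical, as is the function `partitions_sharing_bridge`; the sketch only shows how a single host bridge comes to be shared by several partitions through its slots.

```python
from collections import defaultdict

# Hypothetical topology: (host bridge, slot) -> partition that owns the slot.
SLOT_OWNER = {
    ("PHB0", "slot0"): "LPAR1",
    ("PHB0", "slot1"): "LPAR2",
    ("PHB1", "slot0"): "LPAR3",
}


def partitions_sharing_bridge(slot_owner):
    """Group partitions by the host bridge that sits above their slots."""
    shared = defaultdict(set)
    for (bridge, _slot), partition in slot_owner.items():
        shared[bridge].add(partition)
    return dict(shared)


sharing = partitions_sharing_bridge(SLOT_OWNER)
```

In this sketch `PHB0` is shared by `LPAR1` and `LPAR2`, so an error at that bridge could concern both partitions even though each slot has exactly one owner.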
Presently, when errors occur, timing windows are present during which operations such as identifying an error, clearing registers, and analyzing errors take place. The existence of these timing windows may allow secondary hardware I/O errors to be reported as a generic unrecoverable error and may allow primary I/O errors that fall within a critical timing window to go unreported. As a result of these undetected or misdiagnosed errors, the I/O hardware that caused the errors goes unmarked or unidentified. In a subsequent cycle, this hardware may be accessed again, which may lead to a system crash.
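One way such a timing window can lose an error may be sketched as follows. This is a toy model only: the class `ErrorRegister`, the function `handle_errors`, and the adapter names are hypothetical, and the sketch assumes a read-then-clear handling sequence that is not detailed in the description above.

```python
class ErrorRegister:
    """Toy error-status register that records which sources reported an error."""

    def __init__(self):
        self.bits = set()

    def latch(self, source):
        """Hardware records an error from `source`."""
        self.bits.add(source)

    def read(self):
        """Return a snapshot of the currently latched errors."""
        return set(self.bits)

    def clear(self):
        """Reset the register, discarding everything latched so far."""
        self.bits.clear()


def handle_errors(register, error_during_window=None):
    """Read the register, then clear it (assumed handling sequence).

    An error latched between the read and the clear is wiped without
    ever being seen by the handler.
    """
    seen = register.read()
    if error_during_window is not None:
        register.latch(error_during_window)  # arrives inside the window
    register.clear()
    return seen


register = ErrorRegister()
register.latch("adapter_A")
reported = handle_errors(register, error_during_window="adapter_B")
```

Here the handler reports only `adapter_A`. The error from `adapter_B` is cleared unseen, so that adapter is never marked and may be accessed again later.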
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for eliminating timing windows that allow undetected errors in an I/O sub-system.