1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for handling errors in a data processing system. Still more particularly, the present invention provides a method and apparatus for handling errors in a multiprocessor computer system, and in particular a logically-partitioned computer system.
2. Description of Related Art
Logical partitioning (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be simultaneously run on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and I/O adapter bus slots. The partition's resources are represented by the platform's firmware to the OS image.
The distinct operating systems or images of an OS running within the platform are protected from one another such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
Hardware resources in an LPAR system are shared among the various partitions in a mutually exclusive fashion. That is, a single resource may be allocated to only one partition at any one time, but any given resource may be allocated to any one of the partitions. This results in each partition behaving as if it were a stand-alone computer. Among the resources that may be shared are input/output (I/O) adapters, random-access memory (RAM), non-volatile random-access memory (NVRAM), and hard disk drives, although this list is by no means exhaustive. Each partition within the LPAR system may be booted and shut down repeatedly without having to cycle the power to the whole system.
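The mutually exclusive allocation described above can be illustrated with a minimal sketch. It is not taken from any actual platform firmware; the class `ResourceAllocator`, its methods, and the resource and partition names are all hypothetical and chosen only for illustration.

```python
class ResourceAllocator:
    """Hypothetical bookkeeping: each resource has at most one owning partition."""

    def __init__(self, resources):
        # None means the resource is currently unallocated.
        self.owner = {name: None for name in resources}

    def allocate(self, resource, partition):
        # Refuse the request if a different partition already owns the resource.
        current = self.owner[resource]
        if current is not None and current != partition:
            raise RuntimeError(f"{resource} is already owned by {current}")
        self.owner[resource] = partition

    def release(self, resource, partition):
        # Only the owning partition may give a resource back.
        if self.owner[resource] != partition:
            raise RuntimeError(f"{partition} does not own {resource}")
        self.owner[resource] = None

    def resources_of(self, partition):
        # Sorted list of the resources currently held by one partition.
        return sorted(r for r, p in self.owner.items() if p == partition)


# Illustrative names only (assumptions, not from the source).
allocator = ResourceAllocator(["io_slot_0", "io_slot_1", "ram_region_0"])
allocator.allocate("io_slot_0", "partition_A")
allocator.allocate("io_slot_1", "partition_B")
```

In this sketch, a second partition that requests `io_slot_0` is refused until `partition_A` releases it, after which the slot may be allocated to any partition.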
Groups of I/O devices may be controlled by a common piece of hardware, such as a host Peripheral Component Interconnect (PCI) bridge, which may have many I/O adapters controlled by or located below the bridge. This bridge may be thought of as being shared by all of the partitions that are assigned its slots. Hence, if the bridge becomes inoperable, it affects all of the partitions that share the devices below the bridge. Indeed, the problem may be so severe that the entire LPAR system will crash if any partition attempts to use the bridge further. The normal course of action in this circumstance is to terminate the running partitions that share the bridge, which keeps the system from crashing due to this failure.
What usually occurs is an I/O adapter failure that causes the bridge to assume a non-usable (error) state. At the time of occurrence, the I/O failure invokes a machine check interrupt handler (MCIH), which, in turn, will report the error and then terminate the appropriate partitions. This process is a “normal” solution that prevents the whole LPAR system from crashing due to this problem.
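The conventional handling just described (report the error, then terminate the partitions that share the failed bridge) may be sketched as follows. This is a simplified model under stated assumptions, not an actual machine check interrupt handler: the function names, the bridge and slot identifiers, and the use of an in-memory list as an error log are all hypothetical.

```python
def partitions_sharing_bridge(slots, slot_owner):
    """Return the set of partitions that own at least one slot below a bridge."""
    return {slot_owner[s] for s in slots if slot_owner.get(s) is not None}


def handle_bridge_failure(bridge, bridge_slots, slot_owner, running, log):
    """Model of the conventional response to a bridge entering an error state.

    bridge_slots: maps a bridge name to the slots below it (assumed layout).
    slot_owner:   maps a slot name to its owning partition, or None.
    running:      set of running partitions; modified in place.
    log:          list used here as a stand-in for an error log.
    """
    affected = partitions_sharing_bridge(bridge_slots[bridge], slot_owner)
    # Report the error first.
    log.append(f"bridge {bridge} entered error state")
    # Then terminate only the partitions that share the failed bridge.
    for partition in sorted(affected):
        if partition in running:
            running.discard(partition)
            log.append(f"terminated {partition}")
    return affected


# Illustrative configuration (assumed, not from the source).
bridge_slots = {"phb0": ["slot0", "slot1"], "phb1": ["slot2"]}
slot_owner = {"slot0": "partition_A", "slot1": "partition_B", "slot2": "partition_C"}
running = {"partition_A", "partition_B", "partition_C"}
log = []

affected = handle_bridge_failure("phb0", bridge_slots, slot_owner, running, log)
```

In this model, a failure of `phb0` terminates `partition_A` and `partition_B` while `partition_C`, whose slot sits below a different bridge, keeps running.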
Certain resources in an LPAR system, however, may be shared among all of the partitions. For instance, some LPAR systems include an area of “scratchpad” memory that is shared among all partitions. If a bus failure or adapter failure occurs on the bus to which the scratchpad is connected, the whole system will be brought down, since the affected scratchpad area is shared among all of the partitions. Thus, it would be desirable if there were a way to address a fault on such a critical datapath without bringing the entire system down.