1. Technical Field
The present invention relates generally to the field of computer architecture and, more specifically, to methods and systems for managing machine check interrupts during runtime.
2. Description of Related Art
A logical partitioning option (LPAR) within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be simultaneously run on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping sub-set-of the platform""s resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition""s resources are represented by its own open firmware device tree to the OS image.
Each distinct OS or image of an OS running within the platform is protected from each other such that software errors on one logical partition can not affect the correct operation of any of the other partitions. This is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images can not control any resources that have not been allocated to it. Furthermore, software errors in the control of an OS""s allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
Currently, in both LPAR systems and non-partitioned systems, when a machine check occurs due to a bad I/O adapter in the system, data about the condition causing the machine check is presented to the operating system in the form of an error log entry. The operating system then performs a complete shutdown of the system. The user must then replace the bad I/O adapter and then reboot the system. Such a requirement may not be terribly problematic for users with a simple configuration in which a reboot is relatively quick or for users in which having the system available at all times is not critical. However, for other users with complex configurations, such as, for example, multiple racks of serial storage architecture (SSA) or networked systems, a considerable amount of time will be spent rebooting the system just to replace one bad I/O adapter. Such expenditure of time may be very costly for those users. For example, if the system is a web server critical for taking internet sales orders for products, such as, for example, books or compact disks (CDs), each minute of time that the system is shut down to replace a bad I/O adapter may result in many thousands of dollars in lost sales.
Therefore, a method and system for replacing bad I/O adapters without the need for powering down or rebooting the system would be desirable.
The present invention provides a method, system, and apparatus for managing a failed input/output adapter within a data processing system. In one embodiment, an operating system handler receives an indication that one of a plurality of input/output adapters has failed. The operating system handler consults an error log to determine which input/output adapter has failed. Once the bad input/output adapter has been determined, the operating system handler disables the bad input/output adapter and deallocates any processes bound for the bad input/output adapter without powering down the data processing system. A user is then notified of the bad input/output adapter so that the bad input/output adapter can be replaced. The input/output adapter may be replaced without powering down the data processing system. Once the bad input/output adapter has been replaced, the new input/output adapter is enabled.