1. Technical Field
The present invention is directed to data processing systems. More specifically, the present invention is directed to a method, apparatus, and computer program product for coordinating error reporting and reset in an I/O adapter that supports virtualization.
2. Description of Related Art
Large symmetric multi-processor data processing systems, such as the IBM eServer P690, available from International Business Machines Corporation, the HP 9000 Superdome Enterprise Server, available from Hewlett-Packard Company, and the Sun Fire 15K server, available from Sun Microsystems, Inc., may be partitioned and used as multiple smaller systems. These systems are often referred to as logically partitioned (LPAR) data processing systems. Logical partition functionality within a data processing system allows multiple copies of a single operating system or multiple heterogeneous operating systems to be simultaneously run on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's physical resources. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the operating system image.
Each distinct operating system or image of an operating system running within a platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each operating system image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to that image. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the operating system or each different operating system directly controls a distinct set of allocable resources within the platform.
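The disjointness requirement described above can be illustrated with a minimal sketch. The function name `find_overlaps`, the partition names, and the resource labels are hypothetical and chosen only for illustration; they do not correspond to any particular platform's firmware interface.

```python
from itertools import combinations


def find_overlaps(allocations):
    """Return {(partition_a, partition_b): shared_resources} for every pair
    of partitions whose resource sets are not disjoint."""
    overlaps = {}
    for a, b in combinations(sorted(allocations), 2):
        shared = allocations[a] & allocations[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical allocation in which every resource belongs to one partition.
valid = {
    "LPAR1": {"cpu0", "cpu1", "mem0", "slot0"},
    "LPAR2": {"cpu2", "mem1", "slot1"},
}

# Hypothetical allocation that violates disjointness: slot0 appears twice.
invalid = {
    "LPAR1": {"cpu0", "mem0", "slot0"},
    "LPAR2": {"cpu1", "mem1", "slot0"},
}

valid_overlaps = find_overlaps(valid)
invalid_overlaps = find_overlaps(invalid)
```

Under these assumptions, an empty result for `valid` means that each image controls a distinct set of resources, while the entry reported for `invalid` identifies the I/O slot that two partitions would both try to manage.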
With respect to hardware resources in a logically partitioned data processing system, these resources are disjointly shared among various partitions. These resources may include, for example, input/output (I/O) adapters, memory modules, non-volatile random access memory (NVRAM), and hard disk drives. Each partition within an LPAR data processing system may be booted and shut down over and over without having to power-cycle the entire data processing system.
Some known systems include firmware, also called a hypervisor, that manages and enforces the logical partitioning of the hardware. For example, a hypervisor may receive a request from the system to dispatch a virtual processor to a physical processor. The virtual processor includes a definition of the work to be done by a physical processor as well as various settings and state information that are required to be set within the physical processor in order for the physical processor to execute the virtual processor's work.
The various hardware devices, such as physical I/O adapters, can also be virtualized and thus shared among different logical partitions. When a hardware device is virtualized, it is logically divided into subdivisions. Each subdivision is considered to be a virtual version of the entire physical device.
For example, a particular physical I/O adapter may be virtualized into many different virtual I/O adapters. Each virtual I/O adapter may be assigned to and then used by a different logical partition. Each virtual I/O adapter is presented to a logical partition as if that virtual I/O adapter were the entire physical I/O adapter. In this manner, the virtual device is a logical substitute for the corresponding physical device.
Each logical partition will include its own device driver that is responsible for controlling its particular virtual I/O adapter. When a physical I/O adapter experiences a hardware error, the state of the physical I/O adapter may be different from the state that is expected by the device drivers that access a virtual I/O adapter that represents this particular physical I/O adapter. This difference between the state of the physical I/O adapter and the state of its virtual I/O adapters could be propagated throughout the system, resulting in errors in the system. Therefore, the hardware platform must prevent the propagation of errors that arise from this difference in the state of the virtual I/O adapters and their underlying physical I/O adapter.
One method for preventing the propagation of such errors is to “machine check”, also called “check-stop”, each partition that uses a virtual I/O adapter that is based on this physical I/O adapter. The problem with this method is that when the machine check occurs in a logical partition, it terminates processing in that partition, which usually causes a loss of all data that is in flight at the time of the machine check.
Another method is for the I/O bus interface to initiate a “freeze mode”. When an I/O bus interface is in freeze mode, all physical I/O adapters that are coupled to that I/O bus interface are also in freeze mode. Any stores to a physical I/O adapter that is in freeze mode are discarded. Any loads from a physical I/O adapter that is in freeze mode will result in the return of a special code that indicates freeze mode instead of the expected data. Thus, if a device driver requests data from a virtual I/O adapter that represents a particular physical I/O adapter that is in freeze mode, the special code is returned to the device driver instead of data. This special code may be any predetermined value but is typically a bit combination of all logical ones.
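The load and store behavior of freeze mode can be sketched as a toy model. The class `PhysicalAdapter`, the register offsets, and the 32-bit register width are assumptions made for illustration; the text above says only that the special code is typically all logical ones.

```python
FREEZE_CODE = 0xFFFFFFFF  # assumed 32-bit width; every bit set to one


class PhysicalAdapter:
    """Toy model of a physical I/O adapter behind a freezable bus interface."""

    def __init__(self):
        self.registers = {}
        self.frozen = False

    def store(self, offset, value):
        if self.frozen:
            return  # stores are discarded while in freeze mode
        self.registers[offset] = value & 0xFFFFFFFF

    def load(self, offset):
        if self.frozen:
            return FREEZE_CODE  # special code instead of the expected data
        return self.registers.get(offset, 0)


adapter = PhysicalAdapter()
adapter.store(0x10, 0x1234)
before = adapter.load(0x10)   # normal operation: the stored value

adapter.frozen = True
adapter.store(0x10, 0x5678)   # discarded
during = adapter.load(0x10)   # the special code

# Clearing the flag here only lets the model be inspected; it is not
# meant to represent an actual recovery procedure.
adapter.frozen = False
after = adapter.load(0x10)    # still the value stored before the freeze
```

In this sketch a register that legitimately holds all ones would be indistinguishable from the freeze code, so a driver that reads the code has grounds to suspect freeze mode rather than proof of it.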
Eventually, one or more device drivers will request a load from their virtual I/O adapter that is based on the physical I/O adapter that is in freeze mode, receive the special code, and then suspect that the physical I/O adapter is in freeze mode. A problem arises, however, because at this time not all of the partitions necessarily suspect that the underlying physical I/O adapter is in freeze mode. In fact, some of the partitions may suspect the underlying physical I/O adapter is in freeze mode while others may be actively attempting to store data to their virtual I/O adapters and thus to that physical I/O adapter.
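The uneven discovery described above can be sketched as follows. The classes `FrozenAdapter` and `DeviceDriver`, the partition names, and the 32-bit special code are hypothetical; the sketch assumes one driver per partition, all sharing a single physical I/O adapter that is already in freeze mode.

```python
FREEZE_CODE = 0xFFFFFFFF  # assumed 32-bit special code of all ones


class FrozenAdapter:
    """Physical adapter that is already in freeze mode."""

    def __init__(self):
        self.discarded_stores = 0

    def load(self):
        return FREEZE_CODE

    def store(self, value):
        self.discarded_stores += 1  # the data never reaches the adapter


class DeviceDriver:
    """Per-partition driver for a virtual adapter on the shared device."""

    def __init__(self, partition, adapter):
        self.partition = partition
        self.adapter = adapter
        self.suspects_freeze = False

    def read(self):
        value = self.adapter.load()
        if value == FREEZE_CODE:
            self.suspects_freeze = True  # only a load reveals the condition
        return value

    def write(self, value):
        self.adapter.store(value)  # returns normally; nothing is learned


adapter = FrozenAdapter()
drivers = [DeviceDriver(name, adapter) for name in ("LPAR1", "LPAR2", "LPAR3")]

drivers[0].read()        # LPAR1 loads and receives the special code
drivers[1].write(0xAB)   # LPAR2 keeps storing
drivers[2].write(0xCD)   # LPAR3 keeps storing

suspecting = [d.partition for d in drivers if d.suspects_freeze]
unaware = [d.partition for d in drivers if not d.suspects_freeze]
```

Here only the partition that issued a load suspects freeze mode, while the other two continue to issue stores that are silently dropped.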
The prior art provides no method for the device drivers that use virtual I/O adapters that are based on an underlying physical I/O adapter that is in freeze mode to coordinate a recovery from the freeze mode state across logical partition boundaries. Coordination problems may arise because not all of the partitions know about the freeze mode condition or because one of the partitions may not properly execute its role in the recovery process.
Therefore, a need exists for a method, apparatus, and computer program product for coordinating error reporting and reset in an I/O adapter that supports virtualization.