1. Technical Field
The present invention is directed to an improved data processing system. More specifically, the present invention is directed to a logical partition management apparatus and method for handling system reset interrupts (SRIs).
2. Description of Related Art
Logical partitioning (LPAR) is a system structure which allows a symmetric multiprocessor (SMP) system to be subdivided into “partitions,” each of which contains the necessary processor, memory, and input/output (I/O) resources to run an operating system (OS) image. LPAR provides easy redeployment of computing resources to support changing workloads without the need for physical restructuring, flexible growth to accommodate increased workloads, and large, scalable single-system-image enterprise systems.
Because LPAR breaks the traditional model of one operating system running on one hardware platform, LPAR generates the need for a set of platform management functions that operate outside the scope of any single operating system image. This need has been met by the introduction of a set of platform management functions implemented in firmware.
These platform management functions have been implemented in a firmware hypervisor. The hypervisor is a firmware resident application, or set of applications, that manages virtual machines and logical partitions. The hypervisor is responsible for many aspects of partition management including allocating resources to a partition, installing an operating system in a partition, starting and stopping the operating system in a partition, dumping main storage of a partition, communicating between partitions, and other partition management functions.
In logically partitioned computing systems, a partition processor normally makes many hypervisor calls for services. The hypervisor implements many software locks to enforce mutually exclusive access when updating the hypervisor data structures used to maintain the partition to which the processor belongs, and when using hardware resources shared among all partitions in the system.
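By way of illustration only, the lock-protected update described above may be sketched in C as follows. The structure partition_state, its field mem_pages, and the call h_add_pages are hypothetical names chosen for this sketch and are not taken from any actual hypervisor; the lock is a simple C11 atomic_flag spin lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical record the hypervisor keeps for one partition. */
typedef struct {
    atomic_flag lock;      /* software lock guarding the fields below */
    unsigned    mem_pages; /* pages currently assigned to the partition */
} partition_state;

/* Spin until the lock is obtained. */
static void lock_acquire(atomic_flag *lock) {
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ; /* another processor holds the lock; keep trying */
}

/* Release the lock so that other processors may obtain it. */
static void lock_release(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Hypothetical hypervisor call: assign additional pages to a partition.
 * The update is made only while the lock is held. */
static unsigned h_add_pages(partition_state *p, unsigned pages) {
    assert(p != NULL);
    lock_acquire(&p->lock);
    p->mem_pages += pages;
    unsigned total = p->mem_pages;
    lock_release(&p->lock);
    return total;
}
```

In this sketch, correctness depends on every path through h_add_pages reaching lock_release.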
During any hypervisor call, an asynchronous hardware system reset interrupt (SRI) may occur. An SRI is similar to a virtual pressing of the “reset” button on a computer. That is, rather than actually pressing the reset button and thereby sending a reset signal to all processors of the entire computing system, a virtual reset button is provided for each partition. In this way, a system reset interrupt may be generated for a single partition so that only that partition is reset. Thus, the partition may be rebooted without having to reboot other partitions in the computing system.
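The per-partition scope of such a reset may be sketched as follows. This is a simplified model under stated assumptions: the structure cpu_state and the function deliver_partition_sri are hypothetical, and the sketch assumes that the interrupt is signalled to every processor assigned to the target partition and to no others.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-processor record. */
typedef struct {
    int  partition_id; /* partition to which this processor is assigned */
    bool sri_pending;  /* set when a system reset interrupt is signalled */
} cpu_state;

/* Signal an SRI to the processors of one partition only.
 * Returns the number of processors that were signalled. */
static size_t deliver_partition_sri(cpu_state *cpus, size_t n, int partition_id) {
    assert(cpus != NULL);
    size_t signalled = 0;
    for (size_t i = 0; i < n; i++) {
        if (cpus[i].partition_id == partition_id) {
            cpus[i].sri_pending = true;
            signalled++;
        }
    }
    return signalled;
}
```

Processors belonging to other partitions are left untouched, which corresponds to rebooting one partition without rebooting the others.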
If an SRI occurs during a hypervisor call, the SRI will disrupt and end the hypervisor call. Moreover, any software locks obtained by the partition processor executing the hypervisor call will be held indefinitely and become dead locks. That is, even though the partition has received the SRI, the hypervisor data structures will still indicate that the partition processor has a lock on a shared resource if the hypervisor call is prematurely ended. Since the hypervisor call cannot be completed, the lock will never be released. This causes a problem in that other processors in the multiprocessor system will not be able to obtain access to the system resources locked by the partition processor. These other processors will become “starved” by continuing to try to obtain a lock on the system resources, i.e., spinning on the lock, and never being able to perform the necessary work requiring the lock on the system resource. Additionally, if the hypervisor call is prematurely ended while updating important data structures, the integrity of these data structures may be compromised, which may jeopardize the normal operation of the hypervisor.
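The failure described above may be illustrated with a small sequential model. The sketch below is not an actual hypervisor implementation: the names hv_lock, hv_call, and hv_spin_lock_bounded are hypothetical, the SRI is modeled as a flag that ends the call before the release step, and the second processor's spin is bounded only so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical lock guarding a shared hypervisor data structure. */
typedef struct {
    int owner; /* id of the processor holding the lock, or NO_OWNER */
} hv_lock;

/* Make one attempt to obtain the lock; returns true on success. */
static bool hv_try_lock(hv_lock *lock, int cpu) {
    if (lock->owner != NO_OWNER)
        return false;
    lock->owner = cpu;
    return true;
}

/* Release a lock held by the calling processor. */
static void hv_unlock(hv_lock *lock, int cpu) {
    assert(lock->owner == cpu);
    lock->owner = NO_OWNER;
}

/* Modeled hypervisor call. When sri_arrives is true the call is ended
 * before the release step, as if an SRI had disrupted it.
 * Returns true only if the call ran to completion. */
static bool hv_call(hv_lock *lock, int cpu, bool sri_arrives) {
    if (!hv_try_lock(lock, cpu))
        return false;
    /* ... update of shared hypervisor data structures would occur here ... */
    if (sri_arrives)
        return false; /* call ended prematurely; the lock stays held */
    hv_unlock(lock, cpu);
    return true;
}

/* Another processor spinning on the lock, bounded for the example.
 * Returns true if the lock was obtained within max_attempts. */
static bool hv_spin_lock_bounded(hv_lock *lock, int cpu, int max_attempts) {
    for (int i = 0; i < max_attempts; i++) {
        if (hv_try_lock(lock, cpu))
            return true;
    }
    return false;
}
```

In this model, once hv_call is ended with sri_arrives set, the owner field continues to name the interrupted processor, and every later attempt by another processor fails regardless of how many times it retries.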
Therefore, it would be beneficial to have a mechanism for avoiding dead locks due to the occurrence of an SRI during a hypervisor call.