1. Technical Field
The present invention relates in general to the field of data processing systems, and in particular to an improved system and method for managing processes in a data processing system.
2. Description of the Related Art
Logical partitioned (LPAR) functionality within a data processing system allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be simultaneously run on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the operating system image.
Each distinct operating system or operating system image running within the platform is protected from each other distinct operating system or operating system image such that software errors in one logical partition cannot affect the correct operation of any of the other partitions. The protection is provided by allocating a disjoint set of platform resources to be directly managed by each operating system image and by providing mechanisms for ensuring that a given operating system image cannot control any resources that have not been allocated to that given operating system image. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each operating system image (or each different operating system) directly controls a distinct set of allocable resources within the platform.
The hardware resources in an LPAR data processing system are disjointly shared among the various partitions, which are themselves disjoint, and each partition appears to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, dual-inline memory modules (DIMMs), non-volatile random access memory (NVRAM), and hard disk drives. Each partition within the LPAR data processing system may be booted and shut down without having to power-cycle the whole system.
In an LPAR data processing system, the different partitions include partition firmware, which is used in conjunction with the operating systems in the partitions. As is well known in the art, LPAR data processing systems also enable the partition firmware to run threads simultaneously. The partition firmware can perform tasks that often require extended execution times without causing interrupt and OS timer problems. When a task is requested by the OS, the firmware first runs a small layer of partition firmware code. The partition firmware code issues a call/event to a hypervisor to perform the requested task. The hypervisor, which is also known as a “virtual machine monitor”, enables multiple operating systems to run simultaneously on a data processing system by acting as an arbitrator between the multiple partitions. After the event has been requested, the partition firmware code returns to the OS with a status of “BUSY”. Because of the “BUSY” status, the OS recognizes that the firmware has not finished collecting the requested data, and the OS queries the firmware again.
The OS continues to query the partition firmware until the hypervisor has completed the asynchronous event (also referred to herein as a “hypervisor task”). Once the task is complete, the hypervisor places the requested data into the partition firmware's memory region and returns control to the partition firmware code for further data refinement.
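By way of illustration only, the request-and-query exchange described above may be sketched as follows. The sketch is a simplified simulation under stated assumptions, not an actual firmware or hypervisor interface: the names `HypervisorTask`, `PartitionFirmware`, `os_fetch`, and the “SUCCESS” status are hypothetical, and the hypervisor task is assumed to advance by one unit of work per query so that the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BUSY = "BUSY"
SUCCESS = "SUCCESS"  # hypothetical completion status, not named in the text


@dataclass
class HypervisorTask:
    """Hypothetical stand-in for an asynchronous hypervisor task."""
    steps_remaining: int
    result: Optional[bytes] = None

    def advance(self) -> None:
        # Simulate one unit of asynchronous work (assumption: one per query).
        if self.steps_remaining > 0:
            self.steps_remaining -= 1
        if self.steps_remaining == 0 and self.result is None:
            self.result = b"requested-data"


@dataclass
class PartitionFirmware:
    """Hypothetical thin layer of partition firmware code."""
    task: Optional[HypervisorTask] = None
    memory_region: Optional[bytes] = None

    def request(self, work_units: int) -> str:
        # Issue the call/event to the hypervisor, then return BUSY to the OS.
        self.task = HypervisorTask(steps_remaining=work_units)
        return BUSY

    def query(self) -> Tuple[str, Optional[str]]:
        assert self.task is not None, "request() must be called first"
        self.task.advance()
        if self.task.result is None:
            return BUSY, None
        # The hypervisor has placed the data in the firmware's memory region;
        # decoding stands in for the "further data refinement" step.
        self.memory_region = self.task.result
        return SUCCESS, self.memory_region.decode()


def os_fetch(firmware: PartitionFirmware, work_units: int) -> Tuple[Optional[str], int]:
    """OS side: keep querying for as long as the firmware reports BUSY."""
    status = firmware.request(work_units)
    data, polls = None, 0
    while status == BUSY:
        status, data = firmware.query()
        polls += 1
    return data, polls
```

In this sketch, a call such as `os_fetch(PartitionFirmware(), 3)` queries the firmware three times before the data is returned. The `while` loop has no exit other than a non-“BUSY” status, so a task that never completes would keep the OS querying indefinitely.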
Those with skill in the art will appreciate that often, the hypervisor task that was supposed to be collecting data for the OS fails in such a way that the hypervisor task is not capable of responding to the partition firmware queries. In that case, the partition firmware code continues to return a “BUSY” status to the OS while the OS continues to query the partition firmware. The constant queries result in both degraded performance of the overall system and a hung process if the hypervisor task responsible for servicing the request for data has stopped operating.
As is well known in the art, one solution to the constant query problem is to implement a timer that expires after a predetermined period of time. Once the timer expires, the OS can fail any request that has not been fulfilled. However, utilizing a timer introduces the difficulty of determining a correct period for the timer. If the period is too short, the OS can fail hypervisor tasks that are still working to retrieve the data but have not yet finished. If the period is too long, the OS can prevent new requests from initiating. Therefore, there is a need for a system and method for probing hypervisor tasks in an asynchronous environment in a data processing system.
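The difficulty of choosing the timer period may be illustrated with the following minimal sketch. It is a simulation under stated assumptions rather than a description of any particular OS: the names `make_task` and `poll_with_timeout` and the “TIMED_OUT” status are hypothetical, and time is modeled as a count of queries (“ticks”) instead of a real clock.

```python
from typing import Callable, Optional, Tuple

BUSY, SUCCESS, TIMED_OUT = "BUSY", "SUCCESS", "TIMED_OUT"


def make_task(ticks_needed: Optional[int]) -> Callable[[], str]:
    """Return a query function for a simulated hypervisor task.

    ticks_needed=None models a task that has stopped operating and
    therefore reports BUSY forever (hypothetical model).
    """
    elapsed = 0

    def query() -> str:
        nonlocal elapsed
        elapsed += 1
        if ticks_needed is not None and elapsed >= ticks_needed:
            return SUCCESS
        return BUSY

    return query


def poll_with_timeout(query: Callable[[], str], timeout_ticks: int) -> Tuple[str, int]:
    """Query until success, or fail the request once the timer expires."""
    for tick in range(1, timeout_ticks + 1):
        if query() == SUCCESS:
            return SUCCESS, tick
    # Timer expired: the OS fails the request without knowing whether the
    # task was hung or merely slow.
    return TIMED_OUT, timeout_ticks


# Short period: a healthy task that needs 50 ticks is failed after 10.
short_result = poll_with_timeout(make_task(50), timeout_ticks=10)

# Long period: the same task succeeds, but a hung task is only detected
# after the full 1000 ticks have been spent querying it.
long_result = poll_with_timeout(make_task(50), timeout_ticks=1000)
hung_result = poll_with_timeout(make_task(None), timeout_ticks=1000)
```

Under these assumptions, the short period fails a task that was still making progress, while the long period spends the entire interval on a task that had already stopped operating. The timer alone cannot tell the two cases apart.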