1. Field of the Invention
The present invention relates to computer systems sharing partitioned operating systems running on multiple processors. More particularly, the invention relates to a method and apparatus to recover from a failure occurring in a single processor without affecting the operations of the other partitions.
2. Background of the Related Art
One advantage of a large computer system is its ability to accommodate multiple users accessing the system at virtually the same time. For many reasons, a user may prefer one operating system to another. Therefore, computer systems accommodate their users by supporting diverse operating systems running at the same time. With many users accessing the computer system at the same time, the system may comprise multiple processors in order to speed up its operations.
In a computer system running two or more operating systems on a shared basis, each operating system resides in a logical partition in computer memory. A logical partition may also include one or more processors, and a processor may be shared among logical partitions. A partition manager manages the operations in any shared processor by scheduling and dispatching an operating system to a processor. The dispatching of an operating system to a processor occurs either through an interrupt request generated by the partition manager or when an operating system yields the processor upon entering an idle state.
A processor, however, may ignore an interrupt request from the partition manager by disabling its external interrupts. This may occur when an errant processor or malfunctioning operating system enters a condition where it is no longer functioning properly. For example, a processor may be repeatedly executing a step that has no solution. This error condition is generally known in the art as a looped condition.
When a processor is in a looped condition, the processor may not be able to accept commands, through interrupts, from the partition manager in the normal manner. This is because the processor may not have any of its interrupts enabled due to the error condition. Without an additional recovery mechanism in the computer system, the partition manager cannot take control of the processor.
Furthermore, without an additional mechanism in the computer system, the condition of a processor is unknown. As an illustration, a system user may be waiting for a response from an operating system; however, unknown to the user, the processor is in a looped condition.
Conventionally, a looped condition may last indefinitely or until the system user intervenes and corrects the problem in some manner. Until the system user identifies that a problem likely exists, other operating systems on the system are excluded from the use of the looped processor. In order to regain control of the looped processor, the entire computer system must be shut down and re-started. A re-start operation significantly affects other system users by generating system down time.
Therefore, there is a need for a method and apparatus to monitor the condition of a processor running multiple operating systems controlled by a partition manager. There is also a need for a method and apparatus to generate a corrective response to re-set a processor detected in a looped condition so that it may resume normal operation.
A method and apparatus are provided for monitoring the run state condition of a plurality of processors in a computer system. In one embodiment, a computer system comprises a timestamp clock to generate a timestamp value. Each of the plurality of processors first reads the timestamp clock and stores the value read in a memory location. After waiting a period greater than one timestamp clock period, each of the processors reads the timestamp clock again and stores that value in another memory location. The second timestamp value is then compared with the first timestamp value, and if it is unchanged, the processor is found to be in a looped condition. A service processor then generates an interrupt signal to re-set the looped processor.
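The comparison of two consecutive timestamp readings described in this embodiment can be illustrated with a minimal sketch. The names TimestampSlots, sample, read_clock and wait are hypothetical and do not appear in the specification; the sketch assumes the timestamp clock can be modeled as a callable returning an integer and that the waiting period exceeds one timestamp clock period.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TimestampSlots:
    """Two hypothetical memory locations holding consecutive readings."""
    first: int = 0
    second: int = 0


def is_looped(slots: TimestampSlots) -> bool:
    # An unchanged timestamp across the waiting period indicates a looped condition.
    return slots.second == slots.first


def sample(slots: TimestampSlots,
           read_clock: Callable[[], int],
           wait: Callable[[], None]) -> bool:
    """Read the clock, wait longer than one clock period, read again, compare."""
    slots.first = read_clock()    # first reading, stored in one memory location
    wait()                        # assumed to exceed one timestamp clock period
    slots.second = read_clock()   # second reading, stored in another location
    return is_looped(slots)
```

In this sketch a True result corresponds to the condition in which the service processor would generate the re-set interrupt; the interrupt itself is not modeled.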
In another embodiment, a method and apparatus are provided for monitoring the run state condition of a plurality of processors in a computer system. Illustratively, the plurality of processors comprises a plurality of multi-threaded processors running a plurality of operating systems. Each operating system is contained in a logical partition managed by a partition manager. The partition manager includes a data structure comprising a plurality of memory locations. The memory locations are used to store the respective timestamp values for each of the plurality of processors. A service processor contains a timestamp clock used to generate timestamp values and place the timestamp values in a timestamp memory location. The service processor is configured to periodically read the values contained in the timestamp memory locations for each of the processors and compare successive timestamp readings. A period between successive timestamp readings by the service processor is greater than a period between timestamp clock readings by each processor. If a timestamp is unchanged from the previous timestamp, the respective processor is found to be in a looped condition. The service processor then generates an interrupt signal to re-set the looped processor and return it to its normal operating state.
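The periodic comparison performed by the service processor in this embodiment can be sketched as follows. This is an illustrative model only: ServiceProcessorMonitor, slots, poll and the reset callback are hypothetical names, a dictionary stands in for the data structure of per-processor memory locations, and the caller is assumed to invoke poll at a period longer than the period at which healthy processors refresh their timestamps.

```python
from typing import Callable, Dict, Iterable, List


class ServiceProcessorMonitor:
    """Hypothetical model of the service processor's timestamp check."""

    def __init__(self, processor_ids: Iterable[int],
                 reset: Callable[[int], None]) -> None:
        # One timestamp memory location per processor (stand-in for the
        # partition manager's data structure).
        self.slots: Dict[int, int] = {pid: 0 for pid in processor_ids}
        # Values seen at the previous poll; empty until the first poll.
        self.previous: Dict[int, int] = {}
        # Stand-in for generating the re-set interrupt signal.
        self.reset = reset

    def poll(self) -> List[int]:
        """Compare each slot with its previous reading; re-set stalled processors."""
        looped: List[int] = []
        for pid, current in self.slots.items():
            if pid in self.previous and current == self.previous[pid]:
                looped.append(pid)   # timestamp unchanged: looped condition
                self.reset(pid)
            self.previous[pid] = current
        return looped
```

For example, if processor 0 refreshes its slot between two polls and processor 1 does not, the second call to poll reports processor 1 and invokes reset for it. The first poll only records a baseline, which is a design choice of this sketch and not something the specification states.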