1. Technical Field of the Invention
The present invention relates in general to the field of computer systems, and in particular, by way of example but not limitation, to operating system hang detection and correction within a computer system.
2. Description of Related Art
Despite advances in computer and operating system architectures, computer systems continue to be vulnerable to operating system hang conditions from time to time. The primary cause of this vulnerability is that operating system hang conditions may occur for a wide variety of reasons that are difficult to predict and even more difficult to completely avoid. For example, an operating system hang condition may occur due to insufficient system resources, incompatible use of the available resources, incompatible device drivers or errors in the operating system or application software. Furthermore, the particular configuration of the system may produce an operational state that the operating system was not originally designed to handle, or may continuously generate an event, such as a hardware or software interrupt, that the operating system cannot clear using available methods. As a result, the operating system may enter a continuous loop or an unknown operational state from which the operating system cannot recover without some form of intervention, such as a system reset.
The problems associated with operating system hang conditions are exacerbated by the fact that the user typically cannot distinguish between an operating system hang condition and an unusual processing delay. As a result, the user may experience uncertainty and/or frustration in attempting to determine whether an operating system hang condition has occurred. Inexperienced users, for example, may wait an inordinate amount of time for the computer system to respond to user input, unaware that the operating system is no longer functioning. Furthermore, in applications where the computer system is intended to continuously operate for long periods of time, such as Web servers, file servers, database servers or network servers, the failure to detect an operating system hang condition can become especially problematic. Because these computer systems typically perform tasks that are critical to an organization's business operations, system downtime caused by the failure to detect an operating system hang condition may be unacceptable.
Existing approaches for detecting operating system hang conditions have proven to be inadequate or unreliable in that they typically rely on user observation of system activity. For example, the user may attempt to determine whether an operating system hang condition has occurred by monitoring for hard drive activity, by testing for pointer and/or keyboard responsiveness or by actuating the "NumLock" key to determine if the key's associated LED changes state. These approaches, however, have limited reliability in that they rely upon the subjective judgment of the individual user, which varies significantly depending upon the user's level of experience. Furthermore, these approaches require the presence of a human operator to perform physical observations of system activity and to perform a system reset in the event an operating system hang condition is detected. As a result, these approaches may be completely inadequate for detecting operating system hang conditions in server-based applications where the physical observation of system activity by a human operator may not be a viable or cost effective option.
Therefore, in light of the deficiencies of existing approaches, there is a need for a mechanism that detects and possibly corrects operating system hang conditions in a more reliable and cost effective manner.
The deficiencies of the prior art are overcome by the method, system and apparatus of the present invention. For example, as heretofore unrecognized, it would be beneficial to detect an operating system hang condition by setting a status flag to a first value, generating an operating system interrupt intended for an operating system interrupt handler within an operating system kernel that resets the status flag to a second value, executing the operating system interrupt handler if the operating system kernel is responding to the operating system interrupt and executing a system BIOS interrupt handler that measures a time interval in which the status flag is set to the first value without being reset to the second value. If the measured time interval exceeds a predetermined threshold, an operating system hang condition is presumed to have occurred and an appropriate procedure may be called that, for example, informs the user of an operating system malfunction, automatically performs a system reset, corrects the problem causing the operating system hang condition or performs combinations thereof.
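By way of illustration only, the detection sequence described above may be modeled in software as follows. This is a minimal sketch and not an implementation of any particular embodiment: the class name `HangMonitor`, the flag constants, the method names and the use of a threshold expressed in seconds are all assumptions introduced for this example, and timestamps are passed in explicitly so that the logic can be followed without real hardware or real interrupts.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encodings of the "first value" and "second value" of the status flag.
FLAG_PENDING = 1   # set when the operating system interrupt is generated
FLAG_SERVICED = 0  # restored by the operating system interrupt handler


@dataclass
class HangMonitor:
    """Illustrative model of the status flag and the time-interval check."""
    threshold: float                       # assumed threshold, in seconds
    flag: int = FLAG_SERVICED
    pending_since: Optional[float] = None  # when the flag first took the first value

    def raise_os_interrupt(self, now: float) -> None:
        """Set the flag to the first value as the OS interrupt is generated."""
        if self.flag != FLAG_PENDING:
            # Start measuring only on the transition, so repeated interrupts
            # during a hang do not restart the interval.
            self.pending_since = now
        self.flag = FLAG_PENDING

    def os_handler(self) -> None:
        """Runs only if the kernel responds: reset the flag to the second value."""
        self.flag = FLAG_SERVICED
        self.pending_since = None

    def bios_handler(self, now: float) -> bool:
        """Return True if the flag has held the first value longer than the threshold."""
        if self.flag != FLAG_PENDING:
            return False
        return (now - self.pending_since) > self.threshold
```

In this sketch, a monitor created with `HangMonitor(threshold=5.0)` reports no hang while `os_handler()` runs after each `raise_os_interrupt(...)`. If the interrupt is raised at time 2.0 and never serviced, `bios_handler(8.0)` returns `True`, at which point a caller could notify the user or initiate a reset.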
In a first and preferred embodiment of the present invention, a timer is configured to set a status flag to a first value and generate an operating system interrupt in response to an overflow of the timer. The operating system interrupt is associated with a timer interrupt handler within an operating system kernel that functions to reset the status flag to a second value if the operating system kernel is responding to the operating system interrupt. A "watchdog" timer is also configured to periodically generate a system BIOS interrupt, where the system BIOS interrupt is associated with a watchdog timer handler. When the watchdog timer handler gains control of the processor, the watchdog timer handler increments a watchdog counter in response to the status flag having the first value and clears the watchdog counter in response to the status flag having the second value. If the watchdog counter exceeds a predetermined threshold, an operating system hang condition is presumed to have occurred and an appropriate procedure may be called that, for example, informs the user of an operating system malfunction, automatically performs a system reset, corrects the problem causing the operating system hang condition, or performs combinations thereof.
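The counter-based behavior of the watchdog timer handler may be illustrated by the following simulation. It is a sketch under stated assumptions rather than the claimed implementation: the class `WatchdogSim`, the `kernel_responding` switch used to simulate a hang, the `step` method and the one-to-one pairing of timer overflows with watchdog interrupts are hypothetical simplifications chosen for this example.

```python
FLAG_SET = 1    # hypothetical "first value", written on timer overflow
FLAG_CLEAR = 0  # hypothetical "second value", written by the kernel's handler


class WatchdogSim:
    """Simulates the timer, the kernel timer handler and the watchdog handler."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold      # assumed count of missed services tolerated
        self.flag = FLAG_CLEAR
        self.counter = 0                # the watchdog counter
        self.kernel_responding = True   # set to False to simulate a hang

    def timer_overflow(self) -> None:
        """Timer overflow: set the flag, then deliver the OS interrupt."""
        self.flag = FLAG_SET
        if self.kernel_responding:
            # Kernel timer interrupt handler resets the flag.
            self.flag = FLAG_CLEAR

    def watchdog_tick(self) -> bool:
        """Watchdog timer handler: update the counter and report a presumed hang."""
        if self.flag == FLAG_SET:
            self.counter += 1
        else:
            self.counter = 0
        return self.counter > self.threshold

    def step(self) -> bool:
        """One simulated period: a timer overflow followed by a watchdog interrupt."""
        self.timer_overflow()
        return self.watchdog_tick()
```

With `WatchdogSim(threshold=3)`, every call to `step()` returns `False` while the kernel responds. After `kernel_responding` is set to `False`, the counter reaches 1, 2 and 3 on successive steps and the fourth step returns `True`, which is where a notification, reset or corrective procedure would be called. Because a single serviced interrupt clears the counter, a long but finite processing delay is not treated as a hang in this model.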
The technical advantages of the present invention include, but are not limited to, the following exemplary technical advantages. It should be understood that particular embodiments may not involve any, much less all, of the following exemplary technical advantages.
An important technical advantage of the present invention is that it better enables a user to detect an operating system hang condition by utilizing a more reliable detection mechanism.
Another important technical advantage of the present invention is that it provides a cost effective mechanism for detecting an operating system hang condition by eliminating the need for physical observation of system activity by a human operator.
Yet another important technical advantage of the present invention is the ability to reduce uncertainty and/or frustration of a user by ensuring that the user is informed of an operating system hang condition so that the user may take appropriate action.
Yet another important technical advantage of the present invention is the ability to reduce system downtime by providing a mechanism that can correct an operating system hang condition and/or automatically perform a system reset in response to detection of an operating system hang condition.
The above-described and other features of the present invention are explained in detail hereinafter with reference to the illustrative examples shown in the accompanying drawings. Those skilled in the art will appreciate that the described embodiments are provided for purposes of illustration and understanding and that numerous equivalent embodiments are contemplated herein.