This invention relates to reliable computer systems, and more particularly to monitoring computer system objects to improve system reliability.
Computer technology is continually advancing, continually providing new and expanded uses for computers. As such uses continue to grow and expand, the importance of computers and people's reliance on their continued operation similarly grows. Currently, typical computer systems are "mostly reliable". That is, most of the time computer systems operate as they are intended to. However, occasionally a computer system will "crash": an application terminates abnormally, the entire computer system "freezes up" and will not respond to user input, etc. Such system crashes are typically resolved by the user either restarting the application that terminated abnormally, or alternatively by rebooting the entire system. While such system crashes can be annoying, the fact that the system is operating correctly most of the time is usually adequate for most computer systems, such as desktop computer systems.
However, in some settings or situations users expect a higher degree of system reliability, such that "mostly reliable" is insufficient. An example of such a system is a "vehicle computer", which provides more conventional "desktop computer" functionality to vehicle operators and occupants. Vehicle operators typically expect the same level of reliability from vehicle computers as they do from the other electronic systems in their vehicles (e.g., audio systems), which is virtually 100% reliability. However, typical computer systems are not able to provide such higher levels of reliability.
An additional problem that computer systems can face is that of diagnostics. In some settings (e.g., in vehicles) it is very difficult to diagnose system problems at the time the problem occurs because there are no diagnostic or debugging connections to the system. Without the ability to diagnose problems with the system when the problems occur, it is more difficult (e.g., for designers and service technicians) to determine what caused the problems and how to avoid them in the future.
The invention described below addresses these disadvantages, providing an improved way to monitor computer system objects to improve system reliability.
The invention concerns a computer system executing multiple objects (e.g., processes, threads, DLLs, etc.). The invention provides a way to improve the overall reliability of the computer system by carrying out various monitoring functions and taking various actions when problems are detected.
According to one aspect of the invention, objects can register with a critical process monitor for various types of monitoring. As part of the registration process the object provides the type of monitoring it would like the monitor to perform in order to detect a failure of the object. The object also provides a recovery action that should be taken in the event the monitor detects a failure of the object. Additionally, a callback function can be provided that is used by the monitor to inform the object that recovery is about to occur and give the object a chance to decline the recovery action. One such type of monitoring is a "notification" type, in which the object continues to send notification messages to the monitor within a specified time interval. If the monitor does not receive a notification message within the specified time interval, then it determines that the object has failed. Another type of monitoring is a "watch" type, in which the monitor repeatedly checks whether the object is still executing. If the monitor detects that the object is no longer executing, then it determines that the object has failed.
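The registration and failure-detection flow described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the names `Monitor`, `Registration`, `register`, `notify`, and `check` are hypothetical, time is passed in explicitly as a number of seconds, and the callback is assumed to return `False` when the object declines recovery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

NOTIFICATION = "notification"  # object must send messages within an interval
WATCH = "watch"                # monitor polls whether the object still executes


@dataclass
class Registration:
    """What an object supplies when it registers (hypothetical layout)."""
    kind: str
    recovery: Callable[[], None]                    # recovery action on failure
    interval: float = 0.0                           # notification type only
    is_alive: Optional[Callable[[], bool]] = None   # watch type only
    callback: Optional[Callable[[], bool]] = None   # returns False to decline
    last_seen: float = 0.0


class Monitor:
    """Illustrative stand-in for the critical process monitor."""

    def __init__(self) -> None:
        self._objects: Dict[str, Registration] = {}

    def register(self, name: str, reg: Registration, now: float) -> None:
        reg.last_seen = now
        self._objects[name] = reg

    def notify(self, name: str, now: float) -> None:
        # Notification message from a registered object.
        self._objects[name].last_seen = now

    def check(self, now: float) -> List[str]:
        """Detect failures and run recovery; return names that were recovered."""
        recovered: List[str] = []
        for name, reg in self._objects.items():
            if reg.kind == NOTIFICATION:
                failed = (now - reg.last_seen) > reg.interval
            else:
                failed = reg.is_alive is not None and not reg.is_alive()
            if not failed:
                continue
            # Inform the object first; it may decline the recovery action.
            if reg.callback is not None and not reg.callback():
                continue
            reg.recovery()
            reg.last_seen = now  # restart the interval after recovery
            recovered.append(name)
        return recovered
```

In this sketch a notification-type object is treated as failed only when the gap since its last message exceeds the registered interval, and a watch-type object is treated as failed when its liveness check returns false.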
According to another aspect of the invention, the monitor uses a "test" thread to help verify that an object has failed. If the monitor determines that the object has failed because it is not receiving notification messages within the specified time interval, the monitor checks how frequently a test thread of the monitor is being scheduled. If the test thread is not being scheduled, then the monitor assumes that the object has not failed, but rather that another process or thread is consuming a significant amount of processor time and is preventing other objects from being scheduled.
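The test-thread check can be expressed as a small decision function. The following Python sketch is illustrative only: the function name `confirm_failure`, the idea of counting test-thread runs over the notification window, and the `min_runs` threshold are assumptions, since the text does not state how scheduling frequency is measured.

```python
def confirm_failure(missed_notification: bool,
                    test_thread_runs: int,
                    min_runs: int = 1) -> bool:
    """Decide whether a missed notification indicates a real object failure.

    test_thread_runs: how many times the monitor's test thread was scheduled
    during the notification interval (assumed to be counted elsewhere).
    min_runs: hypothetical threshold below which the test thread is
    considered starved of processor time.
    """
    if not missed_notification:
        return False
    test_thread_starved = test_thread_runs < min_runs
    # If even the test thread is not being scheduled, assume another process
    # or thread is consuming the processor, not that the object has failed.
    return not test_thread_starved
```

For example, `confirm_failure(True, 0)` returns `False` because the test thread itself was never scheduled, whereas `confirm_failure(True, 10)` returns `True`.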
According to another aspect of the invention, watchdog logic is included in the computer system. The watchdog logic is programmed to reboot the computer system if the watchdog logic is not accessed regularly. The critical process monitor refreshes the watchdog logic regularly to avoid having the computer system rebooted. However, if a system problem prevents the critical process monitor from running, then the watchdog logic reboots the computer system.
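The refresh-or-reboot behavior can be simulated in a few lines of Python. This is a software sketch under stated assumptions, not the watchdog logic itself: the class name `WatchdogTimer`, the `timeout` parameter, and the explicit `now` timestamps are hypothetical, and an actual reboot is replaced by a boolean result.

```python
class WatchdogTimer:
    """Simulated watchdog: signals a reboot if not refreshed within `timeout`."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self._last_refresh = now

    def refresh(self, now: float) -> None:
        # Called regularly by the critical process monitor while it is running.
        self._last_refresh = now

    def should_reboot(self, now: float) -> bool:
        # True once the monitor has failed to refresh for longer than timeout.
        return (now - self._last_refresh) > self.timeout
```

As long as the monitor keeps calling `refresh` more often than the timeout, `should_reboot` stays false; once the monitor stops running, the elapsed time passes the timeout and the simulated watchdog reports that a reboot is due.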
According to another aspect of the invention, memory heap size for each process is monitored by the critical process monitor. If the heap of a process grows beyond a threshold size, then the monitor logs the event for subsequent diagnostic use.
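A heap-size check of this kind might look like the following Python sketch. The function name `log_oversized_heaps`, the dictionary of per-process heap sizes, the single shared threshold, and the log message format are all assumptions made for illustration; the text does not specify how heap sizes are obtained or how events are recorded.

```python
from typing import Dict, List


def log_oversized_heaps(heap_sizes: Dict[str, int],
                        threshold_bytes: int,
                        event_log: List[str]) -> List[str]:
    """Append a log entry for every process whose heap exceeds the threshold.

    heap_sizes maps a process name to its current heap size in bytes
    (assumed to be sampled by the monitor elsewhere).
    """
    for process, size in sorted(heap_sizes.items()):
        if size > threshold_bytes:
            event_log.append(
                f"heap of {process} is {size} bytes, "
                f"above threshold of {threshold_bytes} bytes"
            )
    return event_log
```

Note that the sketch only records the event for later diagnosis and takes no recovery action, matching the logging behavior described above.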
According to another aspect of the invention, an Application Programming Interface (API) provides the interface between the monitor and the objects in the computer system, allowing the objects to access the various features of the monitor.