1. Technical Field
This invention relates generally to system management of computer programs, and more particularly, to detecting error conditions such as time-outs within computer programs.
2. Description of the Related Art
A system management (SM) agent is responsible for monitoring and controlling various computer programs, performing failure recovery, and improving overall system reliability. In particular, SM agents can detect different error conditions within computer programs and computer program processes. Conventional solutions for managing system resources have incorporated a variety of mechanisms. One solution has been to identify system processes which have become idle and are no longer in use. For example, U.S. Pat. No. 6,157,928 to Sprenger et al. teaches that system resources can be managed or released by destroying particular agent processes which have been idle for a certain period of time. Other systems attempt to monitor for error conditions by monitoring the amount of time a particular process requires for execution. For example, Japanese Patent No. JP 09-179754 discloses a control mechanism for an operating system which can detect when a process has taken too long to complete. Similarly, Japanese Patent No. JP 08-263325 discloses a method of detecting a time-out condition and releasing resources in a client-server solution to prevent overload of the server.
While many management systems have focused upon the concept of monitoring system processes, such solutions fall short with respect to managing multi-threaded computer programs. For example, a process can include a plurality of individual tasks, each of which can execute within a separate thread of execution. Although conventional management systems can determine which process experienced an error, such systems offer little insight as to which task of a larger process is responsible for causing an error condition in a computer program.
One attempt at monitoring a computer program process is referred to as monitoring “heartbeats”. A heartbeat can be a simple, low-priority thread of execution that is started when a computer program starts. The heartbeat continues to execute while the computer program executes. Periodically, the heartbeat sends a message to the SM agent, informing it that the computer program is still functioning properly. Typically, the SM agent expects a message within a certain amount of time; if no message arrives within that time, the SM agent will consider the computer program to have entered a time-out or other error condition.
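By way of illustration only, the heartbeat arrangement described above can be sketched as follows. The sketch is a minimal one under stated assumptions and is not taken from any cited reference: the names `SMAgent`, `receive_heartbeat`, `timed_out`, and `start_heartbeat`, as well as the use of an in-process method call in place of a message between programs, are hypothetical.

```python
import threading
import time


class SMAgent:
    """Hypothetical SM agent that records when each program last reported in."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s      # longest allowed gap between heartbeats
        self._last_seen = {}            # program id -> time of last heartbeat
        self._lock = threading.Lock()

    def receive_heartbeat(self, program_id):
        # Stands in for the message a heartbeat sends to the SM agent.
        with self._lock:
            self._last_seen[program_id] = time.monotonic()

    def timed_out(self, program_id):
        # A program is treated as timed out if it has never reported,
        # or if its last heartbeat is older than the allowed gap.
        with self._lock:
            last = self._last_seen.get(program_id)
        if last is None:
            return True
        return (time.monotonic() - last) > self.timeout_s


def start_heartbeat(agent, program_id, interval_s, stop_event):
    """Start a background thread that reports to the agent at a fixed interval."""

    def beat():
        while not stop_event.is_set():
            agent.receive_heartbeat(program_id)
            stop_event.wait(interval_s)  # sleep, but wake early when stopped

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread
```

In such a sketch, a program would call `start_heartbeat` once at startup, and the agent would poll `timed_out` for each program it manages. The time-out value would ordinarily be chosen as some multiple of the heartbeat interval so that a single delayed message is not reported as an error.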
Heartbeat monitoring can provide an indication of when an entire computer program process has timed out or experienced an error condition. Still, this technique does not work well within the context of multi-threaded computer programs. For example, heartbeats typically execute within individual threads of execution. Accordingly, one heartbeat does not reflect the fact that another thread of execution has timed out. This can be the case despite the fact that both the heartbeat and the timed-out thread can belong to a common larger process.
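The limitation described above can be illustrated with a short sketch, offered only as an example under stated assumptions. The worker thread, the `stall` event standing in for an unresponsive dependency, and the chosen intervals are all hypothetical. The heartbeat thread continues to report while a sibling thread in the same process remains blocked.

```python
import threading
import time

last_beat = {"t": None}      # time of the most recent heartbeat
beats = {"count": 0}         # number of heartbeats issued so far
stop = threading.Event()     # tells the heartbeat thread to exit
stall = threading.Event()    # not set until cleanup: stands in for a hung dependency


def heartbeat():
    # Runs in its own thread and knows nothing about the worker's state.
    while not stop.is_set():
        last_beat["t"] = time.monotonic()
        beats["count"] += 1
        stop.wait(0.02)


def worker():
    # Simulates a task blocked on a component that never responds.
    stall.wait()


hb = threading.Thread(target=heartbeat, daemon=True)
wk = threading.Thread(target=worker, daemon=True)
hb.start()
wk.start()
time.sleep(0.2)

# The worker is still blocked, yet the heartbeat looks healthy.
worker_stalled = wk.is_alive()
heartbeat_fresh = (time.monotonic() - last_beat["t"]) < 0.1

# Cleanup: release both threads so the example exits cleanly.
stop.set()
stall.set()
hb.join()
wk.join()
```

An agent observing only the heartbeat in this sketch would see a recent report and conclude that the program is functioning properly, even though one of its tasks has made no progress.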
Monitoring of heartbeats further requires additional system resources. This overhead can be burdensome on a system, and can be particularly wasteful in the case where a computer program itself is idle, but the heartbeat continues. In such cases, although the computer program consumes little if any resources, the monitoring of the computer program's heartbeat continues to consume system resources. A similar situation arises when a portion of a computer program that is unlikely to experience error conditions is continually monitored. Oftentimes, computer programs are subject to errors or time-outs only in particular isolated or critical phases of execution. For example, during an initialization phase, some computer programs can depend upon other local or remote components to complete a separate task or process. Thus, if the local or remote component times out, the starting thread in the relying computer program can time out as well. In these situations, where an error condition can be more likely to arise, monitoring of a task or process can be beneficial. Continued monitoring of a task or process when an error condition is unlikely to occur, however, can deplete system resources and cause decreased system performance.