Aspects of the present invention are directed to methods and tools to monitor and perform root cause analysis of temporary process wait situations.
Computing systems in production environments have recently been observed to suffer intermittent performance degradations, or wait situations, whose root causes cannot be easily explained. For example, end users connected to a particular computing system may report that an application screen freezes for up to five minutes, but that this freeze happens only once or twice per week. Typically, in such a case, the system administrator would check the system logs for unusual entries, review the scheduled batch jobs, and try to map the events to a specific job. Alternatively, the system administrator may look at performance history data to see whether the degradation can be linked to an unusually high overall system load. Unfortunately, there may not be any obvious indication as to why the freeze occurs.
Analysis of problems similar to the one described above is even more difficult in complex applications in which multiple servers and processes cooperate to handle a single user request. Thus, trace tools have been provided to identify application slowdown instances based on “soft” conditions and to collect appropriate diagnostic material automatically.
While there are many trace tools available to analyze performance problems, their use may be impractical if the problems happen relatively rarely and only for short time periods. Trace tools collect a significant amount of data and add some overhead to the overall system utilization, so users may not be able to afford to activate traces over an extended time period. If, on the other hand, the trace tools are configured to operate in a wrap mode, operators may not be able to react quickly enough and stop the traces before the important data is overwritten by newer data.
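The wrap-mode drawback described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory trace buffer of fixed capacity; the names `WrapTrace`, `record`, and `snapshot` are illustrative only and do not correspond to any particular trace tool.

```python
from collections import deque


class WrapTrace:
    """Hypothetical fixed-size trace buffer operating in wrap mode."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry once full,
        # which mimics a trace that wraps around and overwrites old records.
        self._entries = deque(maxlen=capacity)

    def record(self, timestamp, message):
        self._entries.append((timestamp, message))

    def snapshot(self):
        # Return whatever is still held at the moment the trace is stopped.
        return list(self._entries)


trace = WrapTrace(capacity=3)
trace.record(1, "request received")
trace.record(2, "screen freeze begins")
trace.record(3, "screen freeze ends")
trace.record(4, "routine work")
trace.record(5, "routine work")

# The operator stops the trace only now, after two more entries were written.
remaining = trace.snapshot()
```

In this sketch the entry marking the start of the freeze has already been overwritten by the time the trace is stopped, which is the situation an operator faces when reacting too late.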
Besides trace tools, users can use watchdog tools, which perform diagnostic tasks when specific events occur under specific conditions. Examples of events include an abnormal end of a task or process; an error condition, such as an I/O error, machine check, or application error message; the execution of specific instructions, such as branch instructions; and read or write access to specific storage locations. Examples of conditions include a process, task, or job name; a program, module, or entry point name; and the contents of specific storage locations. However, if the problems cannot be linked to specific events and the responsible component in the application or system code is not known, watchdog tools cannot be used.
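The event-and-condition model of a watchdog tool, and its limitation, might be sketched as follows. This is an illustration under stated assumptions, not the interface of any real watchdog product: `Event`, `WatchdogRule`, `dispatch`, and the event and process names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    kind: str           # hypothetical event type, e.g. "io_error" or "abend"
    process_name: str   # name of the process in which the event occurred


@dataclass
class WatchdogRule:
    event_kind: str                    # event the rule waits for
    process_name: str                  # condition: only this process qualifies
    action: Callable[[Event], None]    # diagnostic task to perform

    def matches(self, event):
        return (event.kind == self.event_kind
                and event.process_name == self.process_name)


def dispatch(rules, event):
    """Run the action of every matching rule; return how many rules fired."""
    fired = 0
    for rule in rules:
        if rule.matches(event):
            rule.action(event)
            fired += 1
    return fired


collected = []
rules = [
    WatchdogRule("io_error", "batch_job_a",
                 lambda e: collected.append(f"dump for {e.process_name}")),
]

# A known event in a known process triggers the diagnostic action.
dispatch(rules, Event("io_error", "batch_job_a"))

# A slowdown with no identifiable event or component matches no rule.
unmatched = dispatch(rules, Event("slow_response", "unknown"))
```

The second call fires no rule: a wait situation that cannot be tied to a predefined event or to a known component gives the watchdog nothing to match against.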