Various approaches have been proposed for handling errors or failures in computers. One example is U.S. Pat. No. 6,170,067, System for Automatically Reporting a System Failure in a Server (Liu et al., Jan. 2, 2001); it involves monitoring functions such as cooling fan speed, processor operating temperature, and power supply. However, this example does not address software errors. Another example is U.S. Pat. No. 5,423,025 (Goldman et al., Jun. 6, 1995); it involves an error-handling mechanism for a controller in a large-scale computer using the IBM ESA/390 architecture. In the above-mentioned examples, error handling is not flexible; it is not separated from hardware, and there is no dynamic tuning.
Unfortunately, conventional problem-solving for software often involves prolonged data-gathering and debugging. Collection of diagnostic data, if done in conventional ways, may degrade software performance unacceptably, and may have to be repeated several times until a problem's cause is revealed. Thus, there is a need for automated solutions that provide useful diagnostic data, leading to a useful response; at the same time, the burdens of reproducing and tracing problems need to be reduced, and the destabilizing effects of major code revisions need to be avoided.