The present invention generally relates to the detection of abnormalities in systems, and more particularly, to an autonomic method, apparatus, and computer program for collecting debug data during an abnormality.
In a well-behaved system, the applications, and hence the overall system, perform within target performance expectations to meet established Service Level Agreements (“SLA”) under a service contract where a level of performance is formally defined. For a variety of reasons, there will be occasions where the SLAs are not met, in some instances because application response times are too high. These reasons may be volume related, or there may be specific system or application issues.
Generally, in these circumstances, support personnel may be contacted for assistance in diagnosing the problem. However, before the problem can be diagnosed, it has to be detected, and further actions generally need to be taken to establish what is actually happening within the system to cause the performance degradation. Establishing what is occurring within the system often involves using real-time, or near real-time, systems management tools to help pinpoint which applications, if any, are suffering performance problems. These tools generally provide enough information to identify an application that is working outside its SLA performance target, but the tools may not give enough information to diagnose the real cause of the problem, such as locks, waiting for information (read/write), any other gaps potentially caused by system resources, or even flaws or bugs in program logic.
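As a rough illustration of the kind of check such systems management tools may perform, the following sketch flags applications whose average observed response time exceeds an SLA target. The function name `find_sla_violations`, the application names, and the millisecond targets are hypothetical and are not taken from any particular tool.

```python
from statistics import mean


def find_sla_violations(response_times_ms, sla_targets_ms):
    """Return {application: average response time} for every application
    whose average response time exceeds its SLA target (hypothetical helper)."""
    violations = {}
    for app, samples in response_times_ms.items():
        target = sla_targets_ms.get(app)
        # Skip applications with no defined target or no samples.
        if target is None or not samples:
            continue
        average = mean(samples)
        if average > target:
            violations[app] = average
    return violations


# Assumed example data: response-time samples and SLA targets in milliseconds.
observed = {"orders": [100, 300], "billing": [50, 60]}
targets = {"orders": 150, "billing": 100}
violations = find_sla_violations(observed, targets)
```

A check of this kind can show which application is outside its target, but, as noted above, it says nothing about why, for example whether the delay comes from locks, read/write waits, or a logic flaw.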
To provide the necessary information to diagnose the problem, in many circumstances, a system or application debug trace or a system or application performance monitoring trace is initiated. These traces collect debug and/or performance data that can be used by a system administrator or other support personnel to establish exactly what the application is doing and why the application is not meeting its normal performance targets. However, the application/system traces are typically manually started, allowed to run for some period of time, and then manually stopped. This can lead to the traces running for longer than necessary to capture the problem, or to the traces failing to capture the relevant data at all.
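The timing problem with manually controlled traces can be made concrete with a small sketch. Assuming, purely for illustration, that the trace window and the abnormal period are each described by start and stop times in seconds, the hypothetical function `trace_coverage` below computes how much of the problem a trace captured and how much of the trace ran unnecessarily.

```python
def trace_coverage(trace_start, trace_stop, problem_start, problem_stop):
    """Compare a manually chosen trace window with the period in which the
    problem actually occurred (all times in seconds; hypothetical helper)."""
    # Length of the overlap between the two intervals, or 0 if they are disjoint.
    overlap = max(0.0, min(trace_stop, problem_stop) - max(trace_start, problem_start))
    trace_length = trace_stop - trace_start
    return {
        "captured_s": overlap,               # problem time covered by the trace
        "excess_s": trace_length - overlap,  # trace time spent outside the problem
        "missed": overlap == 0.0,            # True if the trace saw none of the problem
    }


# A 10-minute trace around a 30-second problem: captured, but mostly overhead.
long_trace = trace_coverage(0, 600, 100, 130)

# A trace started after the problem had already ended: nothing captured.
late_trace = trace_coverage(200, 300, 100, 130)
```

In the first case the trace captures the 30-second problem but runs for 570 seconds longer than needed; in the second case the trace misses the problem entirely, which corresponds to the two failure modes described above.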
This process may also extend the time needed to diagnose the origin of the excessive application response times. Additional interactions may result between the support personnel and the system administrator (or other customer), including, for example, not only the initial contact but also multiple requests for trace data if the traces were not manually started and stopped at the appropriate times. Unnecessarily extended trace periods may also add overhead that results in false SLA failures.