The present invention relates to system maintenance and diagnosis, and more particularly to techniques for gathering diagnostic data that is relevant to a condition detected in a monitored system.
When a system encounters a failure or error, diagnostic data is typically collected and stored to disk for diagnostic analysis. The diagnostic data may be communicated to a diagnosis site for analysis and resolution of the error. The amount of diagnostic data that is captured varies from one system to another. Using one conventional approach, all of the data associated with the system is gathered and stored to persistent memory (e.g., a disk) for diagnostic purposes. The stored data is then communicated to a diagnosis site for analysis. Such complete diagnostic data gathering, however, consumes considerable time and valuable system resources. Further, the data that is collected may include thousands of files and many gigabytes of data. Sending such a large volume of data to the diagnosis site is cumbersome, time-consuming, and expensive. Moreover, if the data received at the diagnosis site is very large, it takes the vendor a long time to analyze the received diagnostic data and identify the pieces of data that are relevant to a particular problem.
Alternatively, using a second conventional approach, only a basic set of diagnostic data associated with the system is collected and stored during an initial diagnostic process. The diagnostic data gathered by the initial diagnostic process is then analyzed to determine what additional diagnostic processes have to be run to capture additional data that is more relevant to the specific failure and essential for error resolution. This iterative process continues until someone manually determines that sufficient data has been gathered to solve the problem. The second approach thus requires diagnostic data gathering to be performed over multiple stages. At the end of each stage, a manual determination has to be made as to whether sufficient diagnostic data has been gathered. This process is very time-consuming and, because of its manual component, error-prone. Thus, using either approach, the time needed to resolve the error is increased, leading to customer dissatisfaction.
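For illustration only, the staged approach described above may be sketched as a loop in which each stage runs one diagnostic process and a reviewer then decides which process, if any, to run next. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the names `gather_in_stages`, `collectors`, `review`, and `example_review`, and the sample data values, are hypothetical, and the `review` callable merely stands in for the manual determination made at the end of each stage.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical types: a collector runs one diagnostic process and returns
# the data it captured; a reviewer names the next process or returns None.
Collector = Callable[[], Dict[str, str]]
Reviewer = Callable[[Dict[str, str]], Optional[str]]


def gather_in_stages(
    collectors: Dict[str, Collector],
    review: Reviewer,
    first: str = "basic",
    max_stages: int = 10,
) -> Tuple[Dict[str, str], int]:
    """Run diagnostic processes one stage at a time until the reviewer
    reports that sufficient data has been gathered (or max_stages is hit)."""
    gathered: Dict[str, str] = {}
    next_name: Optional[str] = first
    stages = 0
    while next_name is not None and stages < max_stages:
        gathered.update(collectors[next_name]())  # run one diagnostic process
        stages += 1
        next_name = review(gathered)  # stand-in for the manual determination
    return gathered, stages


# Assumed sample processes and data, for demonstration only.
collectors: Dict[str, Collector] = {
    "basic": lambda: {"error_code": "E-100"},
    "trace": lambda: {"stack_trace": "frame_a > frame_b"},
    "heap": lambda: {"heap_summary": "12 MB in use"},
}


def example_review(gathered: Dict[str, str]) -> Optional[str]:
    """Hypothetical reviewer: asks for a trace, then a heap summary, then stops."""
    if "stack_trace" not in gathered:
        return "trace"
    if "heap_summary" not in gathered:
        return "heap"
    return None


data, stages = gather_in_stages(collectors, example_review)
print(stages, sorted(data))
```

In this sketch, every pass through the loop corresponds to one round trip between data gathering and human analysis, which is the source of the delay noted above.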
As indicated above, several prior solutions for gathering diagnostic data rely on a human to gather the relevant diagnostic data for a failure, analyze the gathered diagnostic data, and determine whether any additional data needs to be collected. For example, a system administrator of a software system may track the failures in the system and determine the diagnostic data to be gathered and sent to the software vendor for diagnostic analysis. Typically, the administrator has to manually decide on and generate the diagnostic data that is needed for proper diagnosis of the failure. Gathering a sufficient amount of diagnostic data that is relevant for resolving a particular error usually takes several iterations, including many round trips between the administrator and the software support/development organization. This results in a long resolution time for the failure or error. Further, because of the manual component and because system administrators can have different skill levels, the reliability of the data gathering process is not assured and the process is not repeatable.
Certain diagnostic data gathering actions cannot be performed automatically and instead require customer input, such as customer approval. In today's systems, there is no automated mechanism for managing diagnostic data gathering actions that require customer intervention or customer approval. Often, recommendations to take certain diagnostic data gathering actions are instead communicated through the vendor's support organization or through documentation.