The present invention relates to system maintenance and diagnosis, and more particularly to a diagnosability system for collecting, storing, communicating, and analyzing diagnostic information for a monitored system.
When a system encounters a failure or error, diagnostic data is typically collected and stored to a disk for diagnostic analysis (also referred to as dumping diagnostic data to a disk). The diagnostic data may be communicated to a diagnosis site for analysis and resolution of the error. The amount of diagnostic data that is gathered and stored (the stored data also being referred to as diagnostic data dumps) varies from one system to another. Using one conventional approach, all of the data associated with the system is gathered after every error and stored to persistent memory (e.g., a disk) for diagnostic purposes. The stored data is then communicated to a diagnosis site for analysis. Such complete diagnostic data gathering, however, consumes considerable time and valuable system resources. Further, the data that is collected may include thousands of files and many gigabytes of data. Sending such a large volume of data to the diagnosis site is cumbersome, time-consuming, and expensive. Moreover, if the data received at a diagnosis site is very large, it takes the vendor a long time to analyze the received diagnostic data and identify the pieces of data relevant to a particular problem. This increases the amount of time needed to diagnose the error or problem.
In some other systems, only a minimal set of diagnostic data associated with the system is collected and stored upon occurrence of an error during an initial diagnostic process. The diagnostic data gathered by the initial diagnostic process is then analyzed, generally manually, to determine what additional diagnostic processes have to be run to capture additional data that is more relevant to the specific failure and essential for error resolution. This iterative process continues until someone manually determines that sufficient data has been gathered to solve the problem. This second approach causes diagnostic data to be gathered over multiple iterations rather than on the first occurrence of the failure or error. After each iteration, a manual determination has to be made as to whether sufficient diagnostic data has been gathered. This process is very time-consuming and, due to its manual component, also very error-prone. In addition, this process is not an efficient way to gather the required diagnostic data on the first occurrence of a failure. As a result, the time needed to resolve the error is again increased, leading to customer dissatisfaction.
As indicated above, several prior solutions for gathering diagnostic data rely on a human to gather the relevant diagnostic data for a failure, analyze the gathered diagnostic data, and determine whether any additional data needs to be collected. For example, a system administrator of a software system may track the failures in the system and determine the diagnostic data to be gathered and sent to the software vendor for diagnostic analysis. Typically, the administrator has to manually decide which diagnostic data is needed for proper diagnosis of the failure and then generate that data. Gathering a sufficient amount of diagnostic data that is relevant for resolving a particular error usually takes several iterations, including many round trips between the administrator and the software support/development organization. This results in a long resolution time for the failure or error. Further, because of the manual component and because system administrators can have different skill levels, the data gathering process is neither reliable nor repeatable.