The present invention relates to system maintenance and diagnosis, and more particularly to techniques for controlling collection of diagnostic data in a monitored system.
When a system encounters a failure or error condition, diagnostic data is typically collected and stored to persistent storage (also referred to as being dumped to persistent storage) for diagnostic analysis. The diagnostic data may be communicated from the system site to a diagnosis site (e.g., a vendor site) for analysis and resolution of the error. The amount of diagnostic data that is captured varies from one system to another. Diagnostic data is generally collected and stored for each occurrence of an error condition. As a result, a repeatable failure or error condition in the system, such as a corrupted storage medium, a hardware failure, or a bug in the system, may cause the system to collect large amounts of diagnostic data in a relatively short period of time, constituting a flood of diagnostic data. Such a flood of diagnostic data may adversely impact system performance, available resources, and reliability. For example, the diagnostic data that is collected may comprise thousands of files and many gigabytes of data. Sending such a large volume of data to the diagnosis site is cumbersome, time-consuming, and expensive. Further, the larger the volume of diagnostic data, the longer it takes to analyze the data and identify the pieces that are relevant to a particular problem.
Existing solutions try to limit the amount of diagnostic data that is collected in a monitored system by imposing a limit on the storage space that is available for storing the diagnostic data. These techniques generally stop all diagnostic data gathering when a predefined storage limit is reached or exceeded. Because diagnostic data gathering stops completely once the limit is reached, diagnostic data that is relevant for diagnosis may not be gathered, and diagnostic data may not be gathered for error conditions that occur afterward. As a result, the diagnostic data gathered for the monitored system using existing solutions may not be sufficient to diagnose the failure or error condition completely, which increases the time to resolution of system failures.