The present invention relates to system maintenance and diagnosis, and more particularly to techniques for gathering, organizing, and storing diagnostic data related to a monitored system.
Diagnosing defects in systems, such as Oracle database (DB) products, can be a complex and time-consuming task. In a complex software environment, the diagnostic data required to resolve an issue or problem can come from different sources and may be stored in multiple locations and in various formats. For example, for a system comprising multiple components, the state of the various components may be held in different log files, in diagnostic traces corresponding to the components, and so on. The information in these log files and diagnostic traces may be stored in different formats and in different locations, such as in different repositories.
In a typical diagnostic flow, diagnostic data captured at a system site (e.g., a customer site executing one or more product instances) is communicated to a diagnosis site (e.g., the site of the product vendor) for failure analysis. At the diagnosis site, the data received from the system site is analyzed to determine, for example, the occurrence of an error in the system, a root cause of the error, recommendations for mitigating the effects of the error, repair solutions to fix the error, and the like. The results of the analysis may be communicated from the diagnosis site to the system site.
Due to the sheer amount of diagnostic data that may be captured for a monitored system and the often disorganized manner in which the data is gathered and stored at the monitored system site, it is often difficult to establish what diagnostic data is available for the monitored system and where the data is stored. Further, it is also very difficult and time consuming to identify what pieces of diagnostic data need to be submitted to the vendor for analysis. If too little information is provided to the vendor, the submitted data may be insufficient to perform a proper diagnosis of the error. In such a case, the vendor often has to contact the customer again and request additional information, some of which might no longer be available. Further analysis is possible only after the additional requested information is received. Several round trips between the customer and the vendor may be needed before the error can be diagnosed. On the other hand, sending too much diagnostic data to the vendor is also problematic. The diagnostic data collected for a monitored system may include thousands of files and many gigabytes of data. Sending such a large volume of data to the diagnosis site is cumbersome, time consuming, and expensive. Some of the data to be sent in this case may also contain confidential information that may be hard for the sending site to identify and remove. Further, if the data received at a diagnosis site is very large, it takes the vendor a long time to analyze the received diagnostic data and identify the pieces of data within it that are relevant to analyzing the root cause of the problem. Accordingly, under either scenario, developers at the vendor's diagnosis site cannot locate relevant diagnostic information in a timely manner. As a result, the time needed to resolve the issue or problem is increased, leading to customer dissatisfaction.
Further, conventional systems also lack the ability to correlate problems occurring upstream and/or downstream in the product stack or across different product instances, even though such correlation may be useful for diagnosing the problem that caused the error.