A computer system manages access to multiple resources of the system, such as a CPU, memory, a storage device (referred to hereinafter as a disk), and a network. Many monitoring tools are available for such resources. Monitoring tools gather information about the availability (or lack thereof) of resources and typically report that information to users or administrators. However, existing monitoring tools suffer from significant drawbacks. One of those drawbacks is described below in the context of Oracle Corporation's Real Application Cluster (“RAC”).
A RAC comprises a single database that is shared by multiple instances of a database server (referred to as database instances). In such a configuration, each separate database instance reads data from and writes data to the same disk space, but each database instance maintains its own separate shared memory, which is only available to the processes of the corresponding database instance.
Currently, a RAC database instance may be evicted from a cluster because the database instance is not responding to other database instances in the cluster, either through network messaging or through disk I/O. One possible reason is that the evicted instance has a relatively high CPU usage level. If CPU usage is relatively high, then a monitoring tool may be unable to obtain CPU time in order to determine that the disk and network are not responding to other instances in the cluster. After the database instance is evicted from the cluster, too little information is available about the machine on which the evicted instance is running, because current monitoring tools (either inside or outside the database) are unable to capture data during the period in which the CPU is saturated (e.g., 99% usage). Eviction of a database instance may also occur when other resources of the corresponding machine, such as disk I/O, network I/O, and memory, are heavily utilized or unresponsive. Without the necessary information, an administrator of the cluster is unable to quickly and accurately determine why the database instance was evicted.
Based on the foregoing, there is a need to provide a computer system monitoring tool that will report exceptional events before resources of the computer system become unavailable or unresponsive. The monitoring tool should also provide an accurate description of the state of the computer system so that an analysis of the gathered statistics will yield the reason(s) why the computer system failed or became (at least temporarily) unresponsive.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.