Multi-host Computing System
A multi-host computing system is a collection of interconnected computing elements that provide processing to a set of client applications. Each of the computing elements may be referred to as a node or a host. (The word “host” is used herein to avoid confusion with “nodes” of a graph.) A host may be a computer interconnected to other computers, or a server blade interconnected to other server blades in a grid. A group of hosts in a multi-host computing system that have shared access to storage (e.g., shared disk access to a set of disk drives or other non-volatile storage) and that are connected via interconnects may be referred to as a cluster.
FIG. 1 is a diagram of an example multi-host computing system that includes four hosts 110a, 110b, 110c, and 110d. These hosts may communicate with each other via Network 130. The hosts access Disk Bank 140 through Network 130. Disk Bank 140 includes disks that may provide Swap Space 142. A host, such as Host 110a, includes at least a Processor (CPU) 114 and Memory 112. At least part of an Operating System Kernel 116 may reside in Memory 112 and implement system and user processes 118. A process may be a running instance of software, for example, a process that runs database management software.
A multi-host computing system may be used to host clustered servers. A server is a combination of integrated software components and an allocation of computational resources, such as memory, a host, and processes on the host for executing the integrated software components on a processor, where the combination of the software and computational resources is dedicated to providing a particular type of function on behalf of clients of the server. An example of a server is a database server. Among other functions of database management, a database server governs and facilitates access to a particular database, processing requests by clients to access the database.
Resources from multiple hosts in a cluster can be allocated to running a server's software. Each allocation of the resources of a particular host for the server is referred to herein as a “server instance” or “instance.” A database server can be clustered, in which case its server instances may be collectively referred to as a cluster. Each instance of a clustered database server facilitates access to the same database, and the integrity of the data is managed by a global lock manager.
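The relationship among hosts, server instances, and a clustered server described above may be illustrated with a minimal sketch. The class names Host, ServerInstance, and ClusteredServer, and the server name "database_server", are hypothetical and chosen only for illustration; the host names follow the reference numerals of FIG. 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Host:
    """One computing element of the multi-host system (e.g., Host 110a)."""
    name: str


@dataclass(frozen=True)
class ServerInstance:
    """An allocation of one host's resources to a particular server."""
    server_name: str
    host: Host


@dataclass
class ClusteredServer:
    """A server whose instances run on several hosts of a cluster."""
    name: str
    instances: list = field(default_factory=list)

    def add_instance(self, host: Host) -> ServerInstance:
        # Allocating resources of one host to this server creates one instance.
        instance = ServerInstance(self.name, host)
        self.instances.append(instance)
        return instance

    def host_names(self) -> list:
        # Names of the hosts that currently run an instance of this server.
        return [instance.host.name for instance in self.instances]


# Hypothetical example: one database server instance on each host of FIG. 1.
hosts = [Host(name) for name in ("110a", "110b", "110c", "110d")]
db_server = ClusteredServer("database_server")
for host in hosts:
    db_server.add_instance(host)
```

In this sketch, every instance refers to the same logical server, which mirrors the statement that each instance facilitates access to the same database. The global lock manager is not modeled.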
Each host of a cluster comprises multiple components that are interdependent for the purpose of performing the work of the cluster. In addition, hosts in a cluster cooperate with each other to perform global functions such as time synchronization, lock management, and file system management. Thus, a failure in one component on one host may prevent other components on that host from carrying out their functions and/or may adversely affect another host's ability to carry out its work within the cluster.
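The way a failure in one component may affect components on the same host and on other hosts can be sketched as a traversal over a dependency map. The component names and the dependency edges below are assumptions made purely for illustration and are not taken from FIG. 1; the function affected_by is likewise hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each key depends on every component in its list.
# A component is identified by a (host, component name) pair.
DEPENDS_ON = {
    ("110a", "database_instance"): [("110a", "lock_manager"), ("110a", "kernel")],
    ("110a", "lock_manager"): [("110a", "kernel"), ("110b", "lock_manager")],
    ("110b", "database_instance"): [("110b", "lock_manager"), ("110b", "kernel")],
    ("110b", "lock_manager"): [("110b", "kernel")],
}


def affected_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so each component lists the components that depend on it.
    dependents = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(component)

    # Breadth-first traversal from the failed component over its dependents.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    # The failed component itself is not reported as merely "affected".
    affected.discard(failed)
    return affected


affected = affected_by(("110b", "kernel"), DEPENDS_ON)
```

Under these assumed edges, a kernel failure on host 110b reaches the lock manager and database instance on 110b and, through the cross-host lock manager dependency, the lock manager and database instance on 110a.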
Root Cause Diagnosis
As can be seen from the description above, multi-host systems may be very complex, with interdependencies among multiple hardware and software components. When the system fails, it may be difficult to determine the root cause of the failure. Root cause diagnosis seeks to determine, from a set of observations about the system, the underlying cause of a problem so that the cause may be fixed.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.