“Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Load balancing is often used as well to distribute tasks fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
Clusters typically handle computer tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computer task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group.
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols,” and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group.
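The ordering guarantee described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every message carries a group-wide sequence number and that each member holds back any message that arrives early. The text does not state how ordering is actually maintained, and the `Member` class and its fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical group member that delivers messages strictly by sequence number."""
    name: str
    next_seq: int = 0                               # next sequence number to deliver
    pending: dict = field(default_factory=dict)     # seq -> message held back
    delivered: list = field(default_factory=list)   # messages in delivery order

    def receive(self, seq, msg):
        # Hold the message until every earlier sequence number has been delivered.
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


def demo():
    a, b = Member("A"), Member("B")
    stamped = [(0, "join"), (1, "update"), (2, "commit")]
    for seq, msg in stamped:             # A receives the messages in order
        a.receive(seq, msg)
    for seq, msg in reversed(stamped):   # B receives them in reverse order
        b.receive(seq, msg)
    return a.delivered, b.delivered
```

Under these assumptions, `demo()` returns the same delivery list for both members even though the messages reached them in different network orders.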
In many clustered systems, joint operations are implemented using a “peer”-type protocol, where all members receive a message and each member is required to locally determine how to process the protocol and return an acknowledgment indicating whether the message was successfully processed by that member.
Typically, with a peer protocol, members are prohibited from proceeding with other work until acknowledgments from all members have been received. Moreover, each member is required to send an acknowledgment (ACK) message to every other member when its local processing of a protocol is complete.
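The acknowledgment rule can be sketched from the point of view of a single member. This is a simplified illustration, not the mechanism of any particular clustering product: the `ProtocolRound` class and its method names are hypothetical, and message transport is left out entirely.

```python
class ProtocolRound:
    """Hypothetical per-member record of the ACKs received for one peer protocol."""

    def __init__(self, members):
        self.members = set(members)   # every member of the group
        self.acked = set()            # members whose ACK has arrived so far

    def record_ack(self, sender):
        # Only group members may acknowledge the protocol.
        if sender not in self.members:
            raise ValueError(f"{sender!r} is not a member of this group")
        self.acked.add(sender)

    def complete(self):
        # The local member may proceed only once every member has sent an ACK.
        return self.acked == self.members
```

For a group of members A, B, and C, `complete()` stays false after ACKs from A and B and becomes true only once the ACK from C is recorded.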
Since each member is required to wait for ACK messages from every other member before completing a protocol, whenever one member doesn't promptly send out ACK messages (e.g., due to being locked up, or “hung”), all members participating in the same protocol are effectively stalled while waiting for the ACK messages from the non-responding member. Also, even if a particular member is not completely hung, but is simply slow in responding (e.g., due to delays in obtaining a local resource for the member, a lack of adequate CPU time due to the presence of other, higher priority jobs on the member's node, comparatively lower hardware performance on the member's node, network delays, etc.), all members will likewise appear to be slow as well.
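The stalling effect follows directly from the wait-for-all rule: the protocol finishes no sooner than the slowest ACK. The sketch below assumes a simplified model in which each member has a single response time in seconds and network delay is ignored. The function name and the sample figures are illustrative only.

```python
def protocol_completion_time(response_times):
    """Return the time at which all members can finish, given per-member ACK times."""
    if not response_times:
        raise ValueError("a group needs at least one member")
    # Every member waits for every ACK, so the latest ACK gates the whole group.
    return max(response_times.values())


# Illustrative response times in seconds (made-up values).
healthy = {"A": 0.02, "B": 0.03, "C": 0.025}
one_slow = {**healthy, "C": 5.0}   # member C is slow but not hung
```

With these sample values, the healthy group completes in 0.03 seconds, while the group with one slow member takes 5.0 seconds, and every member observes the same delay regardless of which member caused it.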
Because a problem that degrades the responsiveness of one member often has similar effects on the other group members, diagnosing and correcting a problem in a clustering environment is often hindered by the inherent difficulty of identifying which member in a problematic group is the root cause of the problem.
To identify problematic members, many conventional clustering environments require that members be arbitrarily chosen and analyzed to determine their respective operational status. For example, many conventional environments support the ability to dump a local call stack for a particular member through a local debugging operation performed on that member.
However, diagnosis through the arbitrary selection of members is rarely an efficient way of locating a problematic member, since for a group of N members, roughly N/2 of the members would need to be analyzed on average to locate the problematic member. Moreover, in the worst case, all N members would need to be analyzed. The difficulty in identifying a problematic member is further exacerbated when there is a relatively large number of members, as well as when those members are geographically dispersed (e.g., connected over a wide area network (WAN)).
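The average and worst-case figures can be checked with a short calculation. The sketch assumes that exactly one member is at fault, that members are examined one at a time in random order without repeats, and that the search stops as soon as the faulty member is found. Under those assumptions the exact average is (N + 1)/2, which is close to the N/2 figure for large N. The function names are illustrative.

```python
from fractions import Fraction


def expected_inspections(n):
    """Average members examined when the faulty one is equally likely at any position."""
    if n < 1:
        raise ValueError("group must have at least one member")
    # If the faulty member is the k-th one examined, k inspections are needed;
    # each position k = 1..n has probability 1/n.
    return Fraction(sum(range(1, n + 1)), n)


def worst_case_inspections(n):
    """The faulty member happens to be examined last."""
    if n < 1:
        raise ValueError("group must have at least one member")
    return n
```

For a group of 10 members this gives an average of 11/2 inspections and a worst case of 10.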
Therefore, a significant need exists in the art for an improved manner of diagnosing faults in a clustered computer system, particularly to detect problematic members that are inhibiting the progress of peer protocols executing on multiple members in the system.