1. Field of the Invention
The present invention relates generally to computer clusters that include a plurality of computer nodes. More particularly, the present invention relates to a mechanism for collecting state information within the cluster. In this context, state information refers to data that indicates how well the resources of a computer node are able to complete their tasks in the cluster. The state information may thus include not only data indicating the current load of the various resources of a computer node, but also data about the current performance or capacity of those resources, i.e., data about their current ability to complete their tasks in the cluster.
2. Description of the Related Art
As is commonly known, a computer cluster is a group of computers working together to complete one or more tasks. Computer clusters can be used for load balancing, for improved fault tolerance (i.e., for improved availability in case of failures), or for parallel computing, for example.
A typical computer cluster comprises a plurality of computer nodes. A computer node here refers to an entity provided with a dedicated processor, memory, and operating system, as well as with a network interface through which it can communicate with other computer nodes of the cluster. At least one of the computer nodes in the cluster is capable of acting as a manager node that manages the cluster. Typically, only one computer node at a time acts as a manager node. In order to detect failures in the cluster, the manager node periodically sends certain messages, called heartbeats, to the other computer nodes in the cluster.
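By way of illustration only, heartbeat-based failure detection of the kind described above may be sketched as follows. The sketch assumes that each node replies to a heartbeat and that a node is suspected of failure when no reply has been seen within a fixed timeout; the reply mechanism, the timeout policy, and all names (HeartbeatMonitor, record_reply, suspected_failures) are hypothetical and are not taken from any particular cluster implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Manager-side bookkeeping for heartbeat replies (illustrative only)."""

    # Assumed policy: seconds without a reply before a node is suspected.
    timeout: float
    # Time of the most recent reply seen from each node.
    last_reply: dict = field(default_factory=dict)

    def record_reply(self, node_id: str, now: float) -> None:
        """Note that node_id answered a heartbeat at time now."""
        self.last_reply[node_id] = now

    def suspected_failures(self, now: float) -> list:
        """Return the nodes whose last reply is older than the timeout."""
        return sorted(
            node for node, seen in self.last_reply.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_reply("node-a", now=0.0)
monitor.record_reply("node-b", now=0.0)
monitor.record_reply("node-a", now=2.0)   # node-b misses this round
suspects = monitor.suspected_failures(now=4.0)
```

In this sketch, time is passed in explicitly so that the behavior is easy to follow; a real manager node would typically read a monotonic clock instead.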
Control software, typically residing in the manager node, has to monitor all computer nodes that belong to the cluster. In order to get a true and up-to-date picture of the state of the nodes, the control software has to collect state information from the nodes at a fairly high frequency. This is a problem especially in large computer clusters, which may contain tens or even hundreds of computer nodes. In these large computer clusters, the data collection rate has to be compromised so that the network does not become congested due to the data collection and so that the performance of the computer nodes remains at an acceptable level despite the data collection performed.
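The scaling of the collection load with cluster size can be illustrated with a rough estimate. The sketch below assumes a simple polling scheme in which the manager node sends one request to each node per collection interval and receives one reply of fixed size; the scheme, the function name polling_load, and the example figures are assumptions made for illustration only.

```python
def polling_load(num_nodes: int, interval_s: float, reply_bytes: int) -> tuple:
    """Estimate the manager-side load of polling every node once per interval.

    Returns (messages per second, reply bytes per second). Assumes one
    request and one reply per node per interval (hypothetical scheme).
    """
    messages_per_s = 2 * num_nodes / interval_s       # request + reply
    bytes_per_s = num_nodes * reply_bytes / interval_s
    return messages_per_s, bytes_per_s


# Hypothetical example: 200 nodes polled every 0.5 s, 1024-byte replies.
load = polling_load(200, 0.5, 1024)
```

Under these assumptions the load grows linearly with the number of nodes and inversely with the collection interval, so halving the interval or doubling the cluster size doubles the traffic that the manager node and the network must absorb.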
The objective of the present invention is to eliminate or alleviate this drawback.