A complex comprising a plurality of networked computers which jointly perform a set task is called a cluster. In this context, the task to be performed is broken down into small task elements and these are distributed over the individual computers. A known type of cluster is the Beowulf cluster, which is used particularly for tasks which involve a large amount of computation. In another form of cluster, it is not the computation speed but rather the availability of the cluster which is in the foreground. With this form of cluster, it is necessary to ensure that if one computer within the cluster fails then the other computers take over the tasks of the failed computer with as little time loss as possible, and ideally with none. Examples of such clusters are web servers within the Internet or else applications with central data storage using a relational database.
Clusters which operate in this manner are also called high-availability clusters and have a plurality of individual servers which are connected to one another via a network. Each server forms a node in the cluster. Servers which handle applications are called application nodes, and servers with central management, control or inspection tasks form inspection nodes. On the application nodes, various applications or various application elements of a large application are executed, and the individual applications may be connected to one another. Further computers outside the cluster, called clients, access the applications running within the cluster and retrieve data.
Besides the application nodes, such a cluster contains the inspection node, which is a central entity. The inspection node monitors the applications running on the individual application nodes and terminates or restarts them where appropriate. If an application node fails, the central entity restarts the failed applications on the other application nodes. To this end, it selects a node which still has sufficient capacity. Depending on the configuration and utilization level of the cluster, this involves either using an application node which has not been used to date or distributing the computation load of the applications which are to be restarted as evenly as possible, an operation which is called load balancing.
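By way of illustration only, the restart behaviour of such a central entity can be sketched as follows. The sketch is not taken from any particular cluster product; the names AppNode and fail_over, the abstract load units and the rule of choosing the node with the lowest resulting utilization are all assumptions made for the example. Each node is assumed to have a capacity greater than zero.

```python
from dataclasses import dataclass, field


@dataclass
class AppNode:
    """Hypothetical application node; capacity and load are abstract units."""
    name: str
    capacity: int                              # assumed > 0
    apps: dict = field(default_factory=dict)   # application name -> load
    alive: bool = True

    @property
    def load(self) -> int:
        return sum(self.apps.values())

    @property
    def free(self) -> int:
        return self.capacity - self.load


def fail_over(nodes, failed):
    """Restart the applications of a failed node on the surviving nodes.

    Returns (placement, unplaced): a dict mapping each restarted
    application to the name of its new node, and a list of applications
    for which no node had sufficient capacity.
    """
    failed.alive = False
    # Place the heaviest applications first so they are not crowded out.
    orphans = sorted(failed.apps.items(), key=lambda kv: kv[1], reverse=True)
    failed.apps = {}

    placement, unplaced = {}, []
    for app, load in orphans:
        # Only nodes that are running and still have sufficient capacity.
        candidates = [n for n in nodes if n.alive and n.free >= load]
        if not candidates:
            unplaced.append(app)
            continue
        # Load balancing (assumed rule): choose the node whose relative
        # utilization would be lowest after taking the application.
        target = min(candidates, key=lambda n: (n.load + load) / n.capacity)
        target.apps[app] = load
        placement[app] = target.name
    return placement, unplaced


if __name__ == "__main__":
    cluster = [
        AppNode("an1", 10, {"web": 4, "db": 5}),
        AppNode("an2", 10, {"mail": 3}),
        AppNode("an3", 10),                    # node not used to date
    ]
    print(fail_over(cluster, cluster[0]))
```

In this example the previously unused node an3 receives the heavier application and the remaining application goes to an2, so the load is spread rather than concentrated on one node. All decisions are taken by the single central function, which mirrors the central entity described above.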
To protect the central entity, that is to say the inspection nodes, against failure in turn, it is necessary to provide them in redundant form, usually using further servers which mirror the central entity. However, such a cluster solution has the drawback that the volume of data interchanged between the application nodes and the central entity is very large. In addition, each application node uses up computation time to respond to the requests from the central entity. Since the central entity also needs to be able to handle every possible failure scenario, the configuration complexity and the associated risk of an incorrect configuration rise considerably.