Computer clustering technologies are used to provide fail-proof operation of computer systems and to complete complex computing tasks. The popularity of clusters is explained primarily by the ever-growing number of tasks and by the increasing need to complete them around the clock (24/7). In the present context, a cluster means a group of physically or logically combined computers, including servers, organized so that they appear as a single computing resource from the user's perspective.
To ensure fail-proof operation, clusters include task re-distribution systems that take into account the performance of the nodes (the computers composing the cluster). In most of today's systems, re-distribution of tasks between the operational nodes of a cluster is handled by task managers. In particular, U.S. Pat. No. 7,661,015 discloses a system for completing tasks on schedule in a cluster composed of multiple servers. This approach involves a central manager, as well as agents that report the status of the servers to the central manager. Each task is assigned to a specific node; if a task fails to complete, the manager re-assigns it to another computer of the cluster.
U.S. Pat. No. 5,987,621 discloses a mechanism for fail-proof operation of file servers, where a lead server periodically polls other cluster nodes, in order to obtain information on their operability; if an inoperative server is detected, a set of tasks is determined which must be handed over to an operable cluster node in order to be completed. Similar ideas are also described in WIPO Publication No. WO2007/112245A2.
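The polling approach described above can be illustrated with a minimal sketch. It is not taken from the cited patents; the function name `poll_and_collect`, the `is_operable` callback, and the node and task names are assumptions introduced only for illustration. A lead server polls each node once and collects the tasks owned by nodes that do not respond.

```python
from typing import Callable, Dict, List


def poll_and_collect(
    assignments: Dict[str, List[str]],
    is_operable: Callable[[str], bool],
) -> List[str]:
    """Poll every node once; return the tasks owned by inoperative nodes."""
    orphaned: List[str] = []
    for node, tasks in assignments.items():
        # is_operable stands in for a real network poll (hypothetical).
        if not is_operable(node):
            orphaned.extend(tasks)
    return orphaned


# Hypothetical cluster state: node-b does not answer the poll.
assignments = {
    "node-a": ["backup"],
    "node-b": ["index", "scan"],
    "node-c": ["report"],
}
down = {"node-b"}
orphaned = poll_and_collect(assignments, lambda node: node not in down)
```

In this sketch the set of tasks to be handed over is simply every task assigned to an unresponsive node; a real system would also need to decide which operable node receives each task.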
U.S. Pat. No. 4,980,857 discloses a system for performing work in a distributed computing environment. The proposed system includes multiple computer nodes with controllers, devices ensuring fail-proof operation, and a task manager. Each node in the proposed solution is assigned a list of tasks it must complete. The status of each node is checked periodically. A node can also send error messages to all cluster members; such messages serve as a basis for determining the parameters for the server under consideration. If any of the servers fails, the task manager re-distributes its tasks between the remaining cluster nodes.
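The re-distribution step can be sketched as follows. The cited patent does not specify the placement policy reproduced here; the least-loaded rule, the function name `redistribute`, and the node and task names are assumptions chosen to keep the example small.

```python
from typing import Dict, List


def redistribute(
    assignments: Dict[str, List[str]], failed: str
) -> Dict[str, List[str]]:
    """Move the failed node's tasks onto the remaining nodes.

    Assumed policy: each task goes to the node that currently holds the
    fewest tasks; ties are broken by node name.
    """
    remaining = {n: list(t) for n, t in assignments.items() if n != failed}
    if not remaining:
        raise RuntimeError("no operable nodes left in the cluster")
    for task in assignments.get(failed, []):
        target = min(remaining, key=lambda n: (len(remaining[n]), n))
        remaining[target].append(task)
    return remaining


# Hypothetical task lists; node n1 fails.
before = {"n1": ["t1", "t2"], "n2": ["t3"], "n3": []}
after = redistribute(before, "n1")
```

Note that the function itself runs on the task manager, so this sketch inherits the weakness discussed in this section: if the manager is inoperative, nothing performs the re-distribution.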
Despite the existing technologies for re-distributing tasks when an inoperative cluster node is detected, the problem remains of ensuring fail-proof operation when the task distribution manager itself is inoperative, or when no task distribution manager is active in a cluster whose nodes have manager functions. For example, U.S. Pat. No. 7,661,015 proposes to track the status of the manager and, if it is inoperative, to select a new manager from among the operable cluster nodes. In this regard, it should be noted that the manager replacement process can take some time, which is critical when working with a large number of tasks that need to be completed at precisely set times.
Also, when building clusters, much attention is paid to the technologies for announcing the inoperability of nodes. In existing systems, for example, in U.S. Pub. No. 2010/0211829 and U.S. Pat. No. 7,451,359, the servers periodically update their status and include a time stamp of the update in order to determine operability status. However, when working with periodic tasks that have a relatively short completion time, this approach needs enhanced synchronization of the clocks inside the cluster, which requires additional resources and costs. In order to resolve this issue, U.S. Pat. Nos. 5,987,621 and 4,980,857 describe a method where, as time progresses, each cluster node sends signals to other nodes in order to show that it is operable. It should be noted that this solution substantially increases the frequency of communications between the cluster nodes, even with a slight increase in the status update frequency and in the number of cluster nodes; this can cause a delay in signal distribution and, consequently, incorrect interpretation of the cluster node status.
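Both drawbacks can be made concrete with a small numerical sketch. The figures and function names below are illustrative assumptions, not values from the cited documents. The first function assumes that every node signals every other node on each update, which gives n(n-1) messages per update; the second shows how an observer whose clock runs ahead can misjudge a fresh time stamp as stale.

```python
def heartbeat_messages_per_second(nodes: int, updates_per_second: float) -> float:
    """All-to-all signalling (assumed): each node signals every other node."""
    return nodes * (nodes - 1) * updates_per_second


def looks_inoperable(stamp: float, observer_now: float, timeout: float) -> bool:
    """Time-stamp check: a status older than `timeout` is treated as stale."""
    return observer_now - stamp > timeout


# Doubling both the node count and the update rate (hypothetical figures).
small = heartbeat_messages_per_second(10, 1.0)
large = heartbeat_messages_per_second(20, 2.0)

# A node stamps its status at true time 100.0 s. Half a second later an
# observer whose clock is 3 s ahead reads 103.5 s; with a 2 s timeout the
# node is wrongly classified as inoperable.
skewed = looks_inoperable(stamp=100.0, observer_now=103.5, timeout=2.0)
synced = looks_inoperable(stamp=100.0, observer_now=100.5, timeout=2.0)
```

Under these assumptions the message rate grows from 90 to 760 messages per second, and the clock offset alone, rather than any fault in the node, produces the incorrect status.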