In recent years, demand has been increasing for computer systems, used as information processing systems, that connect many computers (referred to as “calculation nodes” hereinafter) serving as information processing apparatuses to a network and make the plurality of calculation nodes perform calculation processes (referred to as “jobs” hereinafter) in parallel in a distributed manner. In a computer system of this type, a management computer (referred to as a “management node” hereinafter) is used for managing and controlling the hardware of each calculation node and the jobs to be processed by a calculation node group.
In a large-scale computer system, it is difficult for a single management node to manage all the computers, because such a configuration would prolong the time taken to perform the management and degrade the processing performance. Accordingly, large-scale computer systems usually distribute the management processing across a plurality of management nodes.
FIG. 1 explains a management method of a computer system that uses conventional management nodes. In FIG. 1, calculation nodes are denoted by “1” while management nodes are denoted by “2 (2-1, 2-2, 2-3)”.
As illustrated in FIG. 1, in a case of a large-scale computer system, the management nodes 2 are hierarchized into a tree structure. In FIG. 1, the management node 2-1 is the management node 2 serving at the top level in the tree structure and controls the management nodes 2-2, which are at a lower level than the management node 2-1. The management nodes 2-2 control management nodes 2-3, which are at a lower level than the management nodes 2-2. Each management node 2-3 manages calculation nodes 1 that are its management target. The management nodes 2 (2-2 and 2-3) that are controlled by the management nodes 2 (2-1 or 2-2) of higher levels will be referred to as “management sub nodes” hereinafter. The management node 2-1, which is at the top level, will be referred to as the “top-level management node” hereinafter.
In the hierarchization of the management nodes 2 as illustrated in FIG. 1, the management nodes 2 that control the lower-level management sub nodes 2 indirectly manage the calculation nodes 1 via the respective management sub nodes 2. Accordingly, the hierarchical relationships between the management nodes 2 correspond to the inclusive relationships between the management nodes 2 that control the lower-level management sub nodes 2 and the group of the calculation nodes 1 that are directly or indirectly managed by the management nodes 2. Also, the hierarchical relationships correspond to paths used for distributing an instruction message from the top-level management node 2-1 to the respective calculation nodes 1 and paths used for transmitting information from the respective calculation nodes 1 to the top-level management node 2-1.
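The correspondence described above, between the hierarchy, the group of calculation nodes 1 that each management node 2 manages directly or indirectly, and the path of an instruction message, can be illustrated by the following minimal Python sketch. The class name, the method names, and the node labels such as "2-2a" and "c0" are hypothetical and are introduced only for this illustration; the tree shape is an assumption and is not taken from FIG. 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagementNode:
    """One management node in the tree (hypothetical model for illustration)."""
    name: str
    sub_nodes: List["ManagementNode"] = field(default_factory=list)
    calc_nodes: List[str] = field(default_factory=list)

    def managed_calc_nodes(self) -> List[str]:
        """Calculation nodes managed directly or indirectly by this node."""
        result = list(self.calc_nodes)
        for sub in self.sub_nodes:
            result.extend(sub.managed_calc_nodes())
        return result

    def path_to(self, calc_node: str) -> Optional[List[str]]:
        """Path an instruction message follows from this node to calc_node."""
        if calc_node in self.calc_nodes:
            return [self.name, calc_node]
        for sub in self.sub_nodes:
            tail = sub.path_to(calc_node)
            if tail is not None:
                return [self.name] + tail
        return None


# Assumed three-level hierarchy: 2-1 -> 2-2x -> 2-3x -> calculation nodes.
top = ManagementNode("2-1", sub_nodes=[
    ManagementNode("2-2a", sub_nodes=[
        ManagementNode("2-3a", calc_nodes=["c0", "c1"]),
        ManagementNode("2-3b", calc_nodes=["c2", "c3"]),
    ]),
    ManagementNode("2-2b", sub_nodes=[
        ManagementNode("2-3c", calc_nodes=["c4", "c5"]),
    ]),
])

print(top.managed_calc_nodes())
print(top.path_to("c4"))
```

In this sketch, the reverse of the list returned by `path_to` corresponds to the path used for transmitting information from a calculation node 1 to the top-level management node 2-1.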
In a computer system as illustrated in FIG. 1, when one of the plurality of management nodes 2 has failed, some of the calculation nodes 1 can no longer be managed, and which calculation nodes 1 are affected depends upon the management node 2 that has failed. Accordingly, computer systems need to be made redundant in order to increase reliability.
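The dependence of the affected calculation nodes 1 on the position of the failed management node 2 can be illustrated by the following minimal Python sketch. The parent table, the function name, and the node labels are hypothetical, and the sketch assumes a tree without any redundancy.

```python
from typing import Dict, List

# Assumed hierarchy, written as child -> parent (labels are hypothetical).
PARENT: Dict[str, str] = {
    "2-2a": "2-1", "2-2b": "2-1",
    "2-3a": "2-2a", "2-3b": "2-2a", "2-3c": "2-2b",
    "c0": "2-3a", "c1": "2-3a",
    "c2": "2-3b", "c3": "2-3b",
    "c4": "2-3c", "c5": "2-3c",
}
CALC_NODES: List[str] = ["c0", "c1", "c2", "c3", "c4", "c5"]


def unmanageable_after_failure(failed: str) -> List[str]:
    """Calculation nodes whose path to the top-level node passes through `failed`."""
    lost = []
    for calc in CALC_NODES:
        node = PARENT.get(calc)
        while node is not None:  # walk upward until the top-level node is passed
            if node == failed:
                lost.append(calc)
                break
            node = PARENT.get(node)
    return lost


print(unmanageable_after_failure("2-3b"))
print(unmanageable_after_failure("2-2a"))
```

Under these assumptions, a failure at a higher level of the tree cuts off a larger group of calculation nodes than a failure at a lower level.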
The hierarchical relationships between the management nodes 2 illustrated in FIG. 1 are in a fixed tree structure, and each management node 2 needs to perform a task that is specific to its position in the hierarchical relationships (tree structure). As a general rule, when the reliability of a system is to be increased, consideration is given to two conditions: (a) maintaining the function of the system and (b) preserving data being processed. Partly because these two conditions need to be taken into consideration, when the hierarchical relationships of the management nodes 2 as illustrated in FIG. 1 exist, the process of making the management nodes 2 redundant needs to be conducted for each of the management nodes 2.
As described above, in a computer system, jobs that are processed by a group of the calculation nodes 1 are managed and controlled. Data of condition (b) includes information related to the management or control of jobs. When such information has been lost, the loss has a great negative influence on the operation and management of the entire computer system. Accordingly, condition (b) is also very important.
In order to increase the reliability of a computer system, robustness against multiple failures is needed. When a tree structure (hierarchical relationship) of management nodes as illustrated in FIG. 1 is employed, the multiplicity of each management node needs to be increased by a factor of (1+k) in order to attain robustness against k-fold failures (k is an integer equal to or greater than one).
Today, robustness against multiple failures has been realized in large-scale computer systems as well. Such robustness is realized by preparing a computer used for preserving data in each node (vertex or nodal point) in a tree structure and by saving data for a plurality of nodes to that computer.
This realization method can reduce the number of computers for replacing the management nodes 2 that have failed. However, communications between nodes have to be conducted in order to save data. These communications cause a delay. This delay hinders rapid execution of control of a group of the calculation nodes 1, responses to failures in the management nodes 2, etc. Accordingly, it is difficult to employ this realization method for a computer system that is subject to a strict limitation regarding a communication delay time.
As described above, each of the management nodes 2 needs to perform a task specific to the position in the hierarchical relationships (tree structure). Because of this, a realization method may be possible in which the management node 2 for replacement is prepared for each node in a tree structure and each of the management nodes 2 for replacement preserves data. This realization method can be used for a computer system that is subject to a strict limitation regarding a communication delay time because saved data does not need to be accessed.
However, assume that the total number of nodes (vertices or nodal points) representing the management nodes 2 in the tree structure is M and that the multiplicity of the load distribution in a node p in this tree structure is m(p), so that M=Σm(p) is satisfied, where the sum is taken over all nodes p. The number of nodes needed for an arbitrary node p to attain the robustness against k-fold failures is obtained from m(p)×(1+k). Accordingly, the total number of nodes needed to attain the robustness against k-fold failures is Σm(p)×(1+k), that is, a value resulting from the calculation of M×(1+k).
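The calculation above can be checked with the following minimal Python sketch. The function name, the position labels, and the multiplicity values are hypothetical and are chosen only to make the arithmetic concrete.

```python
from typing import Dict


def nodes_required(multiplicity: Dict[str, int], k: int) -> int:
    """Total management nodes needed for robustness against k-fold failures.

    multiplicity maps each position p in the tree to m(p), the multiplicity
    of the load distribution at that position.
    """
    if k < 1:
        raise ValueError("k must be an integer equal to or greater than one")
    m_total = sum(multiplicity.values())  # M = sum of m(p)
    return m_total * (1 + k)              # M x (1 + k)


# Assumed example: three positions with m(p) = 1, 2 and 4, so M = 7.
m = {"level-1": 1, "level-2": 2, "level-3": 4}
print(nodes_required(m, k=2))  # 7 x (1 + 2) = 21

# The per-position sum of m(p) x (1 + k) gives the same total.
print(sum(mp * (1 + 2) for mp in m.values()))  # 3 + 6 + 12 = 21
```

With these assumed values, 14 of the 21 nodes exist only as replacements, which illustrates how quickly the backup resources grow with M and k.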
In this realization method, it is necessary to prepare a number of the management nodes 2 for replacement in accordance with the number of nodes and the assumed value of k. In a large-scale computer system, the number of nodes is very large. Accordingly, a very large number of the management nodes 2 for replacement have to be prepared, so that the backup system for the redundancy requires immense resources. The existence of a backup system having immense resources increases both the construction costs and the operational costs of the computer system. In view of this, it is also important to suppress the resources of a backup system when robustness against multiple failures is to be attained.
Patent Document 1: Japanese Laid-open Patent Publication No. 2008-153735
Patent Document 2: Japanese Laid-open Patent Publication No. 10-21103