1. Field of the Invention
The present invention relates to a technology for controlling an allocation of a job to each node in a cluster system including a plurality of nodes mutually connected by a high-speed network.
2. Description of the Related Art
Recently, in the field of high performance computing, personal computers (PCs) built from inexpensive commodity hardware have increasingly been used as computing machines in place of supercomputers. Since the performance of a single PC is limited, high performance is realized by mutually connecting many PCs via a special-purpose high-speed interconnect to form a cluster.
In large-scale computing using such a cluster, parallel applications must run over long periods while the nodes communicate with one another. When a failure occurs in the cluster, even if it occurs in just one node, in the worst case the computation results at all nodes in the cluster can be lost. To minimize such loss of computation results, checkpoint data is output and stored regularly while the parallel applications are executing. When a failure occurs, the computation can be restored from the last checkpoint that has been stored. This function of regularly storing checkpoint data is a minimum requirement regardless of the scale of the cluster.
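The periodic checkpointing described above can be illustrated with a minimal single-process sketch. It is not taken from any particular cluster system: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter used to simulate a node failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the checkpoint.

    The rename means a crash during the write cannot leave a partial
    checkpoint in place of the previous complete one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last stored state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, interval, fail_at=None):
    """Sum the step indices 0..total_steps-1, checkpointing every `interval` steps.

    fail_at is a hypothetical hook that simulates a failure at a given step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that fails at step 7 with an interval of 3 leaves a checkpoint from step 6 on disk. Calling `run` again with the same path resumes from that checkpoint, so only the work performed after step 6 is repeated.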
Generally, methods for storing data such as checkpoint data in a cluster system are broadly divided into shared disk methods, which store the data collectively in network storage such as a storage area network (SAN) or network attached storage (NAS), and mirror disk methods, which use a disk device provided at each node and make the data redundant between nodes. There is a tradeoff between the scale of data these methods can handle and their installation costs.
While the shared disk method is suitable for handling large-scale data, it has a disadvantage: since write operations to the shared disk are often generated simultaneously from jobs executed in series applications, the shared disk must be highly reliable to withstand such an enormous number of disk accesses, and this increases the installation cost.
While the mirror disk method can be configured inexpensively without providing an expensive shared disk such as a SAN, and is already widely used in small-scale clusters of approximately two web servers or the like, it has a disadvantage: since a network load is applied when making the data redundant, this method is not suitable for systems that handle large-scale data.
Mirror disk methods can be divided more precisely into two types. In active/standby type methods, when data is made redundant between two nodes, one node is designated an active node used for executing normal applications and the other is designated a standby node used only for storing the data produced by the active node (see, for example, Japanese Patent Application Laid-open No. 2002-123406). In active/active type methods, the two nodes both include active and standby functions and make their data mutually redundant.
Among the active/active type methods, mirror disk type clusters arranged such that both nodes execute different applications (hereinafter, “mutual standby type”) have an advantage in that, unlike mirror disk type clusters of other arrangements, each node can be used efficiently. If a cluster system using this mutual standby type can be applied to a large-scale cluster system configured from many nodes, the scale of the cluster system can be expanded with an inexpensive configuration.
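The mutual standby arrangement can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `Node` class, the `pair` helper, and the dictionaries standing in for each node's disk are hypothetical, and real mirroring would take place over the network between disk devices.

```python
class Node:
    """A node that runs its own job and also holds a copy of its partner's data."""

    def __init__(self, name):
        self.name = name
        self.local = {}      # data produced by this node's own job
        self.replica = {}    # mirrored copy of the partner node's data
        self.partner = None

    def write(self, key, value):
        """Store data locally and mirror it to the partner node, if any."""
        self.local[key] = value
        if self.partner is not None:
            self.partner.replica[key] = value

    def recover_partner_data(self):
        """Return the partner's mirrored data, e.g. after the partner fails."""
        return dict(self.replica)


def pair(a, b):
    """Make two nodes mutual standbys for each other."""
    a.partner, b.partner = b, a


node_a, node_b = Node("a"), Node("b")
pair(node_a, node_b)
node_a.write("checkpoint", "state of job A")
node_b.write("checkpoint", "state of job B")
```

Both nodes here execute their own jobs, and each can supply the other's last stored data after a failure, which is why neither node sits idle as a dedicated standby.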
However, the mirror disk type cluster technique, including the mutual standby type, assumes a small-scale cluster system of approximately two to four nodes, and is problematic in that it provides no criterion for freely changing the partner destination node with which data is made redundant. When the partner destination node for making data redundant is determined randomly, network bandwidth consumption increases and overhead caused by differences in processing speed between nodes increases, whereby the processing capability of the entire cluster deteriorates. Consequently, it becomes impossible to obtain a computation processing capability that matches the scale of the cluster.