1. Field
The embodiments discussed herein are directed to a cluster control apparatus, a control system, a control method, and a control program for disconnecting a server that is unable to execute a job to be processed.
2. Description of the Related Art
A cluster system that links a plurality of computers together to make them look like a single computer system has been used as a scientific and technical processing system or a corporate core system. The computers (hereinafter referred to as “nodes”) in the cluster system can be broadly classified into two types: compute nodes for performing computations and management nodes for managing all of the compute nodes.
In the cluster system, a single computation (hereinafter referred to as a “job”) can be performed on a plurality of compute nodes, or a plurality of computations can be performed simultaneously on a plurality of compute nodes. Hereinafter, a job that is executed in parallel over a plurality of compute nodes is referred to as a “parallel job” and a single job executed by a single compute node is referred to as a “serial job”.
FIG. 1 illustrates a case in which a job (parallel job) is executed on a plurality of compute nodes. When a cluster system has three compute nodes and a parallel job is executed using all of them, only two compute nodes are left if one compute node is stopped due to an abnormality or the like. That is, a job that is defined to need three compute nodes cannot be executed in the cluster system, and even if the job can be executed by two compute nodes, its processing speed decreases.
For such a case, a backup compute node may be prepared. The backup compute node can be prepared manually or automatically. In the automatic case, the operation varies depending on the software implementation.
Conventionally, as a first installation method, an extra compute node is installed from the beginning but is not used during normal operation. The operation method for job execution is determined in advance so that the extra compute node, though installed, is not used. As illustrated in FIGS. 2A and 2B, although there are four compute nodes in the cluster system, only three of them are used for a job (see FIG. 2A). With such a configuration, although one compute node is unused in normal times, a three-compute-node configuration is maintained even if one of the compute nodes in use is stopped, and thus the job can be executed (see FIG. 2B).
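The first installation method can be sketched as follows. This is a minimal illustration only, assuming a simple in-memory model; the class `Cluster`, its fields, and the node names are hypothetical and do not come from any particular cluster software.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model: all nodes are running, but only some are used for the job."""
    active: set = field(default_factory=set)  # nodes assigned to the job
    spare: set = field(default_factory=set)   # running but unused extra nodes

    def fail(self, node: str) -> None:
        """Drop a failed node; if it was in use, bring in a spare node when one exists."""
        if node in self.active:
            self.active.remove(node)
            if self.spare:
                replacement = min(self.spare)  # deterministic choice for the sketch
                self.spare.remove(replacement)
                self.active.add(replacement)
        else:
            self.spare.discard(node)

    def can_run(self, required_nodes: int) -> bool:
        """A job defined to need `required_nodes` can run only if enough nodes are in use."""
        return len(self.active) >= required_nodes


# Four nodes installed, three used for the job (as in FIG. 2A).
cluster = Cluster(active={"n1", "n2", "n3"}, spare={"n4"})
cluster.fail("n2")  # one node in use stops (as in FIG. 2B)
```

After the failure, the spare node takes the place of the stopped one, so the three-node job remains executable. A second failure would leave only two nodes in use, because no spare remains.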
Conventionally, as a second installation method, a compute node is kept stopped and unused. Specifically, as illustrated in FIGS. 3A and 3B, although there are four compute nodes in the cluster system, one of them is kept stopped (see FIG. 3A). When a management node detects that one of the three compute nodes in operation has broken down, the management node powers on the stopped compute node and installs it for the operation (see FIG. 3B). Thus, even if one compute node in use is stopped, a three-compute-node configuration is maintained, and the job operation continues.
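The second installation method can be sketched in the same spirit. The names `State`, `handle_failure`, and `running_count` are hypothetical, and the power-on step is represented only by a state change; an actual management node would issue a real power-control command at that point.

```python
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"  # powered off and held in reserve
    FAILED = "failed"


def handle_failure(nodes: Dict[str, State], failed: str) -> Optional[str]:
    """Mark `failed` as FAILED and power on one STOPPED node.

    Returns the name of the node that was powered on, or None if no
    stopped node is available.
    """
    nodes[failed] = State.FAILED
    for name in sorted(nodes):
        if nodes[name] is State.STOPPED:
            nodes[name] = State.RUNNING  # stands in for a real power-on command
            return name
    return None


def running_count(nodes: Dict[str, State]) -> int:
    """Number of compute nodes currently in operation."""
    return sum(1 for state in nodes.values() if state is State.RUNNING)


# Four nodes installed, one of them stopped (as in FIG. 3A).
nodes = {
    "n1": State.RUNNING,
    "n2": State.RUNNING,
    "n3": State.RUNNING,
    "n4": State.STOPPED,
}
replacement = handle_failure(nodes, "n2")  # breakdown detected (as in FIG. 3B)
```

Unlike the first method, the reserve node consumes no power in normal times, at the cost of a start-up delay before it can join the operation.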
Conventionally, as a third installation method, as illustrated in FIGS. 4A and 4B, an operation is performed by different compute node groups (hereinafter referred to as “job execution groups”) having different application purposes (see FIG. 4A). If a compute node in one job execution group is stopped, a compute node in the other job execution group is used in a shared manner (see FIG. 4B). In this method, the number of nodes needed for the job operation can be maintained.
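The third installation method can be sketched as follows. The function `borrow_node`, the group names, and the node names are hypothetical; the sketch assumes each job execution group is simply a set of node names, and that a shared node stays in its original group while also serving the group that lost a node.

```python
from typing import Dict, Set


def borrow_node(groups: Dict[str, Set[str]], short_group: str, donor_group: str) -> str:
    """Share one node of `donor_group` with `short_group` and return its name.

    The node is not removed from the donor group: it is used by both
    groups in a shared manner.
    """
    candidates = sorted(groups[donor_group] - groups[short_group])
    if not candidates:
        raise ValueError("donor group has no node to share")
    shared = candidates[0]
    groups[short_group].add(shared)
    return shared


# Two job execution groups with different purposes (as in FIG. 4A).
groups = {"A": {"a1", "a2"}, "B": {"b1", "b2"}}
groups["A"].discard("a2")                # a node in group A stops
shared = borrow_node(groups, "A", "B")   # group B lends a node (as in FIG. 4B)
```

Group A regains its original node count, while the shared node now carries work for both groups.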
Conventionally, a flexible cluster system has been disclosed in which the roles of server apparatuses belonging to respective clusters can be flexibly changed. An alternate processing scheme has also been disclosed for a tightly coupled multiprocessor system that includes a plurality of PGs (processor groups) and executes jobs by specifying the respective PGs; in this scheme, when a job that specifies a PG unable to execute the job is alternatively processed by another PG, the loads of the respective PGs are equalized. In addition, a mechanism has been disclosed in which, when booting or rebooting is performed due to the occurrence of some kind of abnormality during normal OS operation, an abnormal processor is not selected as the master processor used to start up the system.