1. Field of the Invention
The present invention relates to a cluster controlling system for monitoring and controlling a cluster system, and for transferring package programs which have been operating on a computer suffering from a failure to another computer which is within the cluster to execute the programs.
2. Description of the Prior Art
A "cluster" in the conventional art includes both a close-coupled cluster, in which a main memory is shared by a plurality of CPUs, and a loose-coupled cluster, in which data is shared by computers using a LAN or a common disk. The present invention is applied only to the latter type of cluster.
FIG. 50 shows an example of a conventional cluster system. In FIG. 50, a plurality of computers A-N (101a-101n) constitute a cluster. The respective computers execute cluster daemons A-N (102a-102n), and the respective cluster daemons start up the corresponding package programs A1-N2 (103a1-103n2). The term "package program" is a general term referring to application and service programs.
The respective cluster daemons monitor and control resources (various services or network addresses provided by a CPU, LAN, disk and package programs) on the executing computers, and store the data on the respective computers as local data A-N (104a-104n).
The operation of the cluster system is explained using FIG. 51. When a resource A (2401a) required by computer A (101a) is lost, the cluster daemon A (102a) stops the computer A (101a). The cluster daemon N (102n) on another computer N (101n) detects that the computer A (101a) has stopped, and the computer N (101n) then executes the package program A1 (103a1), which had been executed by the computer A (101a).
Thus, a specific package program is executed on any one of the computers within the cluster. Because a network address is assigned to every package program, a user who utilizes the services provided by a package program does not need to know exactly which computer in the cluster is executing it.
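The conventional takeover described above can be illustrated with a minimal sketch. It is not taken from any of the systems discussed here: the class `Computer`, the functions `take_over` and `locate`, and the network addresses are hypothetical names chosen only for illustration, and the detection of a stopped computer is reduced to checking a flag.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    """One computer in the cluster, as seen by its cluster daemon (hypothetical model)."""
    name: str
    running: bool = True
    # Package name -> network address assigned to that package (illustrative values).
    packages: dict = field(default_factory=dict)

    def lose_required_resource(self):
        # The local cluster daemon stops its own computer when a required resource is lost.
        self.running = False


def take_over(cluster, survivor_name):
    """The survivor's daemon detects stopped computers and executes their packages itself."""
    survivor = cluster[survivor_name]
    for computer in cluster.values():
        if not computer.running and computer.packages:
            survivor.packages.update(computer.packages)
            computer.packages = {}


def locate(cluster, address):
    """Return the name of the running computer that currently serves the given address."""
    for computer in cluster.values():
        if computer.running and address in computer.packages.values():
            return computer.name
    return None


cluster = {
    "A": Computer("A", packages={"A1": "192.0.2.11"}),
    "N": Computer("N", packages={"N1": "192.0.2.21"}),
}

served_before = locate(cluster, "192.0.2.11")  # computer A serves package A1
cluster["A"].lose_required_resource()          # resource A is lost, daemon A stops computer A
take_over(cluster, "N")                        # daemon N detects the stop and takes over A1
served_after = locate(cluster, "192.0.2.11")   # the same address is now served by computer N
```

In this sketch the user keeps addressing package A1 by the same network address before and after the failure; only the result of `locate` changes, which is why the user does not need to know which computer executes the package.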
Exemplary systems for centrally monitoring and controlling distributed resources are disclosed in references such as Japanese Laid-open Patent publication No. 5-75628, "Network resource monitoring system", Japanese Laid-open Patent publication No. 5-134902, "Information Managing System In A Distributed Computing System", and Japanese Laid-open Patent publication No. 6-223020, "Network Management System And Managing Method Of Objects".
These systems are achieved by using managing computers, or by incorporating a managing process (manager). However, none of the above references suggests a solution for the case in which a failure occurs in the managing computers or the managing process.
Because conventional cluster systems are constructed in the manner explained above, a program for monitoring and controlling the entire system has had to communicate with every computer in the cluster, since the data is distributed across all of the computers. Such programs have therefore been difficult to write.
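The burden this places on a monitoring program can be sketched as follows. The function `collect_cluster_state` and the two fetch functions are hypothetical stand-ins for whatever communication a real monitoring program would use; the sketch only assumes that each computer's local data must be requested separately and that any such request may fail.

```python
def collect_cluster_state(fetchers):
    """Ask every computer for its local data and merge the answers.

    `fetchers` maps a computer name to a callable that returns that computer's
    local data, or raises ConnectionError if the computer cannot be reached.
    """
    merged = {}
    unreachable = []
    for name, fetch in fetchers.items():
        try:
            merged[name] = fetch()    # one separate exchange per computer
        except ConnectionError:
            unreachable.append(name)  # the caller must handle every partial failure
    return merged, unreachable


def fetch_a():
    # Illustrative local data held on computer A.
    return {"cpu": "ok", "package A1": "running"}


def fetch_n():
    # Illustrative failure: computer N does not answer.
    raise ConnectionError("computer N is not responding")


state, unreachable = collect_cluster_state({"A": fetch_a, "N": fetch_n})
```

Even in this reduced form, the monitoring program needs one exchange per computer and its own handling of unreachable computers, and both grow with the size of the cluster.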
In addition, in systems which centrally monitor and control distributed resources, the monitoring function stops completely when a fault or failure occurs either in the computer itself or in the process for monitoring and controlling the entire system. Moreover, because the interrelationship or priority order among the various package programs could not be defined, it has been difficult to transfer the data from other multiplexed systems.
In addition, it takes a long time to restart and recover the package. Moreover, since switching and the processing of packages takes a long time after the system has been recovered, a package to be processed by a parallel operation cannot be processed well. Therefore, the performance of the system deteriorates after recovery.