This invention relates to a fail-over method for a computer system provided with servers that are booted up from an external disk drive, and more specifically to a method for shortening the time required to execute the fail-over.
In a computer system whose servers are started up from an external disk array device, the disk array device is usually connected to a plurality of servers through a Fibre Channel link and/or a Fibre Channel switch. A boot disk used by each such server is provided in an area of the disk array device. Consequently, the boot disk used by a given server in the computer system can be referred to from another server in the computer system.
In such a computer system, in which a plurality of servers are connected to an external disk drive through a network and each boots up an operating system from the external disk drive, if an error occurs in an active server that is operating an application job, the job is taken over by a backup server that is not operating the job at that time. There is a method for enabling such taking-over processing between servers (e.g., US Patent 2006/0143498A1). According to the method, at the time of taking over, the system detects an error that has occurred in an active server, searches for another server that has the same configuration as the active one and is not operating the subject job, and enables the found server to access the external disk drive and to be started up from the disk drive, whereby the job is taken over by the found server. Such processing for enabling a backup server to take over a job from the server in which the error has occurred is referred to as fail-over processing.
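The taking-over sequence described above (detect the error, search for a server with the same configuration that is not operating the job, enable its access to the external disk drive, and start it up) can be illustrated with a minimal sketch. All names here (`Server`, `DiskArray`, `find_backup`, `fail_over`, and the `config` label) are hypothetical and are not taken from the cited method; booting is only simulated by setting a flag.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    config: str                        # hypothetical label for the hardware configuration
    running_job: Optional[str] = None  # application job currently operated, if any
    failed: bool = False               # set when an error has been detected
    booted: bool = False               # simulated power-on state


class DiskArray:
    """Hypothetical stand-in for access control on the external disk array."""

    def __init__(self) -> None:
        self.boot_disk_owner: Dict[str, str] = {}  # boot disk -> server name

    def grant(self, boot_disk: str, server: Server) -> None:
        # Enable the given server to access (and boot from) the boot disk.
        self.boot_disk_owner[boot_disk] = server.name


def find_backup(active: Server, servers: List[Server]) -> Optional[Server]:
    # Search for a healthy server with the same configuration that is idle.
    for s in servers:
        if (s is not active and not s.failed
                and s.running_job is None and s.config == active.config):
            return s
    return None


def fail_over(active: Server, servers: List[Server],
              array: DiskArray, boot_disk: str) -> Optional[Server]:
    if not active.failed:
        return None                    # no error detected, nothing to do
    backup = find_backup(active, servers)
    if backup is None:
        return None                    # no suitable backup server exists
    array.grant(boot_disk, backup)     # switch boot-disk access to the backup
    backup.booted = True               # simulated start-up from the boot disk
    backup.running_job, active.running_job = active.running_job, None
    return backup
```

In this sketch the backup server is powered on only after the error is detected, which is the point at which the start-up delay discussed later is incurred.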
Such fail-over processing, in which a backup server is started up upon error occurrence in an active server, is also referred to as cold standby.
On the other hand, there is a method for reducing the cost of introducing computer systems. According to the method, a single server is divided into a plurality of logical partitions, each of which functions as a server. For example, resources such as CPUs, memory, and I/O devices are divided into a plurality of portions and each portion is allocated to a logical partition, whereby each of the logical partitions functions as a server.
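As a rough illustration of the partitioning described above, the following sketch divides the resources of one physical server among logical partitions. The `Resources` type, the `partition` function, and the even-split policy are assumptions made for this example only; an actual partitioning mechanism may allocate resources in arbitrary proportions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resources:
    cpus: int
    memory_gb: int
    io_devices: List[str]


def partition(server: Resources, count: int) -> List[Resources]:
    """Split one server's resources evenly into `count` logical partitions.

    Assumption for this sketch: CPUs and memory are divided by integer
    division, and I/O devices are handed out round-robin.
    """
    if count <= 0 or server.cpus < count:
        raise ValueError("each logical partition needs at least one CPU")
    return [
        Resources(
            cpus=server.cpus // count,
            memory_gb=server.memory_gb // count,
            io_devices=server.io_devices[i::count],  # every count-th device
        )
        for i in range(count)
    ]
```

For example, a server with 8 CPUs, 64 GB of memory, and four I/O devices split into two partitions yields two logical servers with 4 CPUs, 32 GB, and two I/O devices each.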
The techniques described above can be combined to further reduce the cost of introducing a computer system capable of executing a plurality of applications.
In the case of the above-described cold standby, the backup server is started up only after an error is detected in an active server. Thus a time lag is unavoidably generated between starting-up of the backup server and restarting of the taken-over application job.
A backup server, after it is started up, requires several steps before restarting a taken-over application job. Concretely, when the server is started up, its hardware is initialized first. In the hardware initialization, I/O devices such as a network interface and a host bus adapter, which connect each server to external units and devices, are initialized, and then the memory is checked and initialized. After that, the CPU is initialized. The server then starts up its operating system. Finally, the server starts up the subject application program to restart the job.
In particular, the hardware initialization time increases in proportion to the number of I/O devices and the memory capacity. In recent years, the number of CPUs and the memory capacity of servers have been increasing. Therefore, the time required for checking and initializing those CPUs and that memory is also increasing.
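The dependence of the restart delay on the hardware configuration can be made concrete with a simple model. The sketch below assumes that each start-up phase contributes a time linear in the relevant resource count; the function names and all per-unit constants (`per_io`, `per_gb`, `per_cpu`, `os_boot`, `app_start`) are hypothetical placeholders chosen for illustration, not measured values.

```python
def hardware_init_seconds(io_devices: int, memory_gb: int, cpus: int,
                          per_io: float = 2.0, per_gb: float = 0.5,
                          per_cpu: float = 1.0) -> float:
    """Estimate hardware initialization time under a linear model.

    The per-unit costs are illustrative assumptions only.
    """
    io_time = io_devices * per_io    # network interfaces, host bus adapters
    mem_time = memory_gb * per_gb    # memory check and initialization
    cpu_time = cpus * per_cpu        # CPU initialization
    return io_time + mem_time + cpu_time


def restart_delay_seconds(io_devices: int, memory_gb: int, cpus: int,
                          os_boot: float = 60.0,
                          app_start: float = 30.0) -> float:
    """Time from backup start-up to job restart: hardware, then OS, then app."""
    return hardware_init_seconds(io_devices, memory_gb, cpus) + os_boot + app_start
```

Under these placeholder constants, doubling the memory from 64 GB to 128 GB on a server with 4 I/O devices and 8 CPUs raises the modelled hardware initialization time from 48 to 80 seconds, while the operating system and application start-up terms are unchanged.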
Under such circumstances, in fail-over processing such as cold standby, in which a backup server is started up to take over an application job, the problem to be solved is the increasing time required from error occurrence to restarting of the taken-over job.