There are many different types of error that make it necessary to restart a machine or a system of computers. Since these errors have proven to be persistent, a restart represents the only opportunity of rectifying them. This type of system reset usually requires the use of manual commands. If account is taken of the time needed to start a system manually, in particular using interactive inputs, it becomes clear that this is not viable for systems that demand high availability.
There have been efforts to automate the restart of a system. U.S. Pat. No. 5,708,776, for example, shows a procedure for automatically restoring the state before the error for network applications. This involves providing a first and a second reboot partition. If booting from the first partition is not successful, the system is booted from the second partition. A monitoring processor executes the software for automatic recovery after an error in the operating system software or application software has been discovered. This document, however, contains no information about starting up a cluster system after a reboot. Starting up a cluster system means taking account of significantly more, and more complex, interdependencies between the individual nodes, and this is generally controlled by a cluster controller.
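The two-partition fallback described above can be illustrated with a minimal sketch. It is not taken from the cited patent: the function name `boot_with_fallback`, the partition labels and the `try_boot` callback are hypothetical and stand in for whatever mechanism actually attempts the boot.

```python
from typing import Callable, Optional, Sequence


def boot_with_fallback(
    partitions: Sequence[str],
    try_boot: Callable[[str], bool],
) -> Optional[str]:
    """Try each reboot partition in order and return the first one that boots.

    `try_boot` is a hypothetical callback that returns True on a successful
    boot. None is returned if no partition boots successfully.
    """
    for partition in partitions:
        if try_boot(partition):
            return partition
    return None


# Example: the first partition is assumed to be damaged, the second is intact.
booted = boot_with_fallback(["first", "second"], lambda p: p == "second")
```

In this example `booted` is `"second"`, because the attempt on the first partition fails and the second partition is tried next.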
For systems that require maximum availability, such as carrier-grade systems in the telecommunications area or systems in healthcare and the financial sector, high-availability computer architectures have been introduced which are designed to provide maximum fault tolerance. The tests to be executed for this should be able to run around the clock and without interruption.
Cluster systems in particular are used for this purpose. This term covers different types of systems in which a number of autonomous machines are each networked with redundant resources and whose use is controlled by a cluster controller.
A distinction is made between active-passive and active-active cluster architectures. With active-passive clusters, virtual pairs of machines or servers are formed, in each of which one server is active and offers the relevant service or executes the relevant software. As long as no errors occur, the other server operates in standby mode and takes over as quickly as possible if an error occurs.
With active-active clusters, each server within the cluster takes over one task and both operate actively in parallel. Depending on the system configuration, the intact server takes over all the tasks of the defective server if an error occurs. A better load distribution can be achieved with the active-active concept than with the active-passive architecture.
Regardless of the relevant architecture, in cluster systems a server that is still operable takes over the jobs of the defective server if an error occurs. This process is referred to as fail-over.
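The fail-over behaviour of both architectures can be sketched as follows. This is only an illustration under stated assumptions: the function `fail_over`, the server names and the rule of handing each task to the least-loaded surviving server are hypothetical and are not prescribed by the text. In an active-passive pair the standby server simply starts with an empty task list.

```python
from typing import Dict, List


def fail_over(assignments: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new task assignment in which the surviving servers have taken
    over all tasks of the failed server.

    Assumption: each task goes to the surviving server that currently has
    the fewest tasks.
    """
    survivors = {s: list(tasks) for s, tasks in assignments.items() if s != failed}
    if not survivors:
        raise RuntimeError("no operable server left to take over")
    for task in assignments.get(failed, []):
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(task)
    return survivors


# Active-passive pair: server "b" is on standby and takes over the service.
active_passive = fail_over({"a": ["service"], "b": []}, failed="a")

# Active-active pair: server "a" keeps its own task and adds the task of "b".
active_active = fail_over({"a": ["task1"], "b": ["task2"]}, failed="b")
```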
As well as the computer hardware, the external memory system must also be adapted to the cluster system for high-availability systems. For example, data can be stored redundantly on distributed storage to increase the security of the system. What is known as the RAID-1 System (Redundant Array of Inexpensive Disks) employs a redundancy concept that is based on the mirroring of data sets.
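The mirroring concept behind RAID-1 can be illustrated with a small in-memory sketch. The class `MirroredStore` and its methods are hypothetical; real RAID-1 operates at the level of block devices, whereas dictionaries merely stand in for disks here.

```python
from typing import Dict, List, Optional


class MirroredStore:
    """Toy model of mirroring: every write goes to all intact disks."""

    def __init__(self, copies: int = 2) -> None:
        if copies < 2:
            raise ValueError("mirroring needs at least two disks")
        # None marks a disk that has failed.
        self.disks: List[Optional[Dict[int, bytes]]] = [{} for _ in range(copies)]

    def write(self, block: int, data: bytes) -> None:
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def fail_disk(self, index: int) -> None:
        self.disks[index] = None

    def read(self, block: int) -> bytes:
        # Any intact mirror that holds the block can serve the read.
        for disk in self.disks:
            if disk is not None and block in disk:
                return disk[block]
        raise IOError(f"block {block} is not available on any intact disk")


store = MirroredStore()
store.write(0, b"record")
store.fail_disk(0)           # simulate the loss of one disk
recovered = store.read(0)    # the mirror still holds the data
```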
An important factor for all cluster systems is that they are based on “intelligent” control, coordination and communication between the individual cluster processors. For example, it must be defined which transfer protocols are used, how the individual processes to be distributed communicate with each other, or by which criteria a fail-over is controlled. Another important point is that the integrity of the cluster is maintained. It must thus be guaranteed that consistent data records are present at all nodes even after a reboot of the system.
If an error occurs in a cluster system which, although it can be rectified, is so serious that a node has to be rebooted, it was previously necessary after the reboot of the node to start up the cluster manually by entering commands.
JP 14 87 4 A2 shows, in this context, a method for maintaining operation of the cluster when an error in a memory area of the cluster has been detected. In this case, a system controller that is set up on each node reports the error and directs this error message to a central point so that errors in nodes can be prevented from leading to downtime of the cluster. However, no information is given about how a cluster can be restarted automatically after a reboot caused by a wide variety of errors. Here too, a manual startup of the cluster after a reboot is necessary.
This manual process is, however, not viable for high-availability clusters because of the increased downtimes that it causes.