Modern companies have already implemented a large number of services, communication links, monitoring tasks and the like using digital computers. By way of example, the ordering of goods over the Internet is increasingly displacing mail ordering, which was customary until recently.
Such an order process involves the customer using his Internet-connected computer to connect to a server at the providing company in order to use the order software available there for his order. During the order process, the customer does not notice how many different computers are simultaneously or successively handling his order; as long as no fault occurs during the order process, the customer sees the order situation as though he were communicating with just one computer as his “contact”.
If a step in the order process fails, however, then the customer frequently notices this because he needs to reenter information which has already been entered, since information is lost as a result of a fault in any one of the computers in the order system. Order systems of this kind, which can be used over the Internet, are known and are used every day by millions of users.
A drawback of such systems is that, even though they normally include a plurality of computers, failure of one of these computers results in failure of the entire computer system or at least in the loss of a subfunction, and hence in the loss of information and processing time. The reason for this drawback is that such computer arrangements (clusters) essentially serve to distribute the demands placed on the computer system over a plurality of computers (distribution of load), in order to increase the processing speed and the number of operations processed simultaneously. Because the desired distribution of load means that the demands to be processed are not routed to a plurality of computers simultaneously, and because the computers in such an arrangement are not synchronized, failure of one computer in the arrangement inevitably results at least in the loss of a subfunction and/or in the loss of information.
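The effect described above can be illustrated with a minimal sketch. It assumes a simple round-robin distribution and nodes that hold accepted requests only in their own memory; the names `Node`, `distribute` and the order identifiers are hypothetical and are not taken from any cited system.

```python
from itertools import cycle


class Node:
    """Hypothetical cluster member that keeps accepted requests in local memory only."""

    def __init__(self, name):
        self.name = name
        self.pending = []  # requests accepted but not yet processed

    def fail(self):
        """Simulate a crash: everything held only by this node is lost."""
        lost = self.pending
        self.pending = []
        return lost


def distribute(requests, nodes):
    """Round-robin load distribution: each request is routed to exactly one node."""
    targets = cycle(nodes)
    for request in requests:
        next(targets).pending.append(request)


nodes = [Node("a"), Node("b"), Node("c")]
distribute(["order-1", "order-2", "order-3", "order-4", "order-5", "order-6"], nodes)

# Node "b" fails; no other node ever received a copy of its requests.
lost = nodes[1].fail()
surviving = [request for node in nodes for request in node.pending]
print("lost:", lost)
print("surviving:", surviving)
```

Because no request is routed to more than one node, the requests held by the failed node cannot be recovered from the remaining nodes.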
A computer arrangement containing a plurality of servers is specified in EP 0 942 363 A2, for example. In this case, incoming request data are divided into service classes which are then each processed by a particular number of servers. If a particular service cannot be processed because the computing capacity currently available to it is inadequate, then servers are detached from other service classes which still have computing resources available and are allocated to the requested service.
The European laid-open specification thus describes a computer cluster in which the load of the request data is distributed over the servers. If there is a resource bottleneck for a service, a server from another service which still has free computation capacity steps in.
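The reallocation of servers between service classes can be sketched as follows. This is only an illustrative model under stated assumptions: the class `Cluster`, its methods and the rule of borrowing from the class with the most free servers are hypothetical and are not taken from EP 0 942 363 A2.

```python
class Cluster:
    """Hypothetical model: each service class owns a number of servers."""

    def __init__(self, servers_per_class):
        self.assigned = dict(servers_per_class)           # class -> servers allocated
        self.busy = {name: 0 for name in servers_per_class}  # class -> servers in use

    def free(self, service):
        """Number of idle servers currently allocated to this service class."""
        return self.assigned[service] - self.busy[service]

    def handle(self, service):
        """Occupy one server for a request; return the donor class, or None.

        Assumption: when the class has no idle server, one is detached from
        the other class with the most idle servers.
        """
        donor = None
        if self.free(service) == 0:
            others = (s for s in self.assigned if s != service)
            donor = max(others, key=self.free, default=None)
            if donor is None or self.free(donor) == 0:
                raise RuntimeError("no spare capacity in any service class")
            self.assigned[donor] -= 1
            self.assigned[service] += 1
        self.busy[service] += 1
        return donor


cluster = Cluster({"orders": 2, "search": 3})
donors = [cluster.handle("orders") for _ in range(3)]
print(donors)            # the third request borrows a server from "search"
print(cluster.assigned)
```

The sketch only balances capacity; like the arrangement it models, it contains no provision for preserving request data when a server fails.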
One drawback in this context is that no solution is provided for the fault scenario. Thus, although failure of a server does not entail the loss of the affected service as a whole, there is no assurance that the request data transferred to the computer cluster will be preserved in the fault scenario and can be processed further with as few interruptions as possible.
Such computer arrangements are therefore not suitable for critical applications in which data loss and/or processing delay must not occur, so as to avoid any risk to humans and the environment. It is therefore not possible to use such arrangements as, by way of example, a monitoring system in nuclear power plants, as a protection system for dangerous processes (for example electrical or chemical processes), or as a control system for time-critical procedures.
DE 198 14 096 A1 describes a method for changing over redundantly connected assemblies of the same type. Of these assemblies, one acts as a master assembly which serves an automation process. A second assembly of the same type is in “slave mode” (reserve), so that it can take over the function of the master assembly in the event of a fault in the latter.
These assemblies of the same type are synchronously provided with the same request data by a superordinate device. In the event of a fault in the master assembly, the assembly in slave mode is activated directly, bypassing the superordinate device, in order to take over the functionality of the master assembly. This ensures a rapid changeover from the faulty assembly to an operational assembly in the event of a fault.
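The master/reserve principle described above can be sketched roughly as follows. This is a simplified model, not the method of DE 198 14 096 A1: the classes `Assembly` and `RedundantPair`, the `healthy` flag and the running-sum placeholder function are hypothetical, and the changeover is placed in a small wrapper object for brevity, whereas the cited document has the reserve assembly activated directly.

```python
class Assembly:
    """Hypothetical assembly; its 'automation function' is a running sum."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.received = []

    def feed(self, request):
        # A faulty assembly no longer accepts request data.
        if self.healthy:
            self.received.append(request)

    def output(self):
        return sum(self.received)


class RedundantPair:
    """Feeds both assemblies the same data and uses only the master's output."""

    def __init__(self, master, reserve):
        self.master, self.reserve = master, reserve

    def feed(self, request):
        # Both assemblies are supplied with the same request data.
        self.master.feed(request)
        self.reserve.feed(request)

    def output(self):
        if not self.master.healthy:
            # Changeover: the reserve assembly takes over the master role.
            self.master, self.reserve = self.reserve, self.master
        return self.master.output()


pair = RedundantPair(Assembly("A"), Assembly("B"))
pair.feed(1)
pair.feed(2)
pair.master.healthy = False   # fault in the master assembly
pair.feed(3)
result = pair.output()
print(pair.master.name, result)
```

In this model the reserve assembly delivers a correct result only because it has received every request from the start; the sketch says nothing about requests that are in flight at the moment of the fault.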
However, the document does not disclose how, in the event of a fault, it is ensured that no request data are lost and that the assembly taking over the function delivers correct output data.
Another drawback of this prior-art method is that the assemblies need to be of the same type. This prevents the use of different assemblies having the same function, which results in high costs when implementing such a redundant arrangement. By way of example, it would be possible to provide the main computer (master) as a very powerful computer and the reserve computer (slave) as a somewhat less powerful one. Normally, the powerful computer would perform the function of the computer arrangement, and a slight loss of computation power would arise only in the event of a fault, when the reserve computer takes over the functionality. Such a computer arrangement, which is more cost-effective than the cited prior art, cannot be operated in a fault-tolerant manner with the method described, however.
WO 98/44416 describes a fault-tolerant computer system. This includes, by way of example, four or more CPUs which operate in clock synchronism. Incoming data are processed in clock synchronism by all the CPUs simultaneously. The CPUs transmit their computation results to an evaluation unit which ascertains the validity of these results and outputs a valid result.
In this system, the fault tolerance is implemented virtually exclusively in hardware. Thus, the units (CPUs) which are entirely similar to one another process the same input data absolutely simultaneously (clock synchronously) and deliver an associated result. Failure of one unit thus does not result in failure of the entire system.
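The role of the evaluation unit can be illustrated with a short sketch. The cited document is described here only as ascertaining the validity of the results; the strict majority vote used below is an assumption made for illustration, and the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority of units.

    Assumption: a result counts as valid if more than half of the units
    agree on it. Results must be hashable. Raises ValueError otherwise.
    """
    if not results:
        raise ValueError("no results to evaluate")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among unit results")
    return value


# Four units process the same input; one delivers a deviating result.
valid = vote([42, 42, 41, 42])
print(valid)
```

With four units, a single faulty unit is outvoted, whereas a two-against-two split is rejected rather than resolved arbitrarily.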
A drawback in this context is that such clock-synchronous solutions are very costly, since clock-synchronous operation makes great demands on the hardware used, which additionally needs to be of entirely the same type throughout; tolerances are virtually impermissible in this context. In addition, synchronizing the units used is very complex, since the parallel-connected units must never drift even one clock cycle apart when processing the request data. Moreover, it is not possible to use hardware of different types in order to implement the redundancy based on this prior art.
Other prior-art examples of such redundant systems, which implement the redundancy exclusively in hardware, are the “H systems” (high-availability systems) in the SIMATIC automation family from Siemens (e.g. S5-155H; S7-400H). In each case, two entirely identical, special central processing units are used, each of which processes the same request data clock-synchronously in parallel. Synchronizing the central processing units and monitoring them for failure is very complex; in addition, the procurement costs are very high.