It is known practice to have two processors execute the same instructions in lockstep mode and to determine whether an error has occurred by comparing their output data. The two processors may be operated in clock synchronism or with a certain temporal offset, which is compensated for during the comparison. Both permanent errors, caused for example by a defect introduced during production, and transient errors, caused for example by temporary electromagnetic interference, may occur. If a lockstep error occurs, that is, if the output data from the two processors differ from one another, program execution is interrupted and, in the simplest case, the computer system is deactivated.
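The comparison with a compensated temporal offset can be illustrated by a minimal sketch. It assumes that the second processor runs a whole number of clock cycles behind the first and that the outputs are simple comparable values; the class name `LockstepComparator` and its methods are hypothetical and are not taken from any of the cited documents.

```python
from collections import deque


class LockstepComparator:
    """Compares the outputs of two cores; core B lags core A by `offset` cycles.

    Hypothetical illustration only: a real comparator is a hardware unit.
    """

    def __init__(self, offset=0):
        self.offset = offset        # assumed fixed lag of core B, in cycles
        self.delay_line = deque()   # buffers core A outputs until B catches up
        self.error = False          # latched lockstep error flag

    def step(self, out_a, out_b):
        """Process one clock cycle; returns False once a lockstep error is latched."""
        self.delay_line.append(out_a)
        if len(self.delay_line) <= self.offset:
            # Core B has not yet produced the matching output.
            return not self.error
        expected = self.delay_line.popleft()
        if expected != out_b:
            self.error = True       # outputs differ: lockstep error
        return not self.error
```

With `offset=0` the sketch corresponds to clock-synchronous operation. The error flag is latched, reflecting the simplest case described above in which the system does not resume after a lockstep error.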
However, providing error tolerance, that is, allowing the computer system to continue executing the desired program after an error occurs, is a particular challenge for processors with double redundancy. Attempts have been made to support error tolerance in safety platforms with just two redundant processors. U.S. Pat. No. 5,915,082 B2 discloses a system architecture in which internal buses are provided with parity bits which are compared. After a parity error has been detected on one side, the associated processor is disconnected, with the result that it no longer has any influence on the system. The system is switched off after every lockstep error which occurs without a parity error. This parity-based procedure does not adequately cover the cases in which the availability of a redundant system is very desirable even after a lockstep error has occurred. The parity check can also lead to an incorrect decision, for example if the two internal redundant units simultaneously show different multi-bit errors.
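The weakness of a parity-based decision can be made concrete with a short sketch. It assumes a single even-parity bit per bus word and a simplified decision rule; the function names and the bit patterns are hypothetical and do not reproduce the architecture of the cited patent.

```python
def parity(word: int) -> int:
    """Even-parity bit of a word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def decide(word_a: int, par_a: int, word_b: int, par_b: int) -> str:
    """Simplified (assumed) decision rule applied after a bus comparison."""
    if word_a == word_b:
        return "ok"
    a_ok = parity(word_a) == par_a
    b_ok = parity(word_b) == par_b
    if a_ok and not b_ok:
        return "disconnect_b"
    if b_ok and not a_ok:
        return "disconnect_a"
    return "shutdown"  # lockstep error without a usable parity indication


# Hypothetical 8-bit bus word and its parity bit.
good = 0b10110010
p = parity(good)

# Side A suffers a 2-bit error (parity unchanged), side B a 3-bit error.
faulty_a = good ^ 0b00000011
faulty_b = good ^ 0b01110000

# Parity blames only side B, although side A is corrupted as well.
decision = decide(faulty_a, p, faulty_b, p)
```

In this example `decision` is `"disconnect_b"`, so the system would continue with side A even though its data are also wrong. If both sides show different errors with an even number of flipped bits, neither parity check fails and the rule falls back to `"shutdown"`.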
Further known error-tolerant system architectures comprise at least three processor cores with a shared or jointly used memory. In this case, the lockstep mode of the processors is always checked by monitoring bus signals. The lockstep mode is also referred to below as synchronous execution of a program or program parts by the processors.
If the active processor fails, the ownership of the memory area and components which are driven by the active processor via input/output channels passes over to another processor. In the lockstep error state (synchronization error) which follows a lockstep error, data access and control processes are removed from the active processor and maintained by another processor.
The classic minimum configuration for an error-tolerant system, which comprises triple redundancy (TMR: triple modular redundancy) of processors and a jointly used memory, is still an expensive solution for many safety architectures whose safety concept is based on the use of two redundant processors running in lockstep or synchronously. With only double redundancy, however, error tolerance remains a particular challenge.
U.S. Pat. No. 7,366,948 B2 and US Patent Application Publication 2006/0107106 describe a method for improving availability in a system composed of a plurality of processor pairs running in the lockstep mode. Two redundant processors are combined in each pair and their outputs are continuously compared. If a lockstep error (LOL: loss of lockstep) or another processor-internal error is detected in a processor pair, the operating system changes the defective processor pair to the quiescent state and activates another processor pair, which assumes the driving of the system as a boot processor pair. In the meantime, the processor pair with an error attempts to recover synchronization and to make itself available as a standby processor pair. This ensures a high level of availability of the system. However, four processors, divided into two pairs whose output signals are compared in pairs, must always be used for a single task. Since one processor pair is not used when there are no errors, the method provides a poor cost/performance ratio and is too expensive for many embedded systems.
EP 1 380 953 B1 defines an error-tolerant computer system with lockstep synchronism, said system containing a multiplicity of computation modules with a processor and a memory, and describes a method for resynchronizing said system. Since each computation module synchronously processes the same instruction string, this computer system is not very efficient.
EP 1 456 720 B1 discloses a computer group for safety-critical applications in motor vehicles comprising two or more control computer systems each comprising two control computers which operate in clock synchronism and have partially or fully redundant peripheral elements and partially or fully redundant memory elements integrated on a chip. The control computers of a control computer system which operate in clock synchronism are connected to an arbitration unit which monitors them for errors and can couple a communication controller assigned to the control computer system to a vehicle data bus or can decouple said controller. If one of the control computers malfunctions, the corresponding control computer system is partially or completely deactivated.
DE 10 2009 000 045 A1 discloses an apparatus for operating a control device which has a computer system comprising two pairs of two execution units each and is used, in particular, in a motor vehicle. The execution units in each pair execute the same program, and the output signals from each execution unit are compared with one another by a respective comparison unit and an error signal is output if a discrepancy occurs. If the error signal occurs for a first pair of execution units, this pair is switched off and the computer system continues to be operated using the second pair of execution units, an advance warning signal being output to the driver.
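The behavior of such a two-pair arrangement can be sketched as follows. This is a simplified illustration under the assumption that each pair reports its two outputs once per comparison step; the class `PairSupervisor`, the pair names and the methods are hypothetical and are not taken from the cited document.

```python
from dataclasses import dataclass, field


@dataclass
class PairSupervisor:
    """Tracks two execution-unit pairs and an advance warning flag (hypothetical)."""

    enabled: dict = field(default_factory=lambda: {"pair1": True, "pair2": True})
    warning: bool = False

    def report(self, pair: str, out_x, out_y) -> None:
        """Compare the two outputs of one pair; switch the pair off on a mismatch."""
        if not self.enabled[pair]:
            return  # a pair that is already switched off is ignored
        if out_x != out_y:
            self.enabled[pair] = False  # error signal: switch this pair off
            self.warning = True         # advance warning signal to the driver

    def operational(self) -> bool:
        """The system keeps running as long as at least one pair is enabled."""
        return any(self.enabled.values())
```

After a mismatch in the first pair, for example `report("pair1", 1, 2)`, the system remains operational on the second pair with the warning set. The sketch also shows the cost of this approach: all four execution units are occupied although only one pair is needed when there are no errors.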
Systems and methods in accordance with the previously mentioned documents have the disadvantage that a high degree of redundancy must be made available since, when there are no errors, at least one processor pair is inactive or executes the same program as the active processor pair which drives peripheral units. Therefore, each individual processor must provide the entire computation power required, as a result of which the known computer systems do not operate in a very efficient manner. This is undesirable from the point of view of costs, in particular in the case of systems produced in large quantities.
The method described in U.S. Pat. No. 7,366,948 B2 is a very expensive solution for embedded systems. In addition, components other than the processor cores cannot always be implemented in a redundant manner. Cost typically plays an important role when designing safety architectures for safety-relevant systems, for example brake applications in the automotive sector. Program memories, for example flash memories, are therefore not redundant, but rather are shared by all existing processors. Conventional methods for ensuring availability in safety architectures based on redundant processors do not take this boundary condition of non-redundant components into account. Another problem in ensuring the availability of processors in safety architectures is that a processor which has previously failed can be started up again only after a safety check has been successfully concluded.
Against this background, there is a need for a safety architecture which has just two redundant processors and which enables a high level of availability of the system. There is also a need for a safety architecture which has three or more processors, for example two processors with two cores each, and which enables a high level of availability of the system.