So-called transient errors may occur when a computer program runs on computing hardware. Because the structures on semiconductor modules (so-called chips) are becoming progressively smaller, while the clock rates of the signals are becoming progressively higher and the signal voltages progressively lower, transient errors are occurring with increasing frequency. In contrast with permanent errors, transient errors occur only temporarily and usually disappear spontaneously after a period of time. In the case of transient errors, only individual bits are faulty and there is no permanent damage to the computing hardware. Transient errors may have various causes, such as electromagnetic influences, alpha particles, or neutrons.
In communications systems, the emphasis in error handling is already on transient errors. Conventionally, when an error is detected in a communications system (e.g., in a controller area network, CAN), the erroneously transmitted data are resent. Furthermore, an error counter is conventionally used in communications systems; the counter is incremented on detection of an error, decremented on a correct transmission, and prevents transmission of data as soon as it exceeds a certain value.
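The counter-and-resend behavior described above can be illustrated with a minimal sketch. The names `ErrorCounter`, `send_with_retry`, `limit`, and `max_attempts` are hypothetical, and the unit step size is an assumption made for illustration; real protocols such as CAN define their own step sizes and limits.

```python
from typing import Callable


class ErrorCounter:
    """Hypothetical error counter: up on an error, down on a correct transmission."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # assumed threshold above which sending is prevented
        self.count = 0

    def record(self, ok: bool) -> None:
        if ok:
            # Correct transmission: decrement, but never below zero.
            self.count = max(0, self.count - 1)
        else:
            # Detected error: increment.
            self.count += 1

    @property
    def transmission_blocked(self) -> bool:
        # Transmission is prevented as soon as the counter exceeds the limit.
        return self.count > self.limit


def send_with_retry(
    frame: bytes,
    transmit: Callable[[bytes], bool],
    counter: ErrorCounter,
    max_attempts: int = 3,
) -> bool:
    """Resend an erroneously transmitted frame until it succeeds,
    the attempts run out, or the counter blocks further transmission."""
    for _ in range(max_attempts):
        if counter.transmission_blocked:
            return False
        ok = transmit(frame)  # assumed to return True on a correct transmission
        counter.record(ok)
        if ok:
            return True
    return False
```

In this sketch a single failed transmission followed by a successful resend leaves the counter at zero again, whereas a persistently failing sender is eventually blocked.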
In the case of computing hardware for running computer programs, however, error handling is generally performed only for permanent errors. Taking transient errors into account is limited to incrementing and, if necessary, decrementing an error counter. This counter reading is stored in a memory and may be read out off-line, i.e., as diagnostic or error information during a visit to a repair shop, e.g., in the case of computing hardware designed as a vehicle control unit. Only then is it possible to respond appropriately to the error.
Error handling via error counters thus has two shortcomings. First, it does not allow error handling within a short error tolerance time, which is necessary in particular for safety-relevant systems. Second, it does not allow constructive error handling in the sense that the computer program is run again properly within the error tolerance time. Instead, in the related art, the computer program is switched to emergency operation after the error counter exceeds a certain value. This means that a different part of the computer program is run instead of the part containing the error, and the substitute values determined in this way are used for further computation. The substitute values may be modeled on the basis of other quantities, for example. Alternatively, the results calculated using the part of the computer program containing the error may be discarded as defective and replaced for further calculation by standard values provided for emergency operation. The conventional methods for handling a transient error of a computer program running on computing hardware thus do not allow any systematic, constructive handling that takes the transient nature of most errors into account.
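The conventional switch to emergency operation may be pictured roughly as follows. This is only a sketch under stated assumptions: `EmergencyFallback`, `DetectedError`, `limit`, `default`, and `model` are hypothetical names, the exception merely stands in for whatever error detection mechanism is present, and returning a substitute value for an erroneous cycle that occurs before the limit is exceeded is an assumption not taken from the description above.

```python
from typing import Callable, Optional


class DetectedError(Exception):
    """Hypothetical stand-in for an error reported by some detection mechanism."""


class EmergencyFallback:
    def __init__(self, limit: int, default: float) -> None:
        self.limit = limit        # assumed counter value that triggers emergency operation
        self.default = default    # standard value provided for emergency operation
        self.errors = 0
        self.emergency = False

    def compute(
        self,
        primary: Callable[[], float],
        model: Optional[Callable[[], float]] = None,
    ) -> float:
        if not self.emergency:
            try:
                value = primary()
                self.errors = max(0, self.errors - 1)
                return value
            except DetectedError:
                # The defective result is discarded and the counter incremented.
                self.errors += 1
                if self.errors > self.limit:
                    self.emergency = True
        # Substitute value: modeled from other quantities if a model is
        # supplied, otherwise the standard emergency value.
        return model() if model is not None else self.default
```

Note that once `emergency` is set in this sketch, the primary part is never run again, even if the underlying error was transient and has long since disappeared. This mirrors the limitation of the conventional approach described above.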
Also conventionally, transient errors occurring while a computer program runs on computing hardware are eliminated by completely restarting the computing hardware. This approach is likewise unsatisfactory, because quantities obtained in the processing of the computer program up to that point are lost and the computing hardware is unable to fulfill its intended function for the duration of the restart. This is unacceptable, in particular in the case of safety-relevant systems.
Finally, for handling transient errors of a computer program running on computing hardware, the computer program may conventionally be set back by a few clock pulses and individual machine instructions of the computer program may be repeated. This method is also known as micro-rollback. With this conventional method, the system rolls back only in units defined at the machine level (clock pulses, machine instructions). This requires appropriate hardware support at the machine level, which entails considerable complexity in the computing hardware. The conventional method cannot be executed exclusively under software control.
The conventional error handling mechanisms are unable to respond in a suitable manner to transient errors occurring in running a computer program on computing hardware.