So-called transient errors may occur when a computer program runs on computing hardware. Because the structures on semiconductor modules (so-called chips) are becoming progressively smaller while the clock rates of the signals are becoming progressively higher and the signal voltages progressively lower, transient errors are occurring with increasing frequency. In contrast with permanent errors, transient errors occur only temporarily and usually disappear spontaneously after a period of time. In transient errors, only individual bits are faulty and there is no permanent damage to the computing hardware. Transient errors may have various causes, such as electromagnetic influences, alpha particles, or neutrons.
Error handling in communications systems already focuses on transient errors. It is known that when an error is detected in a communications system (e.g., in a controller area network, CAN), the erroneously transmitted data are resent. Furthermore, the use of an error counter is known in communications systems; the counter is incremented on detection of an error, decremented on a correct transmission, and prevents transmission of data as soon as it exceeds a certain value.
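The counter behavior described above can be sketched as follows. This is a minimal illustration in Python, not the mechanism of any particular bus standard; the class name `TransmissionErrorCounter`, its method names, and the limit of 2 are assumptions chosen for the example.

```python
class TransmissionErrorCounter:
    """Hypothetical error counter that blocks sending above a limit."""

    def __init__(self, limit):
        self.limit = limit      # assumed threshold, not taken from a standard
        self.count = 0

    def record(self, transmission_ok):
        """Decrement on a correct transmission, increment on an error."""
        if transmission_ok:
            self.count = max(0, self.count - 1)   # never drop below zero
        else:
            self.count += 1

    def may_transmit(self):
        """Transmission is prevented once the count exceeds the limit."""
        return self.count <= self.limit


counter = TransmissionErrorCounter(limit=2)
for ok in (False, False, False):        # three detected errors in a row
    counter.record(ok)
blocked = not counter.may_transmit()    # count is 3, which exceeds 2
```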
In the case of computing hardware for running computer programs, however, error handling is performed essentially only for permanent errors. Taking transient errors into account is limited to incrementing and, if necessary, decrementing an error counter. This counter reading is stored in a memory and may be read out off-line, i.e., as diagnostic or error information during a visit to a repair shop, e.g., in the case of computing hardware designed as a vehicle control unit. Only then is it possible to respond appropriately to the error.
Error handling via error counters thus does not allow error handling within a short error tolerance time, which is necessary in particular for safety-relevant systems, nor does it allow constructive error handling in the sense that the computer program is running properly again within the error tolerance time. Instead, in the related art, the computer program is switched to emergency operation once the error counter exceeds a certain value. This means that a different part of the computer program is run instead of the part containing the error, and the substitute values determined in this way are used for further computation. The substitute values may be modeled on the basis of other quantities, for example. Alternatively, the results calculated using the part of the computer program containing the error may be discarded as defective and replaced, for further calculation, by standard values provided for emergency operation. The known methods for handling a transient error of a computer program running on computing hardware thus do not allow any systematic, constructive handling that takes the transient nature of most errors into account.
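The emergency-operation behavior of the related art described above might look like the following sketch. The function name `run_cycle`, the callables, and the numeric values are hypothetical; the sketch only shows that the result of the affected program part is replaced rather than recomputed.

```python
def run_cycle(normal_part, substitute_part, error_count, limit):
    """Return (value, mode) for one computation cycle.

    Once the error counter exceeds the limit, the normal program part is
    no longer run; a substitute part supplies the value instead.
    """
    if error_count > limit:
        return substitute_part(), "emergency"
    return normal_part(), "normal"


# Hypothetical example: the substitute part returns a fixed standard value.
measured = lambda: 87.5     # stands in for the normal computation
standard = lambda: 90.0     # standard value provided for emergency operation

value_ok, mode_ok = run_cycle(measured, standard, error_count=1, limit=3)
value_em, mode_em = run_cycle(measured, standard, error_count=4, limit=3)
```

Nothing in this scheme attempts to recover the correct result, which is the shortcoming the text points out.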
It is also known from the related art that transient errors occurring in running a computer program on computing hardware may be eliminated by completely restarting the computing hardware. This approach is likewise unsatisfactory, because quantities obtained in processing the computer program up to that point are lost and the computing hardware is unable to fulfill its intended function for the duration of the restart. This is unacceptable in the case of safety-relevant systems in particular.
Finally, it is also known that, for handling transient errors of a computer program run on computing hardware, the computer program may be set back by a few clock pulses and individual machine instructions of the computer program may be repeated. This method is also known as micro-rollback. With the known method, the rollback is performed only in machine-level units (clock pulses, machine instructions). This requires appropriate hardware support at the machine level, which entails considerable complexity in the computing hardware. The known method cannot be executed exclusively under software control.
The error handling mechanisms known from the related art are unable to respond in a suitable manner to transient errors occurring in running a computer program on computing hardware.
However, transient errors will be especially frequent in future technologies. If they are detected, e.g., via dual-core mechanisms, the error must still be localized in order to identify the correct result. This is all the more true if the goal is that a transient error should not always result in restarting the computer. As described, error localization can typically be achieved only via comparatively complex methods.
The object of the present invention is to provide a constructive means of handling transient errors in running a computer program on computing hardware in such a way that the full functionality and functional reliability of the computer system are restored within the shortest possible error tolerance time.
To achieve this object, starting from a method of the type mentioned at the outset, when an error is detected, at least one program object that has already been sent for execution is set to a defined state and started up again from this state.
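The idea of setting a program object to a defined state and starting it again can be illustrated in software as follows. This is a simplified sketch under stated assumptions, not the claimed implementation: the function `run_with_restart`, the `error_detected` callback (standing in for whatever detection mechanism is used, for example a dual-core comparison), the `max_restarts` bound, and the simulated bit flip are all hypothetical.

```python
import copy


def run_with_restart(program_object, defined_state, error_detected, max_restarts=1):
    """Run a program object; on a detected error, restart it from the defined state.

    Returns (result, number_of_restarts). All names here are illustrative.
    """
    for restarts in range(max_restarts + 1):
        # Each run starts from a fresh copy, so a faulty run cannot
        # corrupt the defined state used for the restart.
        state = copy.deepcopy(defined_state)
        result = program_object(state)
        if not error_detected(result):
            return result, restarts
    raise RuntimeError("error persisted after restart; likely not transient")


def make_task_with_transient_fault():
    """Build a task whose first run suffers a simulated single-bit flip."""
    runs = {"count": 0}

    def task(state):
        runs["count"] += 1
        total = sum(state["inputs"])
        if runs["count"] == 1:
            total ^= 0x4          # simulated transient fault, first run only
        return total

    return task


defined_state = {"inputs": [1, 2, 3]}
result, restarts = run_with_restart(
    make_task_with_transient_fault(),
    defined_state,
    error_detected=lambda r: r != 6,   # assumed detector for this example
)
```

In this example the first run is disturbed, the error is detected, and the second run from the defined state delivers the undisturbed sum after one restart.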
On a system level, the question nevertheless remains of how to implement such a concept of task repetition sensibly. As a rule, an erroneous task cannot simply be re-computed, since the additional computing time required, and the point in time at which it would be used, are already planned for other purposes from the system viewpoint. If the workload of the processor is already close to 100% (and this is generally the case), such an unscheduled additional load (which a task repetition represents) generates a system overload, which may typically result in a crash. This is even more pronounced when time-controlled systems are considered (which, as is becoming apparent, will prevail at least to some extent). A deadline violation is no more tolerable in these systems than in most other hard real-time concepts.
From the system viewpoint, the consequence arises that the additional load, which may result from a potential task repetition, must be scheduled. If the computing time needed for a task repetition is reserved after each task, then this may certainly work; however, 100% additional performance must be paid for compared to a system which does not handle errors. This is unacceptable from the cost point of view.
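The cost of reserving a repetition slot after every task can be made concrete with a small calculation. The task execution times below are invented for illustration, and the function names are hypothetical.

```python
def reserved_time_per_cycle(execution_times_ms, reserve_repetition):
    """Total time that must be scheduled for one pass over all tasks."""
    total = 0.0
    for t in execution_times_ms:
        total += t                  # the task itself
        if reserve_repetition:
            total += t              # reserved slot for one possible repetition
    return total


def overhead_percent(execution_times_ms):
    """Extra scheduled time caused by reserving a repetition after each task."""
    base = reserved_time_per_cycle(execution_times_ms, reserve_repetition=False)
    with_reserve = reserved_time_per_cycle(execution_times_ms, reserve_repetition=True)
    return 100.0 * (with_reserve - base) / base


tasks_ms = [2.0, 3.0, 5.0]              # assumed execution times
overhead = overhead_percent(tasks_ms)   # 100.0, i.e. the scheduled time doubles
```

Whatever the individual execution times are, reserving one repetition per task always doubles the scheduled time, which is the 100% additional performance mentioned above.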
Furthermore, it is the object of the present invention to provide an optimum system strategy, which does not always schedule the double computation of a task (thus generating a permanent and very large overhead), and which at the same time solves the issue of how to combine that with time-controlled approaches.