Errors may occur when a computer program is run on a computer system. These errors may be classified according to whether they are caused by the hardware (processor, bus systems, peripheral equipment, etc.) or by the software (application programs, operating systems, BIOS, etc.).
A further distinction is made between permanent errors and transient errors. Permanent errors are always present and are caused, for example, by defective hardware or defectively programmed software. Transient errors, in contrast, occur only temporarily and are much more difficult to reproduce and predict. In data stored, transmitted, and/or processed in binary form, transient errors occur, for example, when individual bits are altered by electromagnetic effects or radiation (α-radiation, neutron radiation).
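The effect of a single altered bit may be illustrated with a minimal sketch. The parity check used here is only an assumed detection mechanism chosen for illustration and is not taken from the description above; the names `parity` and `flip_bit` are hypothetical.

```python
def parity(word: int) -> int:
    """Return 1 if the number of set bits in a non-negative integer is odd, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Simulate a transient error by inverting the bit at the given position."""
    return word ^ (1 << position)


# A data word is stored together with its parity bit (assumed scheme).
stored = 0b10110010
stored_parity = parity(stored)

# A transient error alters one individual bit of the stored word.
corrupted = flip_bit(stored, 5)

# The parity of the corrupted word no longer matches the stored parity bit.
error_detected = parity(corrupted) != stored_parity
```

A single parity bit reveals only an odd number of altered bits and cannot locate the altered bit, so this sketch shows detection, not correction.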
A computer program is usually subdivided into multiple run-time objects that are executed sequentially or in parallel on the computer system. Run-time objects include, for example, processes, tasks, or threads. Errors occurring during execution of the computer program may thus be assigned in principle to the run-time object being executed.
Permanent errors are typically handled by shutting down the computer system or at least subsystems thereof. However, this has the disadvantage that the functionality of the computer system or the subsystem is then no longer available. To nevertheless ensure reliable operation, in particular in a safety-relevant environment, the subsystems of a computer system are designed to be redundant, for example.
Transient errors are frequently also handled by shutting down subsystems. It is also known to shut down and restart one or more subsystems when a transient error occurs and then to check, for example by performing a self-test, whether the computer program is now running error-free. If no new error is detected, the subsystem resumes its work. In this case, the task interrupted by the error and/or the run-time object being processed at that time may be abandoned instead of being executed further (forward recovery). Forward recovery is used in real-time-capable systems, for example.
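Forward recovery as described above may be sketched as follows. This is a simplified illustration under stated assumptions, not an implementation from any particular system: the restart is reduced to a self-test call, and `TransientError`, `self_test`, `make_task`, and `run_with_forward_recovery` are hypothetical names.

```python
class TransientError(Exception):
    """Hypothetical signal that a transient error interrupted a task."""


def self_test() -> bool:
    """Placeholder self-test performed after the restart; assumed to pass."""
    return True


def make_task(name, fail=False):
    """Build a task that returns its name, or raises a transient error if fail is set."""
    def task():
        if fail:
            raise TransientError(name)
        return name
    return task


def run_with_forward_recovery(tasks):
    """Run tasks in order; an interrupted task is abandoned, not executed further."""
    completed, skipped = [], []
    for index, task in enumerate(tasks):
        try:
            completed.append(task())
        except TransientError:
            # The subsystem would be shut down and restarted here, then self-tested.
            if not self_test():
                raise RuntimeError("self-test failed; subsystem remains shut down")
            # Forward recovery: continue with the next task.
            skipped.append(index)
    return completed, skipped


tasks = [make_task("read_sensor"), make_task("compute", fail=True), make_task("actuate")]
completed, skipped = run_with_forward_recovery(tasks)
```

In this sketch the result of the interrupted task is lost, which is acceptable only where a fresh result from the next cycle is more useful than a late one.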
With non-real-time-capable applications in particular, it is known to set checkpoints at preselectable locations in a computer program and/or run-time object. If a transient error occurs and the subsystem is consequently restarted, the task is resumed at the most recently processed checkpoint. Such a method is known as backward recovery and is used, for example, in computer systems that perform transactions in financial markets.
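Backward recovery with checkpoints may be sketched in a similar way. The example is a minimal illustration under stated assumptions: the state is a running balance over a list of amounts, a checkpoint is taken after every `checkpoint_every` items, and a single transient error is injected at index `fail_at`. All names are hypothetical.

```python
import copy


class TransientError(Exception):
    """Hypothetical signal that a transient error interrupted processing."""


def process_with_checkpoints(amounts, checkpoint_every, fail_at=None):
    """Sum the amounts, resuming from the last checkpoint after a transient error."""
    state = {"next": 0, "balance": 0}
    checkpoint = copy.deepcopy(state)
    pending_fault = fail_at  # the injected error fires at most once
    restarts = 0
    while state["next"] < len(amounts):
        try:
            i = state["next"]
            if i == pending_fault:
                pending_fault = None
                state["balance"] = -999  # simulate corrupted working state
                raise TransientError(i)
            state["balance"] += amounts[i]
            state["next"] = i + 1
            if state["next"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)  # save a consistent state
        except TransientError:
            # Restart: discard the working state and resume at the last checkpoint.
            state = copy.deepcopy(checkpoint)
            restarts += 1
    return state["balance"], restarts


balance, restarts = process_with_checkpoints([10, 20, 30, 40, 50], 2, fail_at=3)
```

Work done since the last checkpoint is repeated after the restart, so no result is lost, but the run is delayed by the repeated steps.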
The known methods for handling transient errors have the disadvantage that the entire computer system, or at least individual subsystems, are temporarily unavailable, which may result in data loss and delays in running the computer program.
Therefore, an object of the present invention is to handle an error occurring while a computer program is running on a computer system as flexibly as possible and thereby ensure the highest possible availability of the computer system.