Checkpoint/restart has been disclosed as a technique for improving the reliability of computer systems (see, for example, “A Survey of Checkpoint/Restart Implementations”, retrieved online Aug. 24, 2010, <URL:https://ftg.lbl.gov/CheckpointRestart/Pubs/checkpointSurvey-020724b.pdf>). The technique periodically backs up the state information of individual applications or of the entire system so that, if a failure occurs, the system is restored to its state at the time of the backup and execution resumes from that point. As used herein, state information includes the contents of memory and processor register information.
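By way of illustration only, the periodic backup and restoration described above may be sketched as follows. The sketch is not taken from the cited survey; the names `Checkpointer`, `run`, `checkpoint_interval`, and `fail_at` are hypothetical, and the state information is modeled, as an assumption, by a dictionary holding "memory" and "registers" entries.

```python
import copy


class Checkpointer:
    """Holds the most recent backup of the state information (hypothetical helper)."""

    def __init__(self):
        self._saved = None

    def checkpoint(self, state):
        # Deep-copy so that later changes to the live state do not alter the backup.
        self._saved = copy.deepcopy(state)

    def restart(self):
        if self._saved is None:
            raise RuntimeError("no checkpoint available")
        # Return a copy so the backup itself stays intact for later failures.
        return copy.deepcopy(self._saved)


def run(total_steps, checkpoint_interval, fail_at=None):
    """Sum 0..total_steps-1, optionally injecting a single failure at step fail_at.

    Returns (result, number_of_restarts).
    """
    # Assumed model of state information: memory contents plus a program counter.
    state = {"memory": {"acc": 0}, "registers": {"pc": 0}}
    cp = Checkpointer()
    cp.checkpoint(state)  # initial backup before any work is done
    restarts = 0
    while state["registers"]["pc"] < total_steps:
        pc = state["registers"]["pc"]
        try:
            if fail_at is not None and pc == fail_at:
                fail_at = None                 # the simulated failure occurs only once
                state["memory"]["acc"] = -1    # the failure corrupts the live state
                raise RuntimeError("simulated failure")
            state["memory"]["acc"] += pc
            state["registers"]["pc"] = pc + 1
            # Periodic backup every checkpoint_interval completed steps.
            if state["registers"]["pc"] % checkpoint_interval == 0:
                cp.checkpoint(state)
        except RuntimeError:
            # Restore the state at the time of the backup and resume from there.
            state = cp.restart()
            restarts += 1
    return state["memory"]["acc"], restarts
```

In this sketch, a failure injected at step 7 with a checkpoint interval of 3 causes execution to resume from the backup taken after step 5, so only steps 6 onward are re-executed rather than the whole calculation.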
The technique disclosed in “A Survey of Checkpoint/Restart Implementations” was devised for large-scale computers, to eliminate the need to re-execute a process from the start if a failure occurs during a high-level simulation that takes several days of calculation. In that setting, a single application program runs at a time, and checkpoint/restart is used to achieve high reliability for each application. In recent years, embedded systems have also come to perform processes requiring high reliability, such as automobile control. To prevent a prolonged loss of control upon the occurrence of a failure, such a system is configured to return to the process promptly after the failure by using the technique of “A Survey of Checkpoint/Restart Implementations”.
When the technique of “A Survey of Checkpoint/Restart Implementations” is applied to an embedded system, multiple applications work cooperatively in the embedded system, so checkpoint/restart has to be set up for every one of the applications, which decreases development efficiency. In addition, since an embedded system has fewer CPUs and less memory than a large-scale computer, when the technique of “A Survey of Checkpoint/Restart Implementations” is applied to an embedded system, checkpoint/restart is executed for the entire embedded system by the OS, etc.
As a technique to cope with the occurrence of a failure in a multi-core processor system having multiple CPUs, a technique has been disclosed in which thread execution information is saved to memory so that, if a failure occurs at a CPU, another CPU is substituted for that CPU to execute the process (see, for example, Japanese Laid-Open Patent Publication No. 2006-139621). Another technique has been disclosed in which the states of processes under execution are collectively monitored by a monitoring device (see, for example, Japanese Laid-Open Patent Publication No. 2008-310632).
Applying the techniques disclosed in “A Survey of Checkpoint/Restart Implementations” and Japanese Laid-Open Patent Publication No. 2008-310632 to a restoration process executed upon the occurrence of a failure in a multi-core processor system yields a technique in which a specific CPU performs a process of saving the state information of the entire multi-core processor system (hereinafter, “Prior Art 1”). Applying Prior Art 1 enables a multi-core processor system to restore its state from the saved state information when a failure occurs.
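As a rough illustration of Prior Art 1, and not an implementation drawn from any of the cited publications, the following sketch models a multi-core processor system in which one designated CPU saves the state information of all CPUs. The class `MultiCoreSystem`, its methods, and the `saver_cpu` parameter are hypothetical names, and the per-CPU state is assumed to be a dictionary of registers and memory.

```python
import copy


class MultiCoreSystem:
    """Toy model: one specific CPU saves the state of the entire system."""

    def __init__(self, num_cpus, saver_cpu=0):
        if not 0 <= saver_cpu < num_cpus:
            raise ValueError("saver_cpu must identify an existing CPU")
        self.saver_cpu = saver_cpu
        # Assumed per-CPU state information: register values and memory contents.
        self.cpu_state = [
            {"registers": {"pc": 0}, "memory": {}} for _ in range(num_cpus)
        ]
        self._snapshot = None

    def save_all(self, requesting_cpu):
        # Only the designated CPU performs the saving process.
        if requesting_cpu != self.saver_cpu:
            raise PermissionError("only the designated CPU saves system state")
        self._snapshot = copy.deepcopy(self.cpu_state)

    def restore_all(self):
        # Upon a failure, the entire system returns to the saved state.
        if self._snapshot is None:
            raise RuntimeError("no saved state information")
        self.cpu_state = copy.deepcopy(self._snapshot)
```

Because the snapshot covers every CPU, restoring it rolls back all CPUs together, including those at which no failure occurred.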