An example of a memory failure recovery method in an information processing apparatus is described in JP-2000-132462A (hereinafter called “Document 1”). In the following, the memory failure recovery method described in Document 1 will be described with reference to FIGS. 1 and 2.
Referring to FIG. 1, the information processing apparatus described in Document 1 comprises CPU 101, main storage device 102, error detection device (ECC) 103, service processor 104, storage device (ROM) 105 for the service processor, auxiliary storage device 109, and data bus 108 which interconnects these components.
In the information processing apparatus having such a configuration, when an error occurs in a program area of main storage device 102 while the system is operating, the process shown in FIG. 2 is executed. First, error detection device 103 detects the occurrence of the memory error and checks the contents of the memory error (S101). A circuit called ECC (Error Check and Correct) is used in error detection device 103. An ECC is an error detection and correction circuit which can detect that an erroneous value is recorded in a memory and correct the erroneous value to the correct value. While a normal ECC is capable of correcting only a one-bit error, some other types of ECC are capable of correcting errors of two or more bits. In Document 1, a normal ECC (capable of correcting only a one-bit error) is used. When a one-bit error is detected (NO at S102), the error is corrected by the ECC (S103), and the process terminates.
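The distinction drawn at S102 between a correctable one-bit error and an error of two or more bits can be illustrated with a minimal sketch. Document 1 does not specify which code its ECC uses; the sketch below assumes an extended Hamming code over 4 data bits (single-error correction, double-error detection), and the names encode and decode are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8                      # index 0 holds the overall parity bit
    # Data bits occupy positions 3, 5, 6, 7; parity bits occupy 1, 2, 4.
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for pos in range(1, 8):             # overall parity over positions 1..7
        code[0] ^= code[pos]
    return code


def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(codeword)               # work on a copy
    syndrome = 0
    overall = 0
    for pos in range(8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
            overall ^= 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treated as a one-bit error (cf. S103).
        # syndrome == 0 here means the overall parity bit itself was flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two-bit error (cf. S104).
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In this sketch, a "corrected" result corresponds to the path ending at S103, while an "uncorrectable" result corresponds to the case in which the recovery process from S104 onward would be required.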
When the error extends over two or more bits, error detection device 103 issues interrupt signals 106, 107 to CPU 101 and service processor 104 to temporarily halt CPU 101 (S104) and to request service processor 104 to execute a recovery process. Service processor 104 acquires the data at the address at which the memory error has occurred and the data at the immediately preceding address from a backup file which resides within auxiliary storage device 109 (S105), and writes these data into a recovery area on main storage device 102 (S106). Next, service processor 104 writes, at the address immediately before the address at which the memory error has occurred, a branch instruction that branches to the data written into the recovery area, and writes, immediately after the data written into the recovery area, a branch instruction that branches to the address following the address at which the memory error has occurred in main storage device 102 (S107). Then, service processor 104 changes the value of the program counter to the address immediately before the address at which the memory error has occurred (the address at which the branch instruction has been written) (S108), and releases the temporary halt of CPU 101 (S109).
When CPU 101 resumes execution, it executes the branch instruction that has been written at the address immediately before the address at which the memory error has occurred. CPU 101 thus executes the program using the backup data written into the recovery area, instead of the data at the address at which the memory error has occurred and the data at the preceding address in main storage device 102, and subsequently branches to the address immediately after the address at which the memory error has occurred in main storage device 102. In this way, the program can continue to be executed even if main storage device 102 suffers a memory error which is difficult to correct.
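The recovery flow of S105 to S109 can be sketched as a small simulation. This is only an illustration under stated assumptions and not the implementation of Document 1: the toy instruction set ("add", "jmp", "halt"), the functions run, recover and demo, and the concrete addresses are all hypothetical, and the CPU is assumed to have been halted before the instruction at the address preceding the error has executed. The sketch also suggests why the data at the preceding address is restored as well: that address is overwritten by the branch instruction, so its original content has to be executed from the recovery area.

```python
def run(memory, pc, acc, stop_at=None):
    """Execute toy instructions from pc until 'halt' or until pc == stop_at."""
    while pc != stop_at:
        op, arg = memory[pc]
        if op == "halt":
            break
        if op == "jmp":
            pc = arg
        elif op == "add":
            acc += arg
            pc += 1
        else:
            raise ValueError(f"corrupted instruction at address {pc}")
    return pc, acc


def recover(memory, backup, err_addr, recovery_base):
    """Patch memory around err_addr and return the new program counter value."""
    # S105: fetch the data at the failed address and the preceding address.
    saved = [backup[err_addr - 1], backup[err_addr]]
    # S106: write both into the recovery area.
    memory[recovery_base] = saved[0]
    memory[recovery_base + 1] = saved[1]
    # S107: branch into the recovery area, then back past the failed address.
    memory[err_addr - 1] = ("jmp", recovery_base)
    memory[recovery_base + 2] = ("jmp", err_addr + 1)
    # S108: the program counter is set to the address holding the first branch.
    return err_addr - 1


def demo():
    backup = [("add", 1), ("add", 2), ("add", 4), ("add", 8), ("halt", 0)]
    memory = list(backup) + [None] * 3          # addresses 5..7: recovery area
    err_addr, recovery_base = 2, 5
    memory[err_addr] = ("???", 0)               # uncorrectable multi-bit error
    # S104: CPU is halted before the instruction at err_addr - 1 executes.
    pc, acc = run(memory, 0, 0, stop_at=err_addr - 1)
    pc = recover(memory, backup, err_addr, recovery_base)
    # S109: the halt is released and execution resumes.
    _, acc = run(memory, pc, acc)
    _, expected = run(list(backup), 0, 0)       # result of the intact program
    return acc, expected
```

In this sketch, demo() returns the same accumulator value for the patched memory as for the intact program, and the corrupted address is never fetched.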