In recent years, the use of electronic apparatuses has grown, and such apparatuses are increasingly used to perform complex, high-level processing. An information processing apparatus is one example of such an electronic apparatus. When an electronic apparatus is performing an important process, it is desirable to prevent the process from halting partway through. There is therefore a demand for highly available electronic apparatuses that can avoid faults, or recover quickly when a fault occurs. Some electronic apparatuses that require high availability use dual in-line memory modules (DIMMs).
A DIMM includes synchronous dynamic random access memory (SDRAM) integrated circuits (ICs). However, SDRAM ICs may have a high failure rate. If a DIMM is used without any data protection, the data stored in it may become corrupted, which may seriously affect the processes performed by the electronic apparatus. To address this, DIMMs with an error-correcting code (ECC) function, which detects and corrects errors, are used to enhance the availability of electronic apparatuses. Specifically, writing data into a DIMM together with an ECC makes it possible to correct errors. More specifically, when an SDRAM IC fails and data stored in the DIMM is corrupted, the error can be corrected as long as the failure is limited to one SDRAM IC, which allows operation to continue. However, if the apparatus keeps operating with error correction after one SDRAM IC has already failed, a further failure in another SDRAM IC results in unrecoverable corruption of the data in the DIMM.
In the ECC function, check bits are generated from the data according to a particular formula and are written, as the ECC, together with the data. When the data is read, the check bits are recalculated from the read data and compared with the stored check bits to produce a syndrome (symptom code). The syndrome indicates the position of an error; that is, it makes it possible to identify the location of a corrupted bit in the DIMM. In a DIMM having an ECC function, the error position can be identified if errors occur in only one SDRAM IC. If errors occur in two or more SDRAM ICs, their occurrence is detected, but their positions cannot be identified. Using more check bits may make it possible to identify error positions across multiple SDRAM ICs. However, a single DIMM does not have enough bit width to hold the additional check bits. Therefore, to handle errors in multiple SDRAM ICs, multiple DIMMs may be used together so that each ECC word covers more bits.
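The relationship between check bits, the syndrome (symptom code) and error location can be illustrated with a minimal sketch. It assumes an extended Hamming (SECDED) code over individual bits, which is a simplification: DIMM ECC schemes that tolerate the loss of a whole SDRAM IC use symbol-based codes instead. The function names `syndrome`, `encode` and `classify` are hypothetical. A single flipped bit is located by the syndrome, while two flipped bits are detected but cannot be located; this toy code guarantees detection only for up to two bit errors.

```python
from functools import reduce
from operator import xor


def syndrome(code):
    # XOR of the positions (1..n) of all set bits; 0 for a valid Hamming codeword.
    return reduce(xor, (pos for pos in range(1, len(code)) if code[pos]), 0)


def encode(data_bits):
    """Return an extended Hamming codeword for a list of 0/1 data bits.

    Index 0 holds the overall parity bit; power-of-two indices hold check bits.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # number of Hamming check bits needed
        r += 1
    code = [0] * (m + r + 1)
    bits = iter(data_bits)
    for pos in range(1, len(code)):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = next(bits)
    s = syndrome(code)
    for i in range(r):  # set check bits so the syndrome becomes 0
        if (s >> i) & 1:
            code[1 << i] = 1
    code[0] = sum(code) % 2  # overall parity makes the total bit count even
    return code


def classify(code):
    s = syndrome(code)
    odd = sum(code) % 2 == 1
    if s == 0 and not odd:
        return ("no error", None)
    if odd:
        # Odd number of flips: assume one error at position s (0 = parity bit).
        return ("single error", s)
    return ("multiple errors", None)  # detected, but position unknown


codeword = encode([1, 0, 1, 1, 0, 0, 1, 0])
one = codeword[:]
one[5] ^= 1
two = one[:]
two[9] ^= 1
print(classify(one))  # ('single error', 5)
print(classify(two))  # ('multiple errors', None)
```

For 64 data bits the loop yields seven check bits, which together with the overall parity bit gives the familiar 72-bit ECC word; locating errors across several SDRAM ICs would need more check bits than that width leaves room for.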
In one technique, an auxiliary SDRAM IC is provided on a DIMM. When an error is detected in an SDRAM IC, data is moved from the failed SDRAM IC into the auxiliary SDRAM IC, and operation continues using the auxiliary SDRAM IC. A DIMM using this technique can handle up to two errors, provided each error is confined to a single SDRAM IC. However, if an error occurs in another SDRAM IC while the auxiliary SDRAM IC is in use, the system may be left in a vulnerable state. Use of an auxiliary SDRAM IC is therefore only a temporary measure until the DIMM is replaced.
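The sparing step can be pictured as a remapping of one data lane onto the auxiliary SDRAM IC. The toy model below is a sketch under stated assumptions: the class `SparedDimm`, its lane map, and the list-of-words representation of each IC are hypothetical, and the ECC reconstruction of the failed IC's contents is left to the caller. Once the single spare is used, a further sparing request fails, mirroring the point that the spare is only a stopgap.

```python
class SparedDimm:
    """Toy model: `num_chips` data SDRAM ICs plus one auxiliary (spare) IC.

    Hypothetical structure for illustration; each IC is a list of words.
    """

    def __init__(self, num_chips: int, depth: int):
        self.chips = [[0] * depth for _ in range(num_chips + 1)]
        self.spare = num_chips                  # physical index of the auxiliary IC
        self.lane_map = list(range(num_chips))  # logical lane -> physical IC
        self.spare_in_use = False

    def write(self, lane: int, addr: int, value: int) -> None:
        self.chips[self.lane_map[lane]][addr] = value

    def read(self, lane: int, addr: int) -> int:
        return self.chips[self.lane_map[lane]][addr]

    def spare_out(self, lane: int, recovered: list) -> None:
        """Move a failing lane onto the auxiliary IC.

        `recovered` stands for the lane's contents as reconstructed by ECC
        correction, which this toy model does not implement.
        """
        if self.spare_in_use:
            # No second spare: the DIMM should be replaced.
            raise RuntimeError("auxiliary SDRAM IC already in use")
        self.chips[self.spare] = list(recovered)
        self.lane_map[lane] = self.spare
        self.spare_in_use = True


dimm = SparedDimm(num_chips=4, depth=8)
dimm.write(2, 3, 0xAB)
dimm.spare_out(2, dimm.chips[2])  # pretend ECC recovered IC 2's contents
print(dimm.read(2, 3), dimm.lane_map)  # 171 [0, 1, 4, 3]
```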
Furthermore, an auxiliary SDRAM IC increases the space a DIMM occupies, because room is needed for the additional SDRAM IC and its interconnections. This makes it difficult for recent compact electronic apparatuses to accommodate a DIMM that includes an auxiliary SDRAM IC. Japanese Laid-open Patent Publication No. 2010-102640 is known as an example of related art. To address this problem, the related art discloses a technique in which a DIMM is configured with two ranks, one for normal use and the other as an auxiliary rank, thereby achieving redundancy. Here, a “rank” is a set of memory components (SDRAM ICs) on a DIMM that are accessed together. A DIMM is used in units of ranks, and each rank is a unit of access to the DIMM; in a DIMM having a plurality of ranks, data may be read and written independently for each rank. In this conventional technique, for example, the normally used rank and the auxiliary rank are both initialized by filling them entirely with zeros, and a normal ECC is added. When an error occurs in the rank currently in use, data is moved into the auxiliary rank, and operation continues using the auxiliary rank.
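The rank-sparing arrangement described above can be sketched as two ranks, both zero-filled with a normal ECC, plus a switch that copies data into the auxiliary rank. This is a minimal model under stated assumptions: CRC-32 from Python's `zlib` stands in for the ECC check bits (it detects but does not correct errors), and the names `check_bits`, `Rank` and `RankSparedDimm` are hypothetical. A real controller would copy ECC-corrected data; this toy copies the stored words directly.

```python
import zlib

WORD_BYTES = 8


def check_bits(word: int) -> int:
    # Stand-in for ECC check bits: a CRC-32, not a real DIMM ECC.
    return zlib.crc32(word.to_bytes(WORD_BYTES, "little"))


class Rank:
    def __init__(self, size: int):
        # Initialize by filling with zeros and adding a normal ECC.
        self.cells = [(0, check_bits(0)) for _ in range(size)]

    def write(self, addr: int, word: int) -> None:
        self.cells[addr] = (word, check_bits(word))

    def read(self, addr: int) -> int:
        word, stored = self.cells[addr]
        if check_bits(word) != stored:
            raise ValueError(f"ECC error at address {addr}")
        return word


class RankSparedDimm:
    def __init__(self, size: int):
        self.normal = Rank(size)
        self.auxiliary = Rank(size)
        self.active = self.normal

    def write(self, addr: int, word: int) -> None:
        self.active.write(addr, word)

    def read(self, addr: int) -> int:
        return self.active.read(addr)

    def switch_to_auxiliary(self) -> None:
        # Copy every word into the auxiliary rank, then continue on it.
        for addr, (word, _) in enumerate(self.normal.cells):
            self.auxiliary.write(addr, word)
        self.active = self.auxiliary


dimm = RankSparedDimm(4)
dimm.write(1, 0xBEEF)
dimm.switch_to_auxiliary()
print(hex(dimm.read(1)))  # 0xbeef
```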
However, in the related art, if a failure occurs in the memory controller or a similar component, the ranks may be switched unexpectedly. In that case, data read from the auxiliary rank produces no ECC error and is treated as normal data, even though it is not the intended data. Similarly, if copying into the auxiliary rank fails so that the auxiliary rank still holds the data it contained before the copy, the ECC cannot detect the error when the auxiliary rank is used.
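Both problems stem from the auxiliary rank being initialized with zeros and a correctly generated ECC, so every word in it is already a valid codeword. The short sketch below illustrates this under stated assumptions: CRC-32 from `zlib` stands in for the ECC, each rank is a hypothetical list of `(word, check)` pairs, and the point at which the copy fails is chosen arbitrarily. In both cases the check passes even though the data returned is wrong.

```python
import zlib


def check_bits(word: int) -> int:
    # Stand-in for ECC check bits (CRC-32, chosen only for illustration).
    return zlib.crc32(word.to_bytes(8, "little"))


def ecc_ok(cell) -> bool:
    word, stored = cell
    return check_bits(word) == stored


# Normal rank holds live data; auxiliary rank was zero-filled with a normal ECC.
normal = [(w, check_bits(w)) for w in (0x1111, 0x2222, 0x3333)]
auxiliary = [(0, check_bits(0)) for _ in range(3)]

# Case 1: a controller fault unexpectedly redirects a read to the auxiliary rank.
misrouted = auxiliary[1]
print(ecc_ok(misrouted), hex(misrouted[0]))  # True 0x0: wrong data, no ECC error

# Case 2: the copy fails after two of three words (hypothetical failure point),
# so address 2 still holds its pre-copy contents.
for addr in range(2):
    auxiliary[addr] = normal[addr]
stale = auxiliary[2]
print(ecc_ok(stale), hex(stale[0]))  # True 0x0: stale data, no ECC error
```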