S/390 and IBM are registered trademarks of International Business Machines Corporation, Armonk, N.Y., U.S.A., and Lotus is a registered trademark of Lotus Development Corporation, an independent subsidiary of International Business Machines Corporation, Armonk, N.Y. Other names may be registered trademarks or product names of International Business Machines Corporation or other companies.
In a computer RAM memory system, data is stored in DRAMs mounted on memory cards. The DRAMs store data in the form of electrical charges in semiconductor arrays. In one typical system, data is stored on the DRAMs of the memory cards in the form of double words, each comprising 64 data bits and 8 error checking bits, making a total of 72 bits per double word. The error checking bits (ECC bits) indicate when an error exists in the 64 data bits and whether the error is a single bit error or a multiple bit error. A single bit error describes the condition when just one of the 64 data bits of a double word is in error. A multiple bit error describes the condition when two or more data bits in a double word are in error. A double bit error is the case of a multiple bit error in which exactly two bits in a double word are in error. If the ECC bits indicate that the error is a single bit error, the ECC bits will also indicate which data bit is in error and thus enable it to be corrected.
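The distinction between single bit and multiple bit errors can be illustrated with a minimal sketch. The text does not state which error correcting code is used, so the sketch assumes an extended Hamming code (seven Hamming check bits plus one overall parity bit), which is one common way to obtain 8 check bits over 64 data bits. The names PARITY_POS, DATA_POS, encode, classify and correct are hypothetical and chosen only for illustration.

```python
# Assumption: an extended Hamming (72,64) layout. The text does not name the actual ECC code.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # seven Hamming check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data bit positions


def encode(data):
    """Pack a 64-bit integer into a 72-bit double word (list of 0/1 values)."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for pp in PARITY_POS:
        # Each check bit makes the parity of its group of positions even.
        word[pp] = sum(word[p] for p in range(1, 72) if p & pp) % 2
    word[0] = sum(word) % 2  # eighth check bit: overall even parity
    return word


def classify(word):
    """Return ("none" | "single" | "multiple", position of the bad bit or None)."""
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return ("none", None)
    if odd_parity and syndrome < 72:
        # Odd overall parity: one bit flipped, and the syndrome names its position
        # (a syndrome of 0 means the overall parity bit itself).
        return ("single", syndrome)
    return ("multiple", None)


def correct(word):
    """Flip the bad bit in place when the error is a single bit error."""
    kind, pos = classify(word)
    if kind == "single":
        word[pos] ^= 1
    return kind
```

In this sketch a single flipped bit is located and repaired, whereas two flipped bits are only reported as a multiple bit error, which matches the behavior described for the ECC bits above.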
Scrubbing is an operation in which the ECC bits are used to detect and correct single bit errors. Sparing is an operation in which a DRAM on a memory card, determined to be defective, is logically replaced with a spare DRAM mounted on the memory card for that purpose. In U.S. Pat. No. 5,267,242, invented by Lavallee et al., issued Nov. 30, 1993, a scrubbing and sparing operation for a RAM memory system is disclosed. In this system, a hardware unit, called a hardware memory tester (HAMT), is used to perform the sparing and scrubbing operation.
In the system of the patent, a single spare DRAM is provided on each memory card and sparing is carried out in response to testing the DRAMs on the memory card by reading out data, restoring the data and then comparing the restored data with the original data read out from the memory. A counter is provided for each bit position in the data being read out from the memory card in parallel, and when an error is detected, the corresponding counter is incremented. If the count in a counter reaches a predetermined threshold, this indicates that the DRAM corresponding to the counter is defective, and the defective DRAM is logically replaced with the spare DRAM mounted on the memory card. Scrubbing is carried out by using the ECC bits in each double word to correct single bit errors. If multiple bit errors are detected, they are ignored.
In the system of the present invention, each memory card is provided with a plurality of spare DRAMs. In the specific embodiment of the invention, four spare DRAMs are provided for each memory card. During initial machine loading of the memory card, the memory is subjected to a self-test operation during which data is written to and read out from each memory location. The data read out from each memory location is compared with the data written to that memory location. When the comparison indicates an error, a counter corresponding to the bit position in which the error occurred is incremented. When the count in a counter exceeds a threshold level for a given chip row, the DRAM corresponding to that counter is considered to be defective and is logically replaced by a spare DRAM. The logical replacement of a DRAM is called sparing the replaced DRAM. When the self-test detects two errors in the same double word, the DRAM corresponding to the counter with the highest count in the chip row is spared whether or not the count in that counter has exceeded the threshold level. Because there is a strong probability that the DRAM corresponding to the counter registering the highest count contributes one of the bits in the multiple bit error, sparing that DRAM is likely to convert a double bit error into a correctable single bit error.
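The sparing decision made at the end of the self-test of a chip row can be sketched as follows. This is a simplified illustration, not the hardware logic of the embodiment: it assumes one counter per DRAM (the text keeps counters per bit position and does not give the mapping from bit positions to DRAMs), and the function name select_drams_to_spare and its parameters are hypothetical.

```python
def select_drams_to_spare(counts, threshold, double_error_seen, spares_available=4):
    """Return the indices of the DRAMs to spare in one chip row.

    counts            -- error count per DRAM (assumption: one counter per DRAM)
    threshold         -- a DRAM whose count exceeds this value is considered defective
    double_error_seen -- True if the self-test found two errors in one double word
    spares_available  -- number of spare DRAMs left (four per card in the embodiment)
    """
    # Rule 1: every DRAM whose counter exceeds the threshold is spared, worst first.
    chosen = [i for i, c in enumerate(counts) if c > threshold]
    chosen.sort(key=lambda i: counts[i], reverse=True)

    # Rule 2: after a double bit error, the DRAM with the highest count is spared
    # even if it is still below the threshold.
    if double_error_seen and counts and max(counts) > 0:
        worst = max(range(len(counts)), key=lambda i: counts[i])
        if worst not in chosen:
            chosen.append(worst)

    # Sparing can never use more spare DRAMs than remain available.
    return chosen[:spares_available]
```

For example, with counts of [0, 3, 1, 0] and a threshold of 5, no DRAM is spared on the threshold rule alone, but DRAM 1 is spared once a double bit error has been seen in the chip row.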
Scrubbing is carried out periodically at fixed intervals of time and is executed in the background of normal system operation. During scrubbing, whenever the ECC bits of a double word indicate that a single bit error has occurred, the error is corrected and the counter corresponding to the bit position of the error is incremented. At the end of the scrubbing of a chip row, the counters are examined, and if any of the counters exceeds the threshold, the corresponding DRAM is spared. The scrubbing operation of the chip row is then repeated in a mode of operation called half spare mode. If, during the first scrubbing cycle, a multiple bit error is detected, the data in the double word containing the multiple bit error is left unchanged. The sparing carried out at the end of the first scrubbing cycle through the chip row may convert a double bit error to a correctable single bit error. If, during the scrubbing cycle through the chip row in the half spare mode, a multiple bit error is detected, a special uncorrectable error pattern ("SPUE tag") is appended to the double word containing the multiple bit error to prevent data from being stored to or read out from the address position of that double word. In addition, the second scrubbing cycle through the chip row moves corrected data from the defective DRAMs to the spare DRAMs that replace them in the sparing operation. Following the half spare mode, the chip row is scrubbed again in what is called the full spare mode, and again any single bit errors detected are corrected. In the full spare mode, any multiple bit errors are ignored.
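The handling of errors in the three scrubbing modes can be summarized in a short sketch. The names Mode, scrub_action and passes_for_chip_row, and the action labels returned, are hypothetical. The sketch also assumes that the error counters are incremented only in the first (normal) cycle and that the half spare and full spare cycles are run only when a DRAM was spared; the text does not state either point explicitly.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"          # first scrubbing cycle through the chip row
    HALF_SPARE = "half spare"  # second cycle, corrected data moves to the spare DRAMs
    FULL_SPARE = "full spare"  # third cycle, after the move is complete


def scrub_action(mode, error_kind):
    """Map a scrubbing mode and an ECC result to the action taken on a double word.

    error_kind is "none", "single" or "multiple".
    """
    if error_kind == "none":
        return "no_action"
    if error_kind == "single":
        # Single bit errors are corrected in every mode; counting in the first
        # cycle only is an assumption of this sketch.
        return "correct_and_count" if mode is Mode.NORMAL else "correct"
    # Multiple bit error.
    if mode is Mode.HALF_SPARE:
        return "tag_spue"       # mark the double word so it cannot be stored to or fetched
    return "leave_unchanged"    # left as-is in the normal and full spare modes


def passes_for_chip_row(counts, threshold):
    """Return the sequence of scrubbing modes applied to one chip row."""
    if any(c > threshold for c in counts):
        return [Mode.NORMAL, Mode.HALF_SPARE, Mode.FULL_SPARE]
    return [Mode.NORMAL]
```

Only the half spare cycle tags a double word as uncorrectable, because by then the sparing at the end of the first cycle has had the chance to turn a double bit error into a correctable single bit error.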
In the system as described above, single bit errors are corrected by scrubbing, double bit errors are converted to single bit errors and the storage locations corresponding to double bit errors which cannot be corrected are effectively removed from the system and prevented from being accessed in conventional store and fetch operations.