The present invention generally relates to memory systems, and more particularly to a memory with self-healing capability.
The size of available server memory systems is constantly increasing with time, with current server memory systems often being in the range of 64 Gbytes or larger. As the size of memory systems increases, and memory cell sizes get smaller, the probability of a memory bit failing, and thus the memory system failing, increases. Memory system failures can be the result of both temporary and permanent errors. Temporary errors are typically due to alpha particles. Permanent errors are typically the result of a memory cell wall failure, or to a much smaller degree a row or column decoder failure, a state machine failure or other catastrophic failures such as the failure of the mechanical interface between the dual in-line memory module (DIMM) printed circuit board and the system printed circuit board.
A problem with prior art memory configurations is related to the I/O pin count on the memory I/O controller. With increasingly demanding memory storage requirements, the number of memory modules connected to the memory controller is increasing. In a serial architecture configuration, increasing the number of memory channels by one increases the number of memory channels directly connected to the memory controller by one. This is problematic, since each additional memory channel requires additional input pins, increasing the system pin count, and increasing the probability of failure. This is especially problematic in memory storage systems having a large memory capacity. In addition, the additional memory bus loading from additional modules limits the memory system bus speed.
In order to prevent memory system failures, different forms of memory error detection and correction processes have evolved. One commonly used system involves the use of parity bits to detect errors. When data is received, the parity of the data is checked against an expected value. When the data does not match the expected parity value (odd or even), an error is determined to have occurred. Although this method detects single bit errors, it does not reliably detect multiple bit errors, because an even number of flipped bits leaves the parity unchanged. Further, the simplest parity systems have no mechanism for correcting data errors.
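The parity check described above can be illustrated with a minimal Python sketch. The function names and the 8-bit example word are assumptions chosen for illustration and do not come from any particular memory controller.

```python
def even_parity_bit(data: int) -> int:
    """Return the bit that makes the total number of 1s even."""
    return bin(data).count("1") % 2


def parity_ok(data: int, parity: int) -> bool:
    """True when the received data agrees with the stored even-parity bit."""
    return even_parity_bit(data) == parity


word = 0b10110010                 # example data word (hypothetical)
stored_parity = even_parity_bit(word)

single_error = word ^ 0b00000100  # one bit flipped
double_error = word ^ 0b00000110  # two bits flipped

print(parity_ok(word, stored_parity))          # True
print(parity_ok(single_error, stored_parity))  # False: error detected
print(parity_ok(double_error, stored_parity))  # True: error goes unnoticed
```

The last line shows the limitation: two flipped bits cancel each other in the parity sum, so the corrupted word passes the check.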
One commonly used error detection and correction process uses error correcting or error checking and correction codes (ECC). ECC is typically based on CRC (cyclic redundancy checksum or cyclic redundancy code) algorithms. ECC codes can be used to restore the original data when the number of corrupted bits does not exceed the correction capability of the code. With CRC algorithms, when data is received, the complete data sequence (which includes CRC bits appended to the end of the data field) is read by a CRC checker. The complete data sequence should be exactly divisible by the CRC polynomial. If the complete data sequence is not divisible by the CRC polynomial, an error is deemed to have occurred.
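The divisibility test can be sketched in Python as follows. The polynomial 0x107 (x^8 + x^2 + x + 1), the 8-bit CRC width, and the function names are assumptions made for this example only; the text does not specify a polynomial. Division is carried out with XOR (carry-less) arithmetic, as is usual for CRC.

```python
POLY = 0x107      # x^8 + x^2 + x + 1, assumed for illustration
CRC_BITS = 8      # degree of the polynomial


def mod2_remainder(value: int, poly: int = POLY) -> int:
    """Remainder of value divided by poly using XOR (carry-less) division."""
    poly_len = poly.bit_length()
    while value.bit_length() >= poly_len:
        shift = value.bit_length() - poly_len
        value ^= poly << shift    # cancel the current leading bit
    return value


def append_crc(data: int) -> int:
    """Append CRC bits to the end of the data field."""
    shifted = data << CRC_BITS
    return shifted | mod2_remainder(shifted)


def check(sequence: int) -> bool:
    """True when the complete sequence is exactly divisible by the polynomial."""
    return mod2_remainder(sequence) == 0


codeword = append_crc(0xC0FFEE)
print(check(codeword))             # True
print(check(codeword ^ (1 << 5)))  # False: a flipped bit breaks divisibility
```

Note that this sketch only detects errors. Correcting them requires a code with additional structure, which is not shown here.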
Unlike conventional error correction processes based on parity, systems based on ECC codes can typically be used to detect multiple bit errors. For example, an ECC memory system that has single bit correction typically can detect double bit errors and correct single bit errors. An ECC memory with 4 or 8 bit error correction can typically detect and correct 4 bit or 8 bit errors, respectively. Therefore, the failure of an entire Synchronous Dynamic Random Access Memory (SDRAM) chip organized in a ×4 or ×8 configuration will not cause the system to fail. Although ECC systems easily provide multiple bit error detection, a problem with conventional ECC systems is that they typically cause the system to halt when they report an uncorrectable error. Thus, a failed part in an ECC memory system cannot be replaced to restore the system failure immunity without first halting.
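The single bit correction and double bit detection behavior can be illustrated with a small Python sketch. It uses an extended Hamming code over a 4-bit data word, which is one common way to obtain this behavior and is not necessarily the code contemplated above. The bit layout and function names are assumptions for the example.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 Hamming."""
    code = [0] * 8
    # Data bits occupy the non-power-of-two positions 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):            # check bits sit at power-of-two positions
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code) % 2        # overall parity enables double-error detection
    return code


def decode(code: list):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of set positions points at a single error
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:             # odd number of flips: treat as a single error
        code[syndrome] ^= 1        # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:                          # even number of flips with nonzero syndrome
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word); one_flip[6] ^= 1
two_flips = list(one_flip); two_flips[2] ^= 1

print(decode(word))       # (11, 'ok')
print(decode(one_flip))   # (11, 'corrected')
print(decode(two_flips))  # (None, 'uncorrectable')
```

The third case corresponds to the situation in which a conventional ECC system reports an uncorrectable error.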
ECC and parity processes are commonly used in computer systems that rely on semiconductor memory where 24 hour a day, 7 day a week operation is required. Where computer systems do not require the speed required by semiconductor memory or alternatively where larger storage cannot be provided cost effectively by semiconductor memory, disk drive memory may be used. Computer systems supported by disk drive memory typically link together a plurality of disk drives through hardware to form a drive array known as a redundant array of inexpensive disks (RAID). The drives in the array are coordinated with each other and data is specially allocated between them. Because disk drives and the mechanical interfaces for disk drives are less reliable than semiconductor memory, the processes for data recovery due to permanent or temporary data failure for disk drive systems typically provide more redundancies and more complex data recovery methodologies.
Traditionally in a RAID system, data is split between the drives at the bit or byte level. For example, in a four drive system, two bits of every byte might come from the first hard disk, while the next two bits come from the second hard disk, and so on. The four drives then output a single byte data stream four times faster than a serial drive implementation, because transferring all of the information in a byte takes only as long as required for a single drive to transfer two bits. This technique of splitting data between several drives is referred to as data striping, and the actual block size per drive can be as high as 1000 bytes. The RAID memory implementation offers improved reliability and greater resistance to errors than can be achieved by operating each disk independently. The increased reliability and fault tolerance are achieved through various redundancy measures, including mirroring and parity implementations.
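The four drive example can be sketched in Python as follows. Drives are modeled as plain lists, and the helper names are hypothetical. The XOR parity column at the end is one simple form of the parity redundancy mentioned above, and is shown only as an assumption about how a lost drive might be rebuilt.

```python
from typing import List


def stripe(data: bytes) -> List[List[int]]:
    """Split each byte into four 2-bit pieces, one piece per drive."""
    drives: List[List[int]] = [[] for _ in range(4)]
    for byte in data:
        for drive, shift in zip(drives, (6, 4, 2, 0)):
            drive.append((byte >> shift) & 0b11)
    return drives


def unstripe(drives: List[List[int]]) -> bytes:
    """Reassemble the byte stream by reading one piece from every drive."""
    out = bytearray()
    for pieces in zip(*drives):
        byte = 0
        for piece in pieces:
            byte = (byte << 2) | piece
        out.append(byte)
    return bytes(out)


def xor_columns(columns: List[List[int]]) -> List[int]:
    """XOR the pieces stored at the same offset on every given drive."""
    result = []
    for pieces in zip(*columns):
        acc = 0
        for piece in pieces:
            acc ^= piece
        result.append(acc)
    return result


drives = stripe(b"RAID")
parity = xor_columns(drives)                 # contents of a parity drive

lost = 2                                     # pretend drive 2 has failed
survivors = [d for i, d in enumerate(drives) if i != lost]
rebuilt = xor_columns(survivors + [parity])  # XOR of the rest recovers it

print(unstripe(drives))                      # b'RAID'
print(rebuilt == drives[lost])               # True
```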
The parity and ECC techniques discussed above are unable to correct “hard” or permanent errors in memories. In previous memory repair schemes, repairs have typically been done off-line with laser fuses. The repairs typically occur in DRAM manufacturing, and not after the memory is sent to the end customer.
It would be desirable to provide an on-line self-healing memory, without the disadvantages found in existing memory schemes, and without the requirement to add additional memory modules for spare memory.
The present invention provides a self-healing memory device responsive to command signals. The memory device includes multiple banks of memory arrays. Each bank includes a plurality of primary storage cells and a spare unit of spare storage cells. A detector detects an error in a first unit of the primary storage cells in a first one of the banks. A controller responsive to command signals automatically re-maps the first unit of the primary storage cells to the spare unit of storage cells.
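The re-mapping idea can be illustrated with a small software model in Python. This is only a sketch under stated assumptions: the `Bank` class, its method names, the use of a row as the re-mapped unit, and the way corrected data is supplied to the spare are all hypothetical, and the device described here would implement the equivalent behavior in hardware.

```python
class Bank:
    """Toy model of one memory bank with primary rows and spare rows."""

    def __init__(self, rows: int, spares: int = 1):
        self.primary = [0] * rows
        self.spare = [0] * spares
        self.remap = {}                      # primary row -> spare row index
        self.free_spares = list(range(spares))

    def read(self, row: int) -> int:
        if row in self.remap:                # re-mapped rows are served by a spare
            return self.spare[self.remap[row]]
        return self.primary[row]

    def write(self, row: int, value: int) -> None:
        if row in self.remap:
            self.spare[self.remap[row]] = value
        else:
            self.primary[row] = value

    def heal(self, row: int, corrected_value: int) -> bool:
        """Re-map a failing row to a spare; False when no spare is left.

        Assumption: the caller supplies corrected data, for example from ECC.
        """
        if row in self.remap:
            return True
        if not self.free_spares:
            return False
        index = self.free_spares.pop(0)
        self.spare[index] = corrected_value
        self.remap[row] = index
        return True


bank = Bank(rows=4)
bank.write(2, 0xAB)
print(bank.heal(2, 0xAB))   # True: row 2 now lives in the spare row
print(hex(bank.read(2)))    # 0xab
print(bank.heal(3, 0))      # False: the only spare is already in use
```

In this model, accesses to the re-mapped row continue to use the same row address, so the rest of the system does not need to know that a repair took place.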
In one embodiment, the self-healing memory provides on-line self-healing for memory hard bit errors. The memory bit functionality can be restored without shutting off power or swapping memory modules. By using an efficient error correcting code (ECC) scheme with the self-healing memory of the present invention, significant immunity from both hard and soft errors can be obtained. In one embodiment, the self-healing memory scheme is scalable, so that as memory sizes increase, the number of hot spare rows can increase. Because the additional spare memory row is small in comparison to the total number of the rows in a memory bank, the die area and cost impact is negligible.
In one embodiment, the self-healing memory is used in servers requiring 24 hour a day, 7 day a week on-line operation, to improve the reliability of such servers. The self-healing memory may also be used in cost sensitive PCs or workstations to avoid expensive service calls and unnecessary warranty costs. The self-healing technique can also be used for non-DRAM memory, and for CPU caches where on-line correction of hard bit errors is needed.