The invention is generally related to data processing systems such as computers and like electronic devices, and more particularly, to error detection and correction in a memory array implemented in a data processing system.
Ensuring the integrity of data processed by a data processing system such as a computer or like electronic device is critical for the reliable operation of such a system. Data integrity is of particular concern, for example, in fault tolerant applications such as servers, databases, scientific computers, and the like, where any errors whatsoever could jeopardize the accuracy of complex operations and/or cause system crashes that affect large numbers of users.
Data integrity issues are a concern, for example, for many solid state memory arrays such as those used as the main working storage repository for a data processing system. Solid state memory arrays are typically implemented using multiple integrated circuit memory devices such as static or dynamic random access memory (SRAM or DRAM) devices. Such devices may be subject to a number of errors throughout their lifetimes, including what are referred to as “hard” and “soft” errors.
A hard error is generally a permanent failure of all or a portion of a memory device, typically due to a latent manufacturing defect or some electrical disturbance that occurs during the operation of the device. A hard error, for example, may affect a single memory cell, a row or column of memory cells, an input/output port, or even an entire device. A soft error is generally a transient alteration in the state of data stored in a memory cell, often due to natural effects such as cosmic rays or alpha particles. Because the cell itself is not damaged, a soft error can generally be cleared by rewriting the affected cell with correct data. Despite significant improvements in circuit fabrication technologies, however, no memory device is completely immune from errors, and as such, significant development efforts have been directed at handling such errors in a manner that ensures the continued integrity of the data stored in the memory array.
For example, complex error correction code (ECC) algorithms have been developed to address some of the errors that may arise during the operation of a memory array. Data is typically stored in a memory array in binary form, as a plurality of bits arranged together to form logical “words” of data. Most ECC algorithms address the situation where a single bit in a word is faulty, an error condition known as a “single bit error”. To do so, most ECC algorithms store a separate error correction code along with each word of data, and a mathematical algorithm uses that code to both detect and correct any single bit error in the word.
ECC algorithms typically cannot address the situation where multiple bits in a word are faulty. However, since single bit errors comprise the vast majority of all errors experienced in a memory array, ECC algorithms do a great deal to improve the data integrity of ECC-capable data processing systems. Moreover, to minimize the likelihood of multi-bit errors (also referred to as unrecoverable errors), many memory arrays arrange memory devices such that each device provides no more than one bit in any given word in the memory address space. Consequently, failure of any given device, or portion thereof, will only cause an unrecoverable error if an error is also present in another device that provides another bit of the same word.
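By way of illustration only, the single bit correction described above may be sketched with a Hamming(7,4) code. This particular code is an assumption chosen for brevity and is not identified in the text; practical memory arrays use wider words and stronger codes. The names `encode` and `decode` are hypothetical.

```python
from typing import Tuple

PARITY_POSITIONS = (1, 2, 4)   # codeword positions holding check bits
DATA_POSITIONS = (3, 5, 6, 7)  # codeword positions holding data bits


def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    bits = [0] * 8  # index 0 is unused so list indices match codeword positions
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit makes parity even over the positions whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p) % 2
    return sum(bits[i] << (i - 1) for i in range(1, 8))


def decode(codeword: int) -> Tuple[int, int]:
    """Return (data bits, syndrome); a nonzero syndrome is the position that was flipped."""
    bits = [0] + [(codeword >> (i - 1)) & 1 for i in range(1, 8)]
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(bits[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    if syndrome:
        bits[syndrome] ^= 1  # assume a single bit error and correct it
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome
```

Flipping any one bit of a codeword yields a syndrome equal to the flipped position, so the original data is recovered. Flipping two bits yields a syndrome that points at a third, innocent position, so the decoder silently returns wrong data; this is the multi-bit limitation noted above.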
Additional integrity protection may be available through the use of a redundant memory array, in which a portion of the data width in the memory array is reserved for use whenever an error is detected in another portion of the memory array. In systems in which no memory device supplies more than one bit of data for any given word, a failure of a particular device results in a failure in one bit of a word, and a process known as redundant bit steering (RBS) is used to redirect, or “steer”, dataflow from the failed bit to a redundant bit allocated in the reserved space of the redundant memory array.
Often, the reserved space in a redundant memory array is allocated on a separate, dedicated memory device, although the reserved space could be allocated in existing devices as well. Regardless, when all or a portion of the addressable space of a device is determined to be faulty, the process of redirecting dataflow from the failed bit to a redundant bit is referred to as “replacing” the failed device with a redundant device, irrespective of the fact that other memory accesses allocated to the failed device may continue to be processed.
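By way of illustration only, redundant bit steering may be modeled as a per-bit selection between a regular device and a spare device. The sketch below assumes one bit per device per word and a single dedicated spare; the class `SteeredArray` and its methods are hypothetical names, not part of any described hardware.

```python
from typing import List, Optional


class SteeredArray:
    """Toy model: one bit per device per word, plus one spare device."""

    def __init__(self, words: int, width: int) -> None:
        self.devices = [[0] * words for _ in range(width)]  # devices[bit][address]
        self.spare = [0] * words
        self.steered_bit: Optional[int] = None  # bit position replaced by the spare

    def steer(self, failed_bit: int) -> None:
        """Redirect all later dataflow for one bit position to the spare device."""
        self.steered_bit = failed_bit

    def store(self, address: int, bits: List[int]) -> None:
        for bit, value in enumerate(bits):
            target = self.spare if bit == self.steered_bit else self.devices[bit]
            target[address] = value

    def fetch(self, address: int) -> List[int]:
        return [
            (self.spare if bit == self.steered_bit else self.devices[bit])[address]
            for bit in range(len(self.devices))
        ]
```

For example, after `steer(3)` a store to address 2 places bit 3 in `spare[2]` and leaves `devices[3][2]` untouched. The sketch deliberately omits initialization of the spare, so a fetch from an address that has not been stored since steering returns whatever the spare happened to hold.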
Replacing a failed device with a redundant device necessarily requires that the redundant device be initialized with the data from the failed device. The most straightforward manner of doing so would be to simply prohibit access to the memory array, copy the affected data over from the failed device to the redundant device, and then switch over to the redundant device for all future accesses. However, in most fault tolerant applications, it is not possible to prohibit accesses to the memory array for any appreciable amount of time. Consequently, a redundant memory array typically must be capable of handling non-initialization operations concurrently with initialization of a redundant memory device.
For example, one manner of initializing a redundant device while maintaining the availability of the memory array is to simply switch over to the redundant device and allow ECC logic to correct any single bit errors in the redundant bit supplied by the device. Over time, stores to the redundant device would fill the device with correct data. However, without initialization, the initial state of the data in the redundant device at the time of switchover cannot be known, and as such, statistically 50% of all accesses involving the redundant device will require the redundant bit to be corrected. Reliability then becomes a concern with this approach, since for any single-bit error in another memory device, there is roughly a 50% chance that an error in the redundant bit will also occur, resulting in an unrecoverable multi-bit error.
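By way of illustration only, the 50% figure may be reproduced by enumeration. The sketch assumes that the correct bit and the uninitialized redundant bit are each equally likely to be 0 or 1 and are independent, and that accesses are spread uniformly over addresses; both function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def spare_mismatch_probability() -> Fraction:
    """Chance that an uninitialized redundant bit differs from the correct bit."""
    # (correct bit, redundant bit) pairs, assumed equally likely
    outcomes = list(product((0, 1), repeat=2))
    mismatches = sum(1 for correct, spare in outcomes if correct != spare)
    return Fraction(mismatches, len(outcomes))


def expected_correction_rate(fraction_overwritten: Fraction) -> Fraction:
    """Share of accesses needing correction once some addresses hold stored data."""
    return (1 - fraction_overwritten) * spare_mismatch_probability()
```

Under these assumptions, two of the four equally likely pairs disagree, giving a probability of 1/2 at switchover. The rate falls only as stores overwrite the redundant device; with a quarter of the addresses overwritten, for instance, it is 3/8.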
Another approach is to attempt to copy data over from the failed device to the redundant device concurrently with the processing of regular store and fetch operations submitted to the memory array, a process known as “cleaning” the redundant device. Data is typically copied over by sequentially fetching segments of data allocated to a failed memory device, passing the data through normal ECC logic, and storing the corrected data segments back into the redundant device. During such operations, however, the hardware that substitutes the redundant device for the failed device is typically controlled such that regular fetch or store operations directed to an uncleaned area of the failed device are directed to the failed device, while operations directed to the cleaned area are directed to the redundant device. In practice, such operations are problematic to implement given that the boundary between the cleaned and uncleaned areas is constantly moving, making it difficult to determine whether an access is directed to a cleaned or uncleaned area of a device. These difficulties increase the risk that the failed device will be utilized for accesses to the cleaned area, or that the redundant device will be utilized for accesses to the uncleaned area, introducing potential data integrity concerns. Moreover, the logic required to properly control the dataflow between the failed and redundant devices is more complex, requiring additional hardware and increasing the cost of a memory controller design.
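By way of illustration only, the boundary-based cleaning approach may be sketched as follows. The model tracks a single bit per address; `BoundaryCleaner` is a hypothetical name, and the `correct` callable is a hypothetical stand-in for the ECC logic, which in a real array would operate on the whole word.

```python
from typing import Callable, List


class BoundaryCleaner:
    """Toy model of cleaning with a moving cleaned/uncleaned boundary."""

    def __init__(self, failed: List[int], correct: Callable[[int, int], int]) -> None:
        self.failed = failed                  # bit held by the failed device, per address
        self.redundant = [0] * len(failed)    # redundant device, not yet initialized
        self.correct = correct                # hypothetical stand-in for ECC correction
        self.boundary = 0                     # addresses below this have been cleaned

    def device_for(self, address: int) -> str:
        """Route a regular access by comparing its address with the boundary."""
        return "redundant" if address < self.boundary else "failed"

    def clean_next(self) -> bool:
        """Fetch one address, correct it, store it to the redundant device."""
        if self.boundary >= len(self.failed):
            return False
        addr = self.boundary
        self.redundant[addr] = self.correct(addr, self.failed[addr])
        self.boundary += 1
        return True
```

Every regular access must consult `boundary`, which changes on each call to `clean_next`. In hardware, an access that samples the boundary just before or after it moves may be routed to the wrong device, which is the difficulty described above.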
Therefore, a significant need continues to exist in the art for an improved manner of initializing a redundant device in a redundant memory array, and in particular, for an improved manner of initializing a redundant device which provides fast and efficient initialization while maintaining data integrity.
The invention addresses these and other problems associated with the prior art by providing an apparatus and method of initializing a redundant memory device in which, during initialization of the redundant memory device, the switchover of non-initialization fetch operations to the redundant memory device is delayed until after initialization of the redundant memory device is complete. Consequently, during initialization, the non-initialization fetch operations are directed to the failed memory device, while non-initialization store operations are directed to the redundant device.
Embodiments of the invention utilize operation type-based routing as disclosed herein to take advantage of the fact that, in most situations, the errors that cause a memory device to fail are relatively small when compared to the overall device size, and as such, most of the data is still correct in the failed device. Were fetch operations directed to the redundant device before it is completely initialized, they might introduce single bit errors. By switching only store operations to the redundant device and continuing to direct fetch operations to the failed device while the redundant device is being initialized, fetch operations during the initialization process instead read from the predominantly correct data stored in the failed device. With a lower frequency of single bit errors, the likelihood that a bit error in another device will cause an unrecoverable error is reduced.
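By way of illustration only, the operation type-based routing described above may be sketched as a selection that depends on the operation type and a single completion flag. The names `Op`, `Device`, and `route` are hypothetical, and the sketch covers only non-initialization operations directed to the steered bit.

```python
from enum import Enum


class Op(Enum):
    FETCH = "fetch"
    STORE = "store"


class Device(Enum):
    FAILED = "failed"
    REDUNDANT = "redundant"


def route(op: Op, initialization_complete: bool) -> Device:
    """Pick the device for a non-initialization operation on the steered bit."""
    if initialization_complete:
        return Device.REDUNDANT  # full switchover once initialization is complete
    if op is Op.STORE:
        return Device.REDUNDANT  # stores are switched over during initialization
    return Device.FAILED         # fetches stay on the failed device until then
```

In this sketch the decision does not depend on the address of the access, so no moving boundary between cleaned and uncleaned areas needs to be consulted.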
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.