1. Field of the Invention
This invention relates generally to computer systems and, more particularly, to error handling in a memory system.
2. Background of the Related Art
This section is intended to introduce the reader to various aspects of art which may be related to various aspects of the present invention which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Computer systems, such as personal computers and servers, rely on microprocessors, associated chip sets, and memory chips to perform most of their processing functions. In contrast to the dramatic improvements in the processing portions of a computer system, the mass storage portion has experienced only modest growth in speed and reliability. As a result, computer systems fail to capitalize fully on the increased speed of the improving processing systems due to the dramatically inferior capabilities of the mass data storage devices coupled to the systems.
While the speed of these mass storage devices, such as magnetic disk drives, has not improved much in recent years, the size of such disk drives has become smaller while maintaining the same or greater storage capacity. Furthermore, such disk drives have become less expensive. To capitalize on these benefits, it was recognized that a high capacity data storage system could be realized by organizing multiple small disk drives into an array of drives. However, it was further recognized that large numbers of smaller disk drives dramatically increased the chance of a disk drive failure which, in turn, increases the risk of data loss. Accordingly, this problem has been addressed by including redundancy in the disk drive arrays so that data lost on any failed disk drive can be reconstructed through the redundant information stored on the other disk drives. This technology has been commonly referred to as “redundant arrays of inexpensive disks” (RAID).
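The effect of drive count on failure risk can be illustrated with a short calculation. The following is only a sketch: it assumes that drives fail independently and with the same probability over some period, and both the function name and the 2% figure are hypothetical values chosen for illustration, not data from any particular drive.

```python
def array_failure_probability(p_single: float, n_drives: int) -> float:
    """Probability that at least one of n_drives fails.

    Assumes (for illustration only) independent failures, each with
    probability p_single over the period of interest.
    """
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical per-drive failure probability of 2% over some period.
for n in (1, 10, 50):
    print(n, round(array_failure_probability(0.02, n), 3))
```

Under these assumptions, an array of fifty such drives has roughly a 64% chance of experiencing at least one drive failure, which is why redundancy becomes necessary as arrays grow.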
To date, at least five different levels of RAID have been introduced. The first RAID level (“RAID Level 1”) utilizes mirrored devices. In other words, data is written identically to at least two disks. Thus, if one disk fails, the data can be retrieved from one of the other disks. Of course, a RAID Level 1 system incurs the cost of an additional disk without increasing overall storage capacity, in exchange for a decreased likelihood of data loss. The second level of RAID (“RAID Level 2”) implements an error correcting code or “ECC” (also called “error check and correct”) scheme where additional check disks are provided to detect single errors, identify the failed disk, and correct the data on the disk with the error. The third level RAID system (“RAID Level 3”) stripes data at a byte-level across several drives and stores parity data in one drive. RAID Level 3 systems generally use hardware support to efficiently facilitate the byte-level striping. The fourth level of RAID (“RAID Level 4”) stripes data at a block-level across several drives, with parity stored on one drive. The parity information allows recovery from the failure of any single drive. The performance of a RAID Level 4 array is good for read requests. Writes, however, may require that parity data be updated each time. This slows small random writes, in particular, though large writes or sequential writes may be comparatively faster. Because only one drive in the array stores redundant data, the cost per megabyte of a RAID Level 4 system may be fairly low. Finally, a level 5 RAID system (“RAID Level 5”) provides block-level striping where data and parity information are distributed in some form throughout the disk drives in the array. Advantageously, RAID Level 5 systems may increase the processing speed of small write requests in a multi-processor system since no single parity disk becomes a system bottleneck.
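By way of illustration only, the parity-based recovery described for RAID Levels 3 through 5 can be sketched as follows. The sketch assumes that parity is the byte-wise exclusive-OR (XOR) of the data blocks in a stripe, which is the conventional choice but is not stated above. The helper names and the rotating parity layout are hypothetical, the latter being merely one possible way to distribute parity “in some form” across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid4_parity_drive(stripe_index, n_drives):
    """RAID Level 4 style: parity always lives on the last drive."""
    return n_drives - 1


def raid5_parity_drive(stripe_index, n_drives):
    """One hypothetical RAID Level 5 layout: parity rotates per stripe."""
    return (n_drives - 1 - stripe_index) % n_drives


# One stripe of data spread across three data drives.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Simulate the loss of the second drive: XOR the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because every small write must also update the parity block, a fixed parity drive as in `raid4_parity_drive` is touched by every write, whereas a rotating layout spreads those updates across all drives.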
The implementation of data redundancy, such as in the RAID schemes discussed above, provides fault tolerant computer systems wherein the system may still operate without data loss, even if one drive fails. This is contrasted to a disk drive array in a non-fault tolerant system where the entire system is considered to have failed if any one of the drives fails. Of course, it should be appreciated that each RAID scheme necessarily trades some overall storage capacity and additional expense in favor of fault tolerant capability. Thus, RAID systems are primarily found in computers performing mission critical functions where failures are not easily tolerated. Such functions may include, for example, a network server, a web server, a communication server, etc. One of the primary advantages of a fault tolerant mass data storage system is that it permits the system to operate even in the presence of errors that would otherwise cause the system to malfunction. As discussed previously, this is particularly important in critical systems where downtime may cause relatively major economic repercussions.
As with disk arrays, memory devices may be arranged to form memory arrays. For instance, a number of Dynamic Random Access Memory (DRAM) devices may be configured to form a single memory module, such as a Dual Inline Memory Module (DIMM). The memory chips on each DIMM are typically selected from one or more DRAM technologies, such as synchronous DRAM, double data rate SDRAM, direct-RAM bus, and synclink DRAM, for example. Typically, DIMMs are organized in an X4 (4-bit wide), an X8 (8-bit wide), or wider fashion. In other words, the memory chips on the DIMM are 4 bits, 8 bits, 16 bits, or 32 bits wide. To produce a 72-bit data word using an X4 memory organization, an exemplary DIMM may include nine 4-bit wide memory chips located on one side of the DIMM and nine 4-bit wide memory chips located on the opposite side of the DIMM. Conversely, to produce a 72-bit data word using an X8 memory organization, an exemplary DIMM may include nine 8-bit wide memory chips located on a single side of the DIMM. The memory modules may be arranged to form memory segments and the memory segments may be combined to form memory arrays. Controlling the access to and from the memory devices as quickly as possible while adhering to layout limitations and maintaining as much fault tolerance as possible is a challenge to system designers.
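The chip counts in the exemplary X4 and X8 organizations follow directly from dividing the 72-bit word width by the chip width. A minimal sketch of that arithmetic is given below; the function name is hypothetical and is introduced only for illustration.

```python
def chips_per_word(word_bits: int, chip_width_bits: int) -> int:
    """Number of memory chips needed to supply one data word."""
    if word_bits % chip_width_bits:
        raise ValueError("word width must be a multiple of the chip width")
    return word_bits // chip_width_bits


print(chips_per_word(72, 4))  # X4: 18 chips, e.g. nine per side of the DIMM
print(chips_per_word(72, 8))  # X8: 9 chips, e.g. all on a single side
```

One consequence worth noting is that the failure of a single X8 chip can corrupt up to eight bits of the 72-bit word, whereas the failure of a single X4 chip can corrupt at most four.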
One mechanism for improving fault tolerance is an Error Checking and Correcting (ECC) algorithm. ECC is a data encoding and decoding scheme that uses additional data bits to provide error checking and correcting capabilities. Today's standard ECC algorithms, such as the Intel P6 algorithm, can detect single-bit or multi-bit errors within an X4 memory device. Further, typical ECC algorithms provide for single-bit error correction (SEC). However, typical ECC algorithms alone may not be able to correct multi-bit errors. Further, while typical ECC algorithms may be able to detect single-bit errors in X8 devices, they cannot reliably detect multi-bit errors in X8 devices, much less correct those errors. In fact, approximately 25% of all possible multi-bit errors within an X8 memory device are either undetected or wrongly detected as single-bit errors, that is, “misaliased.” Misaliasing refers to multi-bit error conditions in an X8 (or larger) memory device that defeat standard ECC algorithms such that the multi-bit errors appear to the ECC logic to be either correct data or data with a single-bit correctable error.
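Misaliasing can be demonstrated with a toy code. The sketch below is not the Intel P6 algorithm and does not operate on 72-bit words. It assumes a small extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors, purely to show how errors of three or four bits can appear to the decoder as a correctable single-bit error or as correct data. All function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions (check bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of set bits even
    return code


def decode(code):
    """Return (status, nibble) as a simple SEC-DED decoder would see it."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: the decoder assumes exactly one bit flipped.
        fixed[syndrome] ^= 1  # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        status = "uncorrectable"  # even number of flips, nonzero syndrome
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble


def flip(code, *positions):
    """Return a copy of code with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1011)
print(decode(flip(word, 6)))           # single-bit error: corrected properly
print(decode(flip(word, 2, 6)))        # double-bit error: detected
print(decode(flip(word, 3, 5, 6)))     # triple-bit error: "corrected", wrong data
print(decode(flip(word, 0, 3, 5, 6)))  # four-bit error: reported as "ok"
```

In the last two cases the decoder reports success while returning data that differs from the original, which is the behavior described above as misaliasing. A wider memory device makes such multi-bit patterns more likely because a single device failure can disturb more bits of the same word.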