1. Field of the Invention
The present application generally relates to error control coding methods for computer memory systems and, more particularly, to accessing error-control coded data in pieces smaller than one line, yet maintaining a high level of reliability. Such a mode of operation will be called subline access, which has its name derived from computer memory systems, where a cache line has a fixed size of, say, 128 bytes for IBM systems. One motivation to consider such a mode of data access is to increase efficiency; e.g., to allow more concurrent accesses, to reduce contention and latency, and to conserve power, although, depending on the application, there may be many other reasons to consider such an option.
2. Background Description
Codes protect data against errors. FIG. 1 illustrates lines of data in computer memory where the shaded portions represent data and the open portions represent redundant symbols. For the purposes of our discussion, the lines of data are broken into sublines of data, and the redundant symbols are distributed amongst the sublines of data. When a line of data is read from memory, the combined redundant symbols are used to provide error detection/correction capabilities, thereby protecting the integrity of the data.
When we read a portion of a codeword, the idea is of course that small pieces of data shall still be protected with certain error detection/correction capabilities. A trivial “solution” to the problem is to have each subline by itself have enough error correction capability as required for the whole codeword in the worst case, as shown, for example, in FIG. 2. This certainly would work and also offers the most independence/concurrency among subline accesses, which may become important in certain cases, but in general is too wasteful in requiring a large overhead.
In FIG. 3, we illustrate a DIMM (dual in-line memory module) 10 that consists of 19 DRAM (dynamic random access memory) devices, sixteen DRAM devices d1-d16 storing data and three redundant DRAM devices r1-r3 storing redundant symbols, which operate in parallel. Each DRAM device in this figure has a total of eight I/O (input/output) pins which connect to the memory hub chip 12, which communicates with the memory controller 14 of the processor chip 16 over high speed communication channels 18, and every access to the DRAM results in reading or writing information sent through those eight I/O pins during a minimum of eight consecutive transfers. Those skilled in the art will recognize that the above is a minimum burst 8, ×8 DRAM device. An illustration of the transmission pattern of the DRAM device cited above can be found in FIG. 4. It is evident that during a read operation, sixteen of these DRAMs will provide a total of 128 bytes (further abbreviated 128 B, with a similar notation for other situations), which is the cache line that is commonly present on PowerPC® microprocessors manufactured by International Business Machines Corp. (IBM). The three DRAMs r1-r3 shown in FIG. 3 are employed to store redundancy symbols to protect the integrity of the information stored in the other sixteen DRAMs.
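The access size quoted above follows directly from the device parameters. A minimal sketch of the arithmetic, in which the function name and its defaults are illustrative and not part of the described system:

```python
def bytes_per_access(data_chips: int, io_pins: int = 8, burst_length: int = 8) -> int:
    """Data bytes delivered by one minimum-length burst across the data DRAMs.

    Each DRAM drives `io_pins` bits per transfer for `burst_length` transfers.
    """
    bits = data_chips * io_pins * burst_length
    return bits // 8


# Sixteen x8 devices, burst 8: 16 * 8 * 8 = 1024 bits = 128 B (one cache line).
print(bytes_per_access(16))  # 128
# Half of the data devices deliver a 64 B subline in the same burst.
print(bytes_per_access(8))   # 64
```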
A standard [19,16] shortened Reed-Solomon code constructed on GF(256) may be used as follows. Assign one (byte) symbol to each individual DRAM transfer over its eight I/O channels. The code is applied across the nineteen DRAMs in parallel, and independently on each of the eight transfers. An illustration of this coding scheme can be found in FIG. 5. It is well known in the art that such a [19,16] shortened Reed-Solomon code has a minimum distance of four symbols and therefore it is capable of correcting any one chip and, at the same time, fully detecting the existence of a double chip error.
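The following is a minimal sketch of such a code for one transfer, not the implementation of any particular memory system. The field polynomial 0x11D, the choice of generator roots α^0, α^1, α^2, and all function names are assumptions made for illustration; the text above does not specify them. The sketch encodes sixteen data bytes (one per data DRAM) into nineteen symbols, corrects one erroneous symbol, and refuses to correct when the syndromes are inconsistent with a single error.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

# Exponential and logarithm tables for GF(256) with alpha = 2.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def poly_eval(p, x):
    # Horner evaluation; coefficients are stored highest degree first.
    y = 0
    for c in p:
        y = gf_mul(y, x) ^ c
    return y


def rs_encode(data, nsym=3):
    """Systematic encoding: append the remainder of data(x) * x^nsym mod g(x)."""
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, EXP[i]])  # multiply by (x + alpha^i)
    rem = list(data) + [0] * nsym
    for i in range(len(data)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(g[j], coef)
    return list(data) + rem[len(data):]


def syndromes(word, nsym=3):
    return [poly_eval(word, EXP[i]) for i in range(nsym)]


def correct_single(word):
    """Correct one symbol error; raise ValueError if more than one is detected."""
    s0, s1, s2 = syndromes(word)
    if s0 == s1 == s2 == 0:
        return list(word)
    if 0 in (s0, s1, s2):
        raise ValueError("uncorrectable error detected")
    p = (LOG[s1] - LOG[s0]) % 255  # degree of the erroneous coefficient
    if (LOG[s2] - LOG[s1]) % 255 != p or p >= len(word):
        raise ValueError("uncorrectable error detected")
    fixed = list(word)
    fixed[len(word) - 1 - p] ^= s0  # the error magnitude equals s0
    return fixed


codeword = rs_encode(list(range(16)))  # 16 data symbols + 3 check symbols
print(len(codeword), syndromes(codeword))  # 19 [0, 0, 0]
```

In this arrangement each list position corresponds to one DRAM, so a failed chip corrupts at most one symbol per transfer. Because the minimum distance is four, a pattern of two erroneous symbols can never produce syndromes consistent with a single error, which is why `correct_single` can flag it instead of miscorrecting.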
Now suppose that one desires to access information from the DIMM in a granularity of 64 B (sixty-four bytes), instead of 128 B. Since at the present time the most common cache line in PowerPC® microprocessors is 128 B, one may say that one desires to make 64 B subline accesses. Reference is made to FIG. 6, wherein like reference numerals designate like components of the memory system shown in FIG. 3. Because these DRAM devices have a minimum burst length of eight, the natural way to accomplish this is to partition the sixteen DRAM devices that are supplying data into two groups, here designated as Group 1 and Group 2. In Group 1, the memory chips are designated as c1,1-c1,9, while in Group 2, the memory chips are designated as c2,1-c2,10. These two groups communicate with memory hub chip 12a which, in turn, communicates with the memory controller 16. It is worth noting that in the DDR3 (double data rate 3) generation of DRAM devices, the minimal burst length that is allowed is also eight. This illustrates that this constraint is in fact motivated by basic technology parameters.
Since the number of DRAM devices that are devoted to redundancy is odd (three in this example), we cannot distribute them evenly between the two groups. In this example, Group 1 retains one redundant DRAM, whereas Group 2 is allocated two redundant DRAMs. Now, let us analyze the level of reliability that one obtains by using shortened Reed-Solomon codes applied independently on each of the groups. For Group 2, one may employ a [10,8] shortened Reed-Solomon code for each of the transfers of the group of DRAMs (as described above for the first setting discussed). This code has a minimum distance of three symbols, which enables the system to correct any single chip error that may arise from that group. For Group 1, on the other hand, we can only use a [9,8] shortened Reed-Solomon code. It is well known in the art that this code can only detect single symbol errors, and therefore the reliability characteristics of the DRAM devices of Group 1 are such that a single chip error may be detected but not corrected. It is worth noting that using an [18,16] code on the Group 1 transfers by taking two DRAM transfers instead of one does not result in the desired effect of correcting a single chip error: a failed chip can contribute up to two symbol errors to such a codeword, and the [18,16] code can only correct up to one symbol error if 100% error correction is desired. Longer codes applied over larger fractions of a DRAM burst have similar inconveniences.
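The capabilities quoted for these codes follow from the fact that Reed-Solomon codes are maximum distance separable, so an [n,k] code has minimum distance d = n − k + 1, and from the standard condition that correcting t errors while detecting up to s ≥ t errors requires d ≥ t + s + 1. A minimal sketch of this bookkeeping, with an illustrative function name:

```python
def mds_capability(n: int, k: int) -> dict:
    """Guaranteed symbol-error capabilities of an [n, k] MDS (e.g. Reed-Solomon) code."""
    d = n - k + 1          # minimum distance of an MDS code
    t = (d - 1) // 2       # symbol errors guaranteed correctable
    s = d - 1 - t          # symbol errors guaranteed detectable while correcting t
    return {"distance": d, "correct": t, "detect": s}


for n, k in [(19, 16), (10, 8), (9, 8), (18, 16)]:
    print((n, k), mds_capability(n, k))
# (19, 16): distance 4, correct 1, detect 2  -> single chip correct, double chip detect
# (10, 8):  distance 3, correct 1, detect 1  -> single chip correct
# (9, 8):   distance 2, correct 0, detect 1  -> single chip detect only
# (18, 16): distance 3, correct 1, detect 1  -> one symbol, but a failed chip spans two
```

Note that these figures count symbols. They translate into chip-level guarantees only when each chip contributes a single symbol to the codeword, which is why the [18,16] arrangement over two transfers falls short.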
The above illustrates that accessing smaller amounts of data in a memory in some instances results in a loss of available reliability. In the case of 128 B granularity of access, there is single chip error correction and double chip error detection, whereas in the case of 64 B granularity of access, a simple application of independent codes results in one of the groups not being able to correct all single chip errors. This is not an artificial result of having selected an odd number for the total number of redundant chips. If one had chosen four redundant chips in total, then it is easy to see that the system with 128 B access granularity would be able to do double chip error correction, whereas 64 B access granularity (with two redundant chips in each group) would only be able to do single chip error correction.
The phenomenon described above is further exacerbated as the desired unit of access becomes smaller. Taking again the example in which a total of four additional redundant chips are given, if the desired unit of access is 32 B, then only one chip is allocated for every 32 B group, and only single chip error detection is attained.
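The trend described in the last two paragraphs can be tabulated. The sketch below assumes that each group is protected by an independent Reed-Solomon code with one symbol per chip, so that r redundant chips in a group yield a minimum distance of r + 1; the function name and the rule for placing leftover chips are illustrative only.

```python
def per_group_protection(total_redundant: int, groups: int) -> list:
    """Split redundant chips across independently coded groups.

    Returns one (redundant_chips, min_distance, correctable_chips) tuple per group.
    Leftover chips go to the first groups; that placement is arbitrary.
    """
    base, extra = divmod(total_redundant, groups)
    result = []
    for g in range(groups):
        r = base + (1 if g < extra else 0)
        d = r + 1                      # MDS code with r check symbols
        result.append((r, d, (d - 1) // 2))
    return result


print(per_group_protection(4, 1))  # 128 B access: [(4, 5, 2)]
print(per_group_protection(4, 2))  # 64 B access:  [(2, 3, 1), (2, 3, 1)]
print(per_group_protection(4, 4))  # 32 B access:  four groups of (1, 2, 0)
print(per_group_protection(3, 2))  # odd split:    [(2, 3, 1), (1, 2, 0)]
```

With the same total of four redundant chips, the number of correctable chips per access drops from two to one to zero as the access granularity is halved twice.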
As a result of the discussion above, it is often the case that one chooses to access information in sufficiently large lines so that reliability is not an issue, which in turn is associated with a number of drawbacks. For example, in memories where concurrent requests can be serviced, it may be that fewer such requests can in principle be serviced due to the fact that the larger line results in more resources from the memory being in a busy state. Other drawbacks include increased power consumption, due to the activation of a larger number of resources in the memory, and/or an increased usage of the communication channels that connect the memory with the system that uses it. A recent trend in adding more processing cores in a processor chip strains the buses that connect the processor chip with its memory subsystem and in some instances the result is a trend to design memories with smaller access granularities, with the reliability drawbacks noted above.
The description of the issues above serves as a motivation for this invention, in which we disclose a memory augmented with special error control codes and read/write algorithms to improve upon the problem exposed. In order to maximize the scope of our invention, we also disclose novel error control methods that in some instances result in improved redundancy/reliability tradeoffs. We include a detailed description of the optimality properties that one in general may desire from codes for this application. We phrase our invention using the terminology “line/subline”, where subline is the desired (smaller) common access granularity and line is the access granularity that is used during an error correction stage. The general aspect of the error control coding techniques that we use is that a two level coding structure is applied with a first level for the sublines permitting reliable subline accesses correcting and detecting possible errors up to a prescribed threshold, and then a second level permitting further correction of errors found. We note that in the future what we are currently calling a subline may be referred to as a line in microprocessors and what we call a line will necessitate a different terminology; for example “block of lines”.
It is noted that in the related field of hard drive storage technology a number of inventions have been made that employ error control. The known inventions are listed and discussed below.
In U.S. Pat. No. 4,525,838 for “Multibyte Error Correcting System Involving A Two-Level Code Structure” by Arvind M. Patel and assigned to IBM, a method is disclosed whereby small data chunks are protected with a first level of code and then multiple such small data chunks are protected using a shared, second level of code. The motivation cited for the invention is that conventional coding techniques impose restrictions on the blocklength of the code that arise from algebraic considerations of their construction. For example, when the Galois Field that is used to construct the code has cardinality q, it is known that Reed-Solomon codes have maximum blocklength q−1, and doubly extended Reed-Solomon codes only increase this blocklength by 2. In typical applications q=256, which in the storage application of Patel would in some instances lead to undesirable restrictions.
In U.S. Pat. No. 5,946,328 for “Method and Means for Efficient Error Detection and Correction in Long Byte Strings Using Integrated Interleaved Reed-Solomon Codewords” by Cox et al. and assigned to IBM, a method is disclosed whereby a block composed of a plurality of interleaved codewords is such that one of the codewords is constructed through a certain logical sum of the other codewords. The procedure indicated is claimed to further enhance the reliability of the stored data above the reliability levels attained by the patent of Patel, U.S. Pat. No. 4,525,838. We note that the error detection/correction procedure is applied to blocks of stored data. This is because the main motivation for this invention is not to provide individual access to codewords of the block but rather to provide for an integrated interleaving of the block that is more efficient than that provided by Patel.
In U.S. Pat. No. 6,275,965 for “Method and Apparatus for Efficient Error Detection and Correction in Long Byte Strings Using Generalized, Integrated Interleaving Reed-Solomon Codewords” by Cox et al. and assigned to IBM, the earlier U.S. Pat. No. 5,946,328 is further augmented with the capability of multiple codewords within a block benefiting from the shared redundancy when their own redundancy is insufficient to correct errors.
In U.S. Pat. No. 6,903,887 for “Multiple Level (ML), Integrated Sector Format (ISF), Error Correction Code (ECC) Encoding and Decoding Processes for Data Storage or Communication Devices and Systems” by Asano et al. and assigned to IBM, the idea of an integrated interleave in a sector is further extended with multiple levels of code to cover integrated sectors. We note that a change in terminology has come into effect in this patent whereby what was previously called a block in earlier patents is now identified with a sector together with its redundant checkbytes, and a group of sectors is now called a block. Using the new terminology, a notable aspect of the invention in discussion is that the basic unit of access of this storage memory is a sector (typically 512 bytes), and not the block of sectors to which the shared redundancy is applied, which differs from the previously cited inventions. This feature creates an issue with writing individual sectors to the storage device, the main cited problem being that such individual sector write operations need to be preceded by a read operation that reads the other sectors participating in the overall block, followed by an encoding and writing of the entire block back to the storage. This is referred to as the “Read-Modify-Write” (RMW) problem and is highlighted as an undesirable problem that can potentially reduce the performance of hard disks. The Asano et al. patent addresses this problem through its multiple levels of coding whereby in some instances protection by higher levels is disabled but a certain level of reliability is maintained by the lower levels of coding (for example, by coding within the sector as discussed by earlier patents). Another aspect of the Asano et al. patent is that redundant check bytes computed for a block are computed using only certain summations of check bytes at the sector level (as opposed to the actual data contents of sectors), which is cited as a key property that enables high drive performance by avoiding the need to have the entire sector data present during the check computations.
As we shall see, our invention's preferred embodiment is concerned with memories that are used as the first main level of storage in a computer system, although the disclosed techniques are also applicable to microprocessor caches and other settings. As such, distinct considerations are of the essence. In one aspect of this invention beyond those already stated, our coding techniques enable a memory with the capacity of executing an efficient Read-Modify-Write operation. In another aspect of this invention, novel error control coding techniques are disclosed that have the desirable property that the minimum distance of the second level code can exceed twice the minimum distance of the first level code yet pay the smallest possible theoretical cost in terms of allocated redundant resources (the minimum distance of a code is a technical term that is often used to describe an important aspect of the error correction capacity of a code). In a third aspect of this invention, subline accesses are employed during common system operation but line accesses are employed during a process that is commonly known as “memory scrubbing”, whereby a background system process periodically scans the memory to read and write back the contents, thereby preventing the accumulation of errors in the memory.