During the last several decades, memory technology has progressed dramatically. The density of commercial memory devices, taking Dynamic Random Access Memory (DRAM) as a convenient example, has increased from 1 Kbit to 64 Mbits per chip, a factor of 64,000. Unfortunately, memory device performance has not kept pace with increasing memory device densities. In fact, memory device access times during the same time period have improved by only a factor of 5. By comparison, during the past twenty years, microprocessor performance has increased by several orders of magnitude. This growing disparity between the speed of microprocessors and that of memory devices has forced memory system designers to create a variety of complicated and expensive hierarchical memory techniques, such as Static Random Access Memory (SRAM) caches and parallel DRAM arrays. Further, now that computer system users increasingly demand high-performance graphics and other memory-hungry applications, memory systems often rely on expensive frame buffers to provide the necessary data bandwidth. Increasing memory device densities satisfy the overall quantitative demand for data with fewer chips, but the problem of effectively accessing data at peak microprocessor speeds remains.
Overlaying the problem of data access speed, some computer systems have particularly high requirements for availability and reliability. Central data processing systems at banks and financial institutions, Internet service providers, and telecommunications control systems are ready examples of computer systems which simply cannot fail when accessed by a user. The inevitable occurrence of memory device failures within such computer systems has led to the development of numerous methods and features whereby memory device failures are detected and corrected without shutting down the computer system. One such method is called “chip-kill.”
Conventional chip-kill will be explained with reference to FIG. 1. FIG. 1 illustrates a conventional memory system with the architectural changes required to implement chip-kill. In FIG. 1, four memory devices 10 are arranged along a data bus 12. In the example, each memory device is a Dual In-Line Memory Module (DIMM) including 18 DRAMs, each DRAM communicating 4 data bits to/from data bus 12 (i.e., 18×4 DRAMs). For clarity, only the data line connections for a single DRAM are shown. This example assumes that four (4) groups of 72 bits each (of which 64 bits are data to be returned to the requestor and 8 bits are used for error correction) are communicated by the memory system, thus transferring 256 bits of data to a requestor, normally a controller or microprocessor connected to the memory system. Notably, in the conventional chip-kill memory system two quantities of data are returned by each memory device during a read operation: (i) 16 bytes of data to be returned to the requestor, and (ii) 2 additional bytes of data used for error detection and correction. These additional 2 bytes of data are called “syndrome.”
Syndrome is used in error detection and correction algorithms to determine whether data from a given memory device contains one or more errors. Some algorithms merely detect the presence of data error(s). Other algorithms have the ability to actually correct one or more detected errors. Single-error-correct/double-error-detect (SECDED) algorithms are well understood by those of ordinary skill in the art. Many other conventional error detection and correction algorithms are known, but as a rule the requirement for additional bits of syndrome increases with the increasing sophistication of the algorithm, i.e., the ability of an algorithm to detect and correct data errors depends on the quantity of associated syndrome provided. For one type of SECDED algorithm, the relationship between data and associated syndrome is well known: the number of syndrome bits increases as the log of the number of data bits. So, 64 bits of data require 8 bits of syndrome, 128 bits of data require 9 bits of syndrome, 256 bits of data require 10 bits of syndrome, etc.
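The figures quoted above are consistent with one common SECDED construction, a Hamming code extended by a single overall parity bit. As a rough sketch under that assumption (the function name `secded_check_bits` is illustrative and is not taken from any particular memory system), the number of syndrome bits can be computed as follows:

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits ("syndrome" in the text) for an extended Hamming SECDED code.

    Assumption: a Hamming code plus one overall parity bit. Other SECDED
    constructions may need a different number of bits.
    """
    r = 0
    # Find the smallest r such that r check bits can point at any one of the
    # data_bits + r code positions, or signal "no error".
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # the extra overall parity bit adds double-error detection


for k in (64, 128, 256):
    print(k, secded_check_bits(k))  # prints 64 8, then 128 9, then 256 10
```

Each doubling of the data width adds only one syndrome bit, which is the logarithmic growth described above.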
Returning to FIG. 1, each of the four memory devices returns 18 bits of data. Thus, 288 bits (256 bits of data and 32 bits of syndrome) are actually read during a read operation. In the example, 8 bits of syndrome are applied to each one of four error correcting code (ECC) generators 14 along with 64 bits of data. Using a known SECDED algorithm, this is enough syndrome to detect up to two bit errors in the 64 bits of data, and correct one bit error.
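To make the "correct one, detect two" behavior concrete, the following is a minimal sketch of a 72-bit SECDED codeword (64 data bits plus 8 check bits). It assumes an extended Hamming code, which is only one possible choice; the text does not say which algorithm ECC generators 14 use. The names `encode`, `decode`, and `locator` are hypothetical.

```python
import random

N = 72                                   # 64 data bits + 8 check bits
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]    # Hamming check-bit positions
DATA_POS = [p for p in range(1, N) if p & (p - 1)]  # 64 non-power-of-two positions


def locator(code):
    """XOR of the indices (1..71) of all set bits; zero for a valid codeword."""
    loc = 0
    for i in range(1, N):
        if code[i]:
            loc ^= i
    return loc


def encode(data):
    """Build a 72-bit codeword from 64 data bits (assumed extended Hamming)."""
    code = [0] * N
    for pos, bit in zip(DATA_POS, data):
        code[pos] = bit
    loc = locator(code)
    for p in PARITY_POS:
        code[p] = 1 if loc & p else 0    # forces the locator to zero
    code[0] = sum(code) & 1              # overall parity makes the total even
    return code


def decode(code):
    """Return (status, data) where status is ok, corrected, or uncorrectable."""
    code = list(code)
    loc = locator(code)
    odd = sum(code) & 1
    if loc == 0 and not odd:
        status = "ok"
    elif odd and loc < N:
        code[loc] ^= 1                   # loc == 0 is the overall parity bit
        status = "corrected"
    else:
        status = "uncorrectable"         # e.g. two bit errors: detected only
    return status, [code[p] for p in DATA_POS]


random.seed(0)
data = [random.randint(0, 1) for _ in range(64)]
word = encode(data)

one_error = list(word)
one_error[10] ^= 1
print(decode(one_error)[0])              # corrected

two_errors = list(one_error)
two_errors[40] ^= 1
print(decode(two_errors)[0])             # uncorrectable
```

In this sketch a single flipped bit is located and repaired, while two flipped bits leave the overall parity even with a nonzero locator, so the error is reported but not repaired.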
By having each DRAM in the example supply one bit of data to each ECC generator, the failure of one DRAM can be tolerated since each ECC generator will detect and be able to correct the resulting bit error. Once error detection and correction is complete, each ECC generator 14 strips syndrome from the data and communicates the data to the requestor. During a write operation, the opposite flow of data occurs. A 256-bit block of data is presented by the requestor to the memory system and divided between ECC generators 14 into separate 64-bit blocks of data. Each ECC generator computes the required syndrome bit values and adds syndrome data to the 64 bits of data. The resulting 72-bit data block is then stored in memory devices 10.
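The bit routing that makes a whole-DRAM failure survivable can be illustrated with a short model. This is only a sketch of the arrangement described above: 72 DRAMs (four DIMMs of 18), four data bits per DRAM, and bit j of every DRAM routed to ECC generator j. The function names and the sample bit pattern are hypothetical.

```python
NUM_DRAMS = 72       # 4 DIMMs x 18 DRAMs, per the example
NUM_ECC = 4          # each DRAM supplies one bit to each of 4 ECC generators


def route(dram_outputs):
    """Send bit j of every DRAM to ECC generator j, giving four 72-bit words."""
    words = [[] for _ in range(NUM_ECC)]
    for nibble in dram_outputs:
        for j in range(NUM_ECC):
            words[j].append(nibble[j])
    return words


def errors_per_word(good, bad):
    """Count wrong bits seen by each ECC generator."""
    return [
        sum(g != b for g, b in zip(good_word, bad_word))
        for good_word, bad_word in zip(route(good), route(bad))
    ]


good = [[0, 1, 0, 1] for _ in range(NUM_DRAMS)]   # arbitrary sample contents
bad = [list(nibble) for nibble in good]
bad[17] = [1 - b for b in bad[17]]                # DRAM 17 fails: all 4 bits wrong

print(errors_per_word(good, bad))                 # [1, 1, 1, 1]
```

Even in the worst case, where every bit from the failed DRAM is wrong, each ECC generator sees exactly one bad bit, which a SECDED code can correct.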
Error detection and correction by the ECC generators 14 is typically monitored within the computer system. Should any one DRAM fail, the system may “replace” the failed DRAM with a spare (not shown). This replacement process may be performed in background processing while the computer system remains available to users. In the unlikely event of simultaneous failures in two DRAMs, the computer system in the foregoing example could detect the two failures, but remedial action would require maintenance intervention. Such a happenstance would force a system shut-down or switch over to a back-up system. A more powerful error correction algorithm, one capable of correcting two bit errors, would avoid this event.
In sum, conventional memory systems implementing chip-kill pass both data and syndrome through one or more ECC generators during each read and write operation. Further, the amount of syndrome furnished by each DRAM to individual ECC generators is dependent on the type of error detection and correction algorithm being used by the computer system. More powerful error detection and correction algorithms require more syndrome bits.
As can be seen from the foregoing example, conventional memory systems use a large number of data lines, or a relatively wide bus. The term “line(s)” is used to describe the physical mechanism by which data bits are electronically communicated from one point to another in a system. A line may take the form, alone or in combination, of a printed circuit board (PCB) strip, metal contact, pin and/or via, microstrip, semiconductor channel, etc. A line may be single or may be associated with a bus. A “bus” is a collection, fixed or variable, of lines, and may also be used to describe the drivers, latches, buffers, and other elements associated with an operative collection of lines. A bus may communicate control information, address information, and/or data. In the foregoing example, four sets of 72 data bit lines connect the memory devices 10 and ECC generators 14. On the other side of the ECC generators, four sets of 64 data bit lines combine to form a 256-bit wide data bus.
Such massively parallel, or wide, buses are required in conventional memory systems due to the slow access speed of memory devices. Wide buses have long been associated with implementation and performance problems, such as excessive power consumption, slow speed, loss of expandability and design flexibility, etc. Thus, various attempts have been made to effectively use relatively narrower buses. In one common approach, packets of data larger than the width of the bus are divided into portions, and the resulting portions are then transmitted over a number of cycles.
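The multi-cycle approach can be sketched as follows. The 256-bit packet matches the example above, but the 16-bit bus width, the least-significant-portion-first ordering, and the function names are assumptions chosen purely for illustration.

```python
def split_packet(packet: int, packet_bits: int, bus_bits: int):
    """Yield bus-width portions of a packet, least significant portion first."""
    mask = (1 << bus_bits) - 1
    cycles = -(-packet_bits // bus_bits)      # ceiling division
    for cycle in range(cycles):
        yield (packet >> (cycle * bus_bits)) & mask


def join_packet(portions, bus_bits: int) -> int:
    """Reassemble the packet at the receiving end of the bus."""
    packet = 0
    for cycle, portion in enumerate(portions):
        packet |= portion << (cycle * bus_bits)
    return packet


packet = (1 << 255) | 0xABCD                  # an arbitrary 256-bit value
portions = list(split_packet(packet, 256, 16))
print(len(portions))                          # 16 cycles on a 16-bit bus
print(join_packet(portions, 16) == packet)    # True
```

The bus shrinks from 256 lines to 16, at the cost of 16 transfer cycles per packet and the sequencing logic needed at both ends.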
Transmission of data over a number of cycles does allow reduction of the bus size. It also greatly increases system complexity. Such complexity often results in memory system rigidity. That is, once the memory system is implemented in all its complexity, the integration of a new function into it becomes extremely difficult. In particular, memory system designers continue to face enormous challenges in increasing data throughput while minimizing system complexity and maintaining system reliability.