The invention relates to a memory subsystem and, in particular, to providing fault detection and correction in a memory subsystem.
Computer memory subsystems have evolved over the years, but continue to retain many consistent attributes. Computer memory subsystems from the early 1980s, such as the one disclosed in U.S. Pat. No. 4,475,194 to LaVallee et al., of common assignment herewith, included a memory controller, a memory assembly (contemporarily called a basic storage module (BSM) by the inventors) with array devices, buffers, terminators and ancillary timing and control functions, as well as several point-to-point busses to permit each memory assembly to communicate with the memory controller via its own point-to-point address and data bus. FIG. 1 depicts an example of this early 1980s computer memory subsystem with two BSMs, a memory controller, a maintenance console, and point-to-point address and data busses connecting the BSMs and the memory controller.
FIG. 2, from U.S. Pat. No. 5,513,135 to Dell et al., of common assignment herewith, depicts an early synchronous memory module, which includes synchronous dynamic random access memories (DRAMs) 8, buffer devices 12, an optimized pinout, an interconnect and a capacitive decoupling method to facilitate operation. The patent also describes the use of clock re-drive on the module, using such devices as phase lock loops (PLLs).
FIG. 3, from U.S. Pat. No. 6,510,100 to Grundon et al., of common assignment herewith, depicts a simplified diagram and description of a memory system 10 that includes up to four registered dual inline memory modules (DIMMs) 40 on a traditional multi-drop stub bus channel. The subsystem includes a memory controller 20, an external clock buffer 30, registered DIMMs 40, an address bus 50, a control bus 60 and a data bus 70 with terminators 95 on the address bus 50 and data bus 70.
FIG. 4 depicts a 1990s memory subsystem which evolved from the structure in FIG. 1 and includes a memory controller 402, one or more high speed point-to-point channels 404, each connected to a bus-to-bus converter chip 406, and each having a synchronous memory interface 408 that enables connection to one or more registered DIMMs 410. In this implementation, the high speed, point-to-point channel 404 operated at twice the DRAM data rate, allowing the bus-to-bus converter chip 406 to operate one or two registered DIMM memory channels at the full DRAM data rate. Each registered DIMM included a PLL, registers, DRAMs, an electrically erasable programmable read-only memory (EEPROM) and terminators, in addition to other passive components.
As shown in FIG. 5, memory subsystems were often constructed with a memory controller connected either to a single memory module, or to two or more memory modules interconnected on a ‘stub’ bus. FIG. 5 is a simplified example of a multi-drop stub bus memory structure, similar to the one shown in FIG. 3. This structure offers a reasonable tradeoff between cost, performance, reliability and upgrade capability, but has inherent limits on the number of modules that may be attached to the stub bus. The limit on the number of modules that may be attached to the stub bus is directly related to the data rate of the information transferred over the bus. As data rates increase, the number and length of the stubs must be reduced to ensure robust memory operation. Increasing the speed of the bus generally results in a reduction in modules on the bus with the optimal electrical interface being one in which a single module is directly connected to a single controller, or a point-to-point interface with few, if any, stubs that will result in reflections and impedance discontinuities. As most memory modules are sixty-four or seventy-two bits in data width, this structure also requires a large number of pins to transfer address, command, and data. One hundred and twenty pins are identified in FIG. 5 as being a representative pincount.
FIG. 6, from U.S. Pat. No. 4,723,120 to Petty, of common assignment herewith, is related to the application of a daisy chain structure in a multipoint communication structure that would otherwise require multiple ports, each connected via point-to-point interfaces to separate devices. By adopting a daisy chain structure, the controlling station can be produced with fewer ports (or channels), and each device on the channel can utilize standard upstream and downstream protocols, independent of their location in the daisy chain structure.
FIG. 7 represents a daisy chained memory bus, implemented consistent with the teachings in U.S. Pat. No. 4,723,120. A memory controller 111 is connected to a memory bus 315, which further connects to a module 310a. The information on memory bus 315 is re-driven by the buffer on module 310a to a next module, 310b, which further re-drives the memory bus 315 to module positions denoted as 310n. Each module 310a includes a DRAM 311a and a buffer 320a. The memory bus 315 may be described as having a daisy chain structure with each bus being point-to-point in nature.
A variety of factors including faulty components and inadequate design tolerances may result in errors in the data being processed by a memory subsystem. Errors may also occur during data transmission due to “noise” in the communication channel (e.g., the bus 315). As a result of these errors, one or more bits, which may be represented as X, which are to be transmitted within the system, are corrupted so as to be received as “/X” (i.e., the logical complement of the value of X). In order to protect against such errors, the data bits may be coded via an error correcting code (ECC) in such a way that the errors may be detected and possibly corrected by special ECC logic circuits. A typical ECC implementation appends a number of check bits to each data word. The appended check bits are used by the ECC logic circuits to detect errors within the data word. By appending bits (e.g., parity bits) to the data word, each bit corresponding to a subset of data bits within the data word, the parity concepts may be expanded to provide the detection of multiple bit errors or to determine the location of single or multiple bit errors. Once a data bit error is located, a logic circuit may be utilized to correct the located erroneous bit, thereby providing single error correction (SEC). Many SEC codes have the ability to detect double errors and are thus termed SEC double error detecting (SEC-DED) codes.
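By way of illustration only, the SEC-DED behavior described above can be sketched in software. The following is a minimal sketch, assuming an extended Hamming code (a Hamming code plus one overall parity bit); the function names `encode` and `decode` and the eight bit example data word are hypothetical and are not taken from any of the referenced patents, which may use different codes.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits as an extended Hamming (SEC-DED) codeword.

    Index 0 of the result holds the overall parity bit; indices 1..n hold
    the Hamming codeword, with check bits at the power-of-two positions.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:              # each check bit covers a subset of positions
                parity ^= code[pos]
        code[p] = parity
    overall = 0
    for pos in range(1, n + 1):
        overall ^= code[pos]
    code[0] = overall                # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected',
    'double_error' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = 0
    for bit in code:
        overall ^= bit
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single error at 'syndrome'
        # (a syndrome of 0 means the overall parity bit itself flipped).
        if syndrome <= n:
            fixed[syndrome] ^= 1
            status = "corrected"
        else:
            status = "uncorrectable"
    else:
        # Even number of flipped bits with a nonzero syndrome: detect only.
        status = "double_error"
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data
```

In this sketch, flipping any one bit of an encoded word yields a status of "corrected" together with the original data bits, while flipping any two bits yields "double_error", which mirrors the single error correction and double error detection properties described above.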
FIG. 8 represents a typical parallel bus ECC structure that transfers a complete ECC word in a single cycle. The structure depicted in FIG. 8 is consistent with the teachings in U.S. Pat. No. 6,044,483 to Chen et al., of common assignment herewith. FIG. 8 depicts an 88/72 ECC for computer systems having an eight bit per chip memory configuration. The lines labeled “Wire 1” through “Wire 72” each represent one of the seventy-two wires on the memory bus 315. For a memory subsystem with an eight bit per chip memory configuration, sixty-four bits of data and eight ECC bits are transferred every cycle. The ECC word is transferred entirely in one cycle, and a SEC-DED code may be utilized to correct any single bit failure anywhere in the ECC word, including a hard wire or bitlane failure. In the case of a hard wire or bitlane failure, every transfer has the same bitlane in error, with the ECC correcting it for each transfer.
FIG. 9 depicts a typical manner of defining symbol ECCs for use in fault detection and correction in a memory subsystem. FIG. 9 is consistent with the teachings of U.S. Pat. No. 6,044,483. As shown in FIG. 9, the symbols are four bits in length and the symbols are defined across bitlanes. As is known in the art, a symbol refers to a mathematical derivation of ECC and corresponds to a group of bits that the ECC is able to correct either individually or as a group. Referring to FIG. 9, assuming that data bits one through four are sourced from the same memory chip, data errors located by “symbol 1” can be localized to a particular memory chip (e.g., a DRAM).
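As an illustrative sketch of how four bit symbols defined across bitlanes localize errors, the fragment below maps erroneous bit numbers to symbol numbers. It assumes, hypothetically, that bits are numbered from one and that each group of four consecutive bits forms one symbol (bits one through four being "symbol 1"); the names `SYMBOL_BITS`, `symbol_of` and `failing_symbols` are introduced here for illustration only.

```python
SYMBOL_BITS = 4  # symbol length in bits, as in the four bit symbols described above


def symbol_of(bit_number):
    """Return the 1-based symbol number containing a 1-based bit number.

    Assumes consecutive bits are grouped into symbols: bits 1-4 form
    symbol 1, bits 5-8 form symbol 2, and so on.
    """
    return (bit_number - 1) // SYMBOL_BITS + 1


def failing_symbols(error_bits):
    """Return the sorted list of distinct symbols touched by the given error bits."""
    return sorted({symbol_of(bit) for bit in error_bits})
```

Under these assumptions, errors in bits 1, 3 and 4 all fall within symbol 1 and therefore point to a single source such as one memory chip, whereas errors in bits 4 and 5 span symbols 1 and 2.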
Busses that are protected by ECC are typically run as single transfer busses with a SEC-DED code. In other words, any single bitlane failure is corrected by the SEC code because the ECC word is completely transmitted in one cycle (or shot or transfer). Thus, if a wire, contact, or bitlane is faulty, it would be a faulty bit in every transfer, and the SEC ECC will correct the error each cycle.
Defining symbols across bitlanes may be used to effectively isolate errors to memory chips when a relatively wide parallel ECC structure is implemented and a complete ECC word is transferred in a single cycle. However, defining symbols across bitlanes may not be effective in isolating errors to a particular memory chip or bus wire when a relatively narrow parallel interface is implemented with the ECC word (made up of data bits and ECC bits) being delivered in packets over multiple cycles.
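The difficulty can be illustrated with a short sketch. The numbers used here are assumptions chosen for illustration and do not come from the referenced patents: a seventy-two bit ECC word is sent either over seventy-two wires in one transfer or over eighteen wires in four transfers, bits and wires are numbered from zero, bit b travels on wire b modulo the bus width, and each symbol is four consecutive bits. The names `bits_on_wire` and `symbols_hit` are hypothetical.

```python
SYMBOL_BITS = 4  # four bit symbols defined across bitlanes (assumed grouping)


def bits_on_wire(wire, wires, transfers):
    """Return the 0-based ECC word bits carried by one wire.

    Assumes bit b is sent on wire (b % wires) during transfer (b // wires).
    """
    return [wire + t * wires for t in range(transfers)]


def symbols_hit(wire, wires, transfers):
    """Return the 0-based symbols that a failure of one wire can corrupt."""
    return sorted({bit // SYMBOL_BITS for bit in bits_on_wire(wire, wires, transfers)})
```

With these assumptions, a failure of wire 5 on the seventy-two wire, single transfer bus affects only bit 5 and hence a single symbol. On the eighteen wire, four transfer bus the same failure affects bits 5, 23, 41 and 59, which lie in four different symbols, so the errors no longer map to one symbol, one memory chip, or one correctable unit.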