The present invention relates to the electrical, electronic and computer arts, and, more particularly, to operation of one or more memory chips.
Computers often include one or more DIMMs (dual in-line memory modules) having SDRAM (synchronous dynamic random-access memory) chips (i.e., integrated circuits) mounted thereon. SDRAM chips (and, by extension, DIMMs) are usually classified either as “×8,” with each SDRAM chip having 8 DQs (data lines), or as “×4,” with each SDRAM chip having 4 DQs. Burst length refers to the number of bits read or written to each DQ of an SDRAM chip during any access.
Computers often include trillions of bits of RAM (random-access memory). Yet, an unmitigated error in even a single one of these bits can cause an application or computer to crash, and in some cases render the entire computer inoperable until it is repaired using spare parts, which can entail significant down time. The probability of encountering a RAM failure during normal operations therefore increases as computers grow more powerful and include more RAM. Moreover, device sizes on RAM chips have shrunk to the point that circuits are approaching physical limits, resulting in new failure modes such as variable retention time errors.
Accordingly, it has become increasingly common for RAM to store ECC (error correction code) symbols in addition to data words. By way of example, many DIMMs provide a 72-bit interface composed of 64 data bits and 8 ECC checksum bits, thereby allowing the use of a (72,64) Hamming code for SEC/DED (single error correction, double error detection). However, SEC/DED only permits correction of a single-bit permanent (hard) or transient (soft) error within a given data word.
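By way of illustration only, the SEC/DED behavior described above can be sketched with an extended Hamming code. The bit layout below (overall parity at index 0, check bits at power-of-two positions) and the names `encode` and `decode` are assumptions chosen for clarity; real memory controllers commonly use other (72,64) constructions.

```python
# Sketch of a (72,64) extended Hamming SEC/DED code. The layout is an
# illustrative assumption, not the code used by any particular controller.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]            # Hamming check-bit positions
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions

def _syndrome(word):
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if word[pos]:
            s ^= pos
    return s

def encode(data_bits):
    """Encode 64 data bits into a 72-bit codeword (index 0 = overall parity)."""
    assert len(data_bits) == 64
    word = [0] * 72
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    s = _syndrome(word)
    for p in PARITY_POS:          # set check bits so the syndrome becomes zero
        if s & p:
            word[p] = 1
    word[0] = sum(word[1:]) % 2   # make the total number of set bits even
    return word

def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    s = _syndrome(word)
    odd = sum(word) % 2
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        word[s] ^= 1              # s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        return "uncorrectable", None   # e.g. a double-bit error
    return status, [word[p] for p in DATA_POS]
```

A single flipped bit anywhere in the 72-bit word is corrected, whereas two flipped bits are detected but not corrected, which is the limitation that motivates the stronger schemes discussed below.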
In particular, SEC/DED cannot correct errors affecting multiple bits in a given data word, such as those caused by failure of one of the RAM chips on a DIMM (e.g., a “chip kill”). Indeed, it would be desirable for a system to be sufficiently robust to survive failure of multiple RAM chips on a DIMM (e.g., a “double chip kill”) or even failure of an entire DIMM (e.g., a “DIMM kill”). Preferably, the system can survive these catastrophic failures by either allowing for replacement of failed parts concurrent with system operation or providing sufficient resilience and redundancy so the system can remain fully operational and no repair is necessary.
The number of errors that can be detected and corrected is, however, directly related to the length of the ECC field appended to the data word. The aforementioned SEC/DED only permits correction of a single-bit permanent (hard) or transient (soft) error within a given data word, yet reduces usable memory space by ~11% (8 bits out of 72 bits) to facilitate storage of the ECC.
By contrast, more powerful error correction and recovery techniques often entail additional loss of customer-usable storage space (e.g., to hold longer ECCs). For example, memory mirroring is a relatively simple approach which involves keeping duplicate copies of data on two or more different devices (e.g., different chips and/or different DIMMs). Mirroring enables a system to survive catastrophic memory failures, but acceptance has been very low because it doubles memory requirements on top of the overhead already imposed by the base SEC/DED ECC (e.g., the aforementioned 11%), such that mirroring leaves customers with less than 50% of the installed RAM usable for the customer's software.
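The overhead figures above follow from simple arithmetic, which can be sketched as below. The function names are illustrative, and the sketch assumes that mirroring stores two full copies of each 72-bit word.

```python
from fractions import Fraction

def ecc_overhead(data_bits, check_bits):
    """Fraction of raw storage consumed by check bits."""
    return Fraction(check_bits, data_bits + check_bits)

def usable_fraction(data_bits, check_bits, copies=1):
    """Fraction of raw storage left for data when `copies` mirrors are kept."""
    return Fraction(data_bits, (data_bits + check_bits) * copies)

secded = ecc_overhead(64, 8)                  # 8 of 72 bits
mirrored = usable_fraction(64, 8, copies=2)   # two copies of each 72-bit word

print(f"SEC/DED overhead: {float(secded):.1%}")                 # 11.1%
print(f"Usable with SEC/DED + mirroring: {float(mirrored):.1%}")  # 44.4%
```

Under these assumptions only 4/9 of the installed RAM remains usable, which is consistent with the "less than 50%" figure noted above.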
DDR (double data rate) collectively refers to a series of standards for SDRAM promulgated by the Joint Electron Device Engineering Council (JEDEC) Solid State Technology Association. Successive generations of DDR memory often seek to improve performance by increasing the standard burst length, thus increasing the number of data symbols supplied during any access. For example, DDR1 was published in 2000, with a standard burst length of 2. Thus, each DDR1 ×4 chip supplies 2 nibbles (2×4 bits) on each access, while each DDR1 ×8 chip supplies 2 bytes (2×8 bits) on each access. DDR2 was published in 2003, and increased the standard burst length to 4. DDR3 was published in 2007, and DDR4 was published in 2014, both with a standard burst length of 8. However, increased burst length means that failure of an SDRAM chip will affect a greater number of symbols. The more data and ECC symbols that are lost during a failure, the more total ECC symbols are needed to correct and recover the missing data.
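As a rough illustration, the number of bits a single chip supplies per access (and hence the number of bits placed at risk by failure of that chip) is the product of its DQ count and the burst length. The table and function names below are illustrative; the burst lengths are those stated above.

```python
# Standard burst length per DDR generation, as stated in the text above.
BURST_LENGTH = {"DDR1": 2, "DDR2": 4, "DDR3": 8, "DDR4": 8}

def bits_per_chip_access(dq_width, generation):
    """Bits read or written by one chip in one access: DQs x burst length."""
    return dq_width * BURST_LENGTH[generation]

for gen in BURST_LENGTH:
    x4 = bits_per_chip_access(4, gen)
    x8 = bits_per_chip_access(8, gen)
    print(f"{gen}: x4 chip = {x4} bits, x8 chip = {x8} bits per access")
```

For DDR1 this yields 8 bits (2 nibbles) for a ×4 chip and 16 bits (2 bytes) for a ×8 chip, matching the example above; for DDR3 and DDR4 the corresponding figures are 32 and 64 bits.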
As previously noted, many DIMMs provide a 72-bit interface composed of 64 data bits and 8 metadata (e.g., ECC checksum) bits. A DDR3 or DDR4 DIMM containing 9 ×8 chips (8 data chips and 1 checksum chip) will support 64 bytes of data and 8 bytes of checksum metadata per access. However, ×4 chips provide half as much data per access as ×8 chips, so twice as many ×4 chips are required to provide the 72-bit DIMM interface. Thus, a DDR3 or DDR4 DIMM constructed with ×4 chips will require 16 data chips and 2 checksum chips in order to support 64 bytes of data and 8 bytes of checksum metadata per access.
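The chip counts above can be derived from the interface width, as in the following sketch. The function name and the returned field names are illustrative; a 72-bit interface and a burst length of 8 are assumed, as in the DDR3/DDR4 examples.

```python
def dimm_layout(chip_width, data_bits=64, check_bits=8, burst_length=8):
    """Chips needed to fill a 72-bit interface, and bytes moved per access."""
    return {
        "data_chips": data_bits // chip_width,
        "check_chips": check_bits // chip_width,
        "data_bytes_per_access": data_bits * burst_length // 8,
        "check_bytes_per_access": check_bits * burst_length // 8,
    }

print(dimm_layout(8))   # x8 DIMM: 8 data chips + 1 checksum chip
print(dimm_layout(4))   # x4 DIMM: 16 data chips + 2 checksum chips
```

In both cases each access delivers 64 bytes of data and 8 bytes of checksum metadata; only the number of chips over which those bytes are spread differs.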
The storage capacity for ×8 and ×4 chips is the same, such that doubling the number of ×4 chips relative to ×8 chips will in turn double the total storage capacity for the DIMM. In addition, because ×4 DIMMs have twice as many chips as ×8 DIMMs, a chip failure in a ×4 DIMM will affect only half as many bits as a chip failure in a ×8 DIMM. Moreover, it is typically not possible to provide chip-kill correction in a ×8 DIMM with only one checksum DRAM: if any of the DRAMs fails completely, too many symbols of the code word are lost to allow for chip-kill correction, and the data is therefore lost. Thus, with all else equal, a ×4 chip kill is easier to recover from than a ×8 chip kill because only half as many bits are in error for the ×4 chip. The more data and ECC symbols that are lost during a failure, the more total ECC symbols are needed to correct and recover the missing data.
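One way to see why a ×8 chip kill exceeds the available checksum is a counting argument based on the Singleton bound: any code that corrects t symbol errors at unknown locations needs at least 2t check symbols. The sketch below assumes 8-bit symbols, assumes that all bits of a failed chip fall into whole symbols, and treats the bound as a necessary condition only; it does not model any specific code.

```python
def chip_kill_budget(chip_width, check_bits=8, burst_length=8, symbol_bits=8):
    """Compare symbols lost to one chip kill with the check symbols available.

    Returns (lost_symbols, check_symbols, bound_satisfied), where the bound is
    check_symbols >= 2 * lost_symbols (unknown error locations).
    """
    lost_symbols = chip_width * burst_length // symbol_bits
    check_symbols = check_bits * burst_length // symbol_bits
    return lost_symbols, check_symbols, check_symbols >= 2 * lost_symbols

print("x8:", chip_kill_budget(8))   # 8 symbols lost, 8 check symbols
print("x4:", chip_kill_budget(4))   # 4 symbols lost, 8 check symbols
```

Under these assumptions a ×8 chip kill loses as many symbols as there are check symbols, so the bound cannot be met, whereas a ×4 chip kill loses half as many and the bound is satisfied.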
Even though successive generations of DDR memory have increased burst length in order to increase memory bandwidth, there are countervailing constraints with respect to CPU (central processing unit) design. For example, it is often desirable to keep CPU architecture stable, which often has caused CPU cache line sizes to remain the same rather than keeping pace with the increasing memory burst length. In some cases, CPU cache line sizes have even shrunk, as the number of processor cores increases, in order to maintain and improve system performance characteristics.
A memory controller can spread data associated with a cache line over multiple DIMMs by simultaneously accessing multiple memory channels. Additionally and/or alternatively, each memory channel may be coupled to multiple DIMMs arranged in respective cascades. Utilizing multiple DIMMs across multiple memory channels tends to increase bandwidth, thereby improving performance. Doubling the number of DIMMs utilized doubles the number of physical components which, with all else equal, will increase the error correction effectiveness. Thus, multi-channel designs also typically have better error correction characteristics because there are more data and checksum symbols to utilize on any access, which will result in stronger, more robust error correction codes. However, as DDR burst lengths increase, it can be difficult to maintain or shrink cache line size with multi-channel designs because more data is handled on every access.
In a single-rank DIMM, all DRAMs on the DIMM provide storage for a portion of the data during a given access. However, DIMMs can also be designed with multiple ranks. For example, a second rank could be added by doubling the number of DRAMs on a DIMM, thus doubling the storage capacity of the DIMM. Because only one rank at a time is active on any access, additional ranks are like having multiple logical DIMMs reside on a single physical DIMM.
FIG. 1 is a simplified diagram showing a memory architecture found in the IBM System z mainframe commercially available from International Business Machines, the assignee of the present invention. As shown in FIG. 1, this architecture utilizes 5 memory channels accessed in unison. Note that in FIG. 1, MCU is the memory controller unit, and SN is a SuperNova memory buffer chip on each DIMM coupled to the DRAM chips on that DIMM. This architecture has a 256 byte cache line which, when used with ×8 DIMMs having a burst length of 8 (e.g., DDR3 or DDR4), can be segmented into 4 parts and distributed across 4 memory channels. The fifth memory channel is used to implement a feature known as RAIM (redundant array of independent memory) parity. Each memory access results in 64 bytes of data and 8 bytes of checksum (e.g., checkbits) from each of the first 4 memory channels, and an additional 72 bytes of checksum (e.g., RAIM parity) from the fifth memory channel. Thus, each memory access results in 256 (64×4) bytes of data and 104 (8×4+72) bytes of ECC checksum. This provides sufficient ECC strength for single chip kill correction, double chip kill correction, full DIMM kill correction, or even recovery from a full memory channel failure.
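The byte counts above, and the general idea of recovering a failed channel from a parity channel, can be sketched as follows. The actual RAIM code is not specified here; the sketch assumes plain bytewise XOR parity over the four 72-byte channel payloads (as in RAID-3), treats each payload as opaque, and assumes the failed channel has already been identified. All names are illustrative.

```python
from functools import reduce

DATA_CHANNELS = 4
DATA_BYTES, CHECK_BYTES = 64, 8          # per data channel, per access
PAYLOAD = DATA_BYTES + CHECK_BYTES       # 72 bytes per channel

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_payload(payloads):
    """Fifth-channel payload: XOR of the four data-channel payloads."""
    assert len(payloads) == DATA_CHANNELS
    assert all(len(p) == PAYLOAD for p in payloads)
    return reduce(xor_bytes, payloads)

def rebuild(payloads, parity, failed):
    """Reconstruct the payload of channel `failed` from the survivors."""
    survivors = [p for i, p in enumerate(payloads) if i != failed]
    return reduce(xor_bytes, survivors, parity)

# Totals per access, matching the figures in the text.
total_data = DATA_CHANNELS * DATA_BYTES               # 256
total_check = DATA_CHANNELS * CHECK_BYTES + PAYLOAD   # 104

# Arbitrary demonstration payloads, one per data channel.
payloads = [bytes((37 * ch + i) % 256 for i in range(PAYLOAD))
            for ch in range(DATA_CHANNELS)]
parity = parity_payload(payloads)
recovered = rebuild(payloads, parity, failed=2)
print(total_data, total_check, recovered == payloads[2])
```

Because the parity payload is the XOR of all four data-channel payloads, XORing it with the three surviving payloads yields the missing one, which is why loss of an entire channel remains recoverable in this simplified model.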
However, other computer architectures are different from System z: some have smaller cache line sizes that do not lend themselves to a RAIM-type scheme. For example, IBM System p servers have a 128 byte cache line, while Intel x86 servers typically have a 64 byte cache line. These systems also typically have fewer memory channels than the IBM System z. In fact, some newer designs employ single independent memory channel cache line access. Neither System p nor Intel x86 systems can provide full DIMM kill correction without mirroring. Intel x86 systems cannot even provide single chip kill correction using industry-standard ×8 DRAM DIMMs; chip kill can sometimes be provided with ×4 DRAM DIMMs, but often involves unusual modes of operation that adversely impact system performance.
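The constraint imposed by cache line size can be illustrated by counting how many full channel accesses one cache line can occupy. The sketch assumes a 64-bit data interface per channel and a burst length of 8 (i.e., 64 data bytes per channel access); the names are illustrative, and the result is an upper bound implied by the arithmetic rather than a description of how any of these systems is actually configured.

```python
def data_bytes_per_channel_access(data_bits=64, burst_length=8):
    """Data bytes delivered by one channel in one full burst."""
    return data_bits * burst_length // 8

def channels_filled(cache_line_bytes, data_bits=64, burst_length=8):
    """How many full channel accesses a single cache line can occupy."""
    return cache_line_bytes // data_bytes_per_channel_access(data_bits, burst_length)

# Cache line sizes as stated in the text.
CACHE_LINE_BYTES = {"System z": 256, "System p": 128, "x86": 64}

for name, size in CACHE_LINE_BYTES.items():
    print(f"{name}: {size}-byte line fills {channels_filled(size)} channel access(es)")
```

With these assumptions a 256 byte line fills four data channels, a 128 byte line fills two, and a 64 byte line fills only one, leaving no additional channels' worth of symbols over which to spread a stronger code.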