In recent years, as semiconductor geometries have shrunk, each subsequent generation has become increasingly costly to develop and bring into production. This makes the commercial demand for more memory at ever lower prices per bit harder for memory manufacturers to meet. One current solution is to expand upward by stacking memory chips one atop another in a single package. These memories can be coupled using a variety of technologies known in the art such as, for example, wire bonding, through-silicon vias (TSVs), and the like.
While this can greatly increase the memory density in terms of bits per package footprint area, it creates additional problems that must be solved to create a commercially successful product. One such problem is the presence of defective memory chips in a stack. If a chip is tested and known to fail before assembly, it is easy to discard and replace it with a fully functional chip. Once the chips are packaged, if a chip fails then the entire stack can become defective—especially if there is insufficient in-field repair capability to repair or work around the bad chip. In such a case, all the good die in the stack may be discarded along with the bad one. This can be a particular difficulty in volatile memories like, for example, static random access memories (SRAM) and dynamic random access memories (DRAM). This is because these memories are often used as caches and the main memory for a processor, and programming models and operating systems assume that the entire installed memory space is fully functional.
FIG. 1 illustrates a representative DRAM integrated circuit (IC) 100 of a type known in the art. DRAM IC 100 comprises a memory array 102 of memory cells where individual bits of memory are stored. Memory array 102 is coupled to bit line/sense amplifier (BLSA) 104. The circuits in BLSA 104 are well known in the art and provide the means for addressing columns and for writing data into and reading data from memory array 102.
BLSA 104 is coupled to bidirectional input/output (I/O) bus 106 to allow data to be written to or read from DRAM IC 100 by the system in which DRAM IC 100 is operating. Memory array 102 is further coupled to word line (WL) drivers 108 which are used for addressing rows for reading and writing operations.
The conventional way of increasing yield is to provide a certain number of additional, or redundant, rows and columns that can be switched in to replace defective rows and columns respectively. In FIG. 1, redundant rows 110 and redundant columns 112 are shown as part of memory array 102. The redundant memory cells in area 114 are the intersection of redundant rows 110 and redundant columns 112 as they are part of both.
Typically, there is an overhead of about 3% redundant rows (e.g., 132/128≈1.0313) and about 3% redundant columns, for a memory cell overhead of about 6% (1.0313²≈1.0635). In practice, all of the redundant bits are usually swapped in before the memory device is shipped to a customer. First, defective bits that are tested and found non-functional are replaced; then the weakest bits that are identified by further testing are replaced with the remaining redundant bits to increase reliability. Typically, no in-field repair is done for individual commercial memory chips.
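The overhead arithmetic above can be checked with a short Python sketch. The 132/128 row figure comes from the text; the matching 4-spares-per-128 column figure and the function name `redundancy_overhead` are assumptions made only for illustration.

```python
def redundancy_overhead(base_rows, spare_rows, base_cols, spare_cols):
    """Return (row factor, column factor, total cell factor) for a memory array."""
    row_factor = (base_rows + spare_rows) / base_rows
    col_factor = (base_cols + spare_cols) / base_cols
    # Spare rows and spare columns overlap in one corner (area 114 in FIG. 1),
    # so the total cell count is the product of the two factors.
    return row_factor, col_factor, row_factor * col_factor


# 4 spare rows per 128 (from the text); 4 spare columns per 128 (assumed).
rows, cols, cells = redundancy_overhead(128, 4, 128, 4)
print(f"rows: +{(rows - 1) * 100:.2f}%  cols: +{(cols - 1) * 100:.2f}%  "
      f"cells: +{(cells - 1) * 100:.2f}%")
```

The product form shows why the total is slightly more than twice the per-dimension overhead: the cells in the intersection region are counted once, as part of both the redundant rows and the redundant columns.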
The circuitry in area 116 is used for controlling memory array 102 and its related circuits. Persons skilled in the art will realize that DRAM IC 100 is an abstraction and that many necessary circuits known in the art are omitted for simplicity.
FIG. 2 illustrates a representative memory module of a type known in the art as a Registered Dual In-line Memory Module (RDIMM) with Error Correction Coding (ECC). Memory module 200 comprises a printed circuit board (PCB) 202 with two portions 204A and 204B comprising two parts of an edge connector separated by an alignment notch 206. Memory module 200 is typically inserted into a socket (not shown) on another PCB (not shown) and the edge connector portions 204A and 204B provide electrical connections to the rest of the system (not shown). Alignment notch 206 is placed off-center to prevent incorrect insertion into the socket.
PCB 202 has a variety of integrated circuits mounted thereon in addition to many passive components (not shown) like, for example, decoupling capacitors. There are DRAM ICs 208, ECC ICs 210, a Serial Presence Detection (SPD) EEPROM 212, and a register IC (REG) 214. The DRAMs 208 are of a type known in the art like, for example, DRAM IC 100 in FIG. 1. Typically, in an RDIMM with ECC the DRAMs 208 are partitioned into nine groups each having an associated ECC IC 210. This allows storage of a 72/64 Hamming ECC as known in the art. Each 64-bit data word is encoded into a 72-bit data word with eight parity bits which is sufficient to perform a Single Error Correction Double Error Detection (SECDED) for each original 64-bit data word. The parity bits are stored in the additional memory capacity provided by the ninth group of DRAMs 208. The ECC works to increase the reliability of the RDIMM beyond the reliability imparted due to the redundant rows and columns in the DRAM ICs 208.
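As an illustration of how eight check bits can give SECDED protection for a 64-bit word, the following Python sketch builds a textbook extended Hamming (72,64) code. It is a sketch under stated assumptions: the bit layout, function names, and status strings are hypothetical, and commercial memory controllers may use a different SECDED code with the same 72/64 dimensions.

```python
CODE_BITS = 72                       # position 0 holds the overall parity bit
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
# The 64 non-power-of-two positions in 1..71 carry the data bits.
DATA_POS = [p for p in range(1, CODE_BITS) if p not in PARITY_POS]


def encode(word):
    """Encode a 64-bit integer into a list of 72 bits."""
    bits = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POS:
        # Each Hamming parity bit covers every position whose index has bit p set.
        parity = 0
        for pos in range(1, CODE_BITS):
            if pos & p:
                parity ^= bits[pos]
        bits[p] = parity
    bits[0] = sum(bits[1:]) & 1      # make the parity of all 72 bits even
    return bits


def decode(bits):
    """Return (status, word); word is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_BITS):
        if bits[pos]:
            syndrome ^= pos          # XOR of the positions of all 1 bits
    overall = sum(bits) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < CODE_BITS:
        bits[syndrome] ^= 1          # single error: the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None # e.g. a double error: detected, not fixed
    word = 0
    for i, pos in enumerate(DATA_POS):
        word |= bits[pos] << i
    return status, word


word = 0x0123456789ABCDEF
codeword = encode(word)
single = list(codeword)
single[37] ^= 1                      # inject one bit error
double = list(codeword)
double[5] ^= 1                       # inject two bit errors
double[60] ^= 1
print(decode(codeword)[0], decode(single)[0], decode(double)[0])
```

Seven Hamming parity bits locate any single error among 71 positions, and the eighth (overall) parity bit distinguishes a single error from a double error, which is what makes the code SECDED rather than only single-error correcting.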
Each group of DRAMs typically comprises one, two, or four DRAM ICs 208 depending on the capacities of the individual DRAMs 208 and the desired capacity of the RDIMM 200 itself. In the example in FIG. 2, two DRAMs 208 are shown mounted on the front of PCB 202 for each of the nine groups. Optionally there could be another two DRAMs 208 (not shown) mounted on the rear of PCB 202. Likewise, there could be only a single DRAM 208 associated with each of the nine ECC ICs 210.
SPD EEPROM 212 is typically present in JEDEC standard Dual In-line Memory Modules (DIMM) of all types as is known in the art. SPD 212 allows the memory controller to serially access data stored in the EEPROM concerning the type of DIMM present in any socket and use the data to properly control it. The register IC 214 is used to ease timing constraints by pipelining read and write data. It is present in RDIMMs (hence the “R” in RDIMM) as well as other types of DIMM.
FIG. 3 illustrates a subsystem 300 comprising an applications processor 302 in a package and a Low Power Double Data Rate 4 (LPDDR4) DRAM 304 in a Package on Processor (PoP) package. The DRAM 304 package is itself mounted on the applications processor 302 package as known in the art. Subsystem 300 is suitable for use in an information processing device like, for example, a cell phone or a tablet computer. LPDDR4 DRAM 304 is shown comprising error correction circuit 306 coupled between memory array 308 and applications processor 302.
The JEDEC Low Power Double Data Rate 4 (LPDDR4) Standard (JESD209-4A, November 2015) includes a masked write command (MWR). This command takes longer to complete than a normal write, which allows extra time to access an entire data word, replace the old data with new data in the bytes to be overwritten while keeping the old data in the bytes to be masked, recalculate the parity bits for the entire data word, and then write the entire data word plus parity back into the memory. While no LPDDR4 products with the ECC feature have yet appeared in the market, prototypes have been discussed in the literature and the possibility of using the MWR command this way is mentioned in the JEDEC Standard.
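The read-modify-write sequence described for the masked write can be sketched in Python as follows. This is only a behavioral illustration: the dictionary-based `memory`, the 8-byte word size, and the function names are assumptions, and `check_bits` is a simple XOR placeholder rather than a real ECC.

```python
def check_bits(data):
    """Placeholder check value: XOR of all bytes (a stand-in, not a real ECC)."""
    value = 0
    for byte in data:
        value ^= byte
    return value


def masked_write(memory, addr, new_data, mask):
    """Merge new_data into the stored word; mask[i] True keeps old byte i."""
    old_data, _old_check = memory[addr]           # 1. read the entire data word
    merged = bytes(old if keep else new           # 2. merge old and new bytes
                   for old, new, keep in zip(old_data, new_data, mask))
    memory[addr] = (merged, check_bits(merged))   # 3-4. recompute and write back
    return merged


# Hypothetical 8-byte word at address 0; every even-numbered byte is masked.
initial = bytes(range(8))
memory = {0: (initial, check_bits(initial))}
result = masked_write(memory, 0, bytes([0xFF] * 8), [True, False] * 4)
print(result.hex())
```

The check bits must be recomputed over the whole merged word because they depend on every byte, including those that were masked; that is why a masked write cannot simply update the selected bytes in place.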
Like memory array 102 in FIG. 1, memory array 308 also has redundant rows and columns that are used in a substantially similar manner. Error correction circuit 306 works to increase the reliability of subsystem 300 beyond the reliability imparted due to the redundant rows and columns in memory array 308.
FIG. 4 illustrates an abstraction of a stacked memory device 400 in a single package (not shown). Stacked memory device 400 comprises a controller IC 402 on which an exemplary stack of four DRAM ICs 404A, 404B, 404C and 404D are mounted (though there may be other numbers of DRAM ICs 404 as a matter of design choice).
Controller IC 402 and DRAM ICs 404A, 404B, 404C and 404D are electrically coupled together vertically using Through Silicon Via (TSV) interconnects, an exemplary one of which couples to controller IC 402 at 406A, couples to the top DRAM IC 404D at 406B, and couples to DRAM ICs 404A, 404B and 404C in between. Although other interconnect technologies could be used for interconnection in stacked memory device 400, TSV seems to be the technology that the major memory manufacturers are pursuing for higher density memories in products such as the Hybrid Memory Cube and High Bandwidth Memory.
FIG. 5 illustrates an abstract view of a Hybrid Memory Cube (HMC) product 500 according to the Hybrid Memory Cube Specification 2.1, October 2015, published by the Hybrid Memory Cube Consortium. HMC product 500 comprises a single package (not shown) with a base logic IC 502 and four stacked DRAMs 504A, 504B, 504C and 504D and is organized into a number of vertical partitions known as vaults 506. The base logic IC 502 comprises a vault controller 508 for each vault which in turn manages all of its associated vault DRAM partitions 510A, 510B, 510C and 510D located on DRAMs 504A, 504B, 504C and 504D respectively. Communication between the vault controller 508 and the DRAM partitions 510A, 510B, 510C and 510D is achieved with wide vertical busses implemented in TSVs (not shown), while communication between the vault controllers 508 and the system (not shown) is implemented with high speed serial links (not shown).
Although details are scarce in the literature, there are a number of enhanced reliability features in HMC product 500. Each vault is capable of self-repair and has Hamming ECC. This implies that a higher percentage of redundant rows and columns is present in the DRAM partitions 510A, 510B, 510C and 510D than in conventional DRAM 100 in FIG. 1. Thus the ECC can enhance reliability by covering for a bit that fails during normal operation until the vault controller can schedule a self-repair operation. Effectively the ECC does double duty by covering for soft errors, as in RDIMM 200 or DRAM 304 discussed above, as well as temporarily masking hard errors until they are repaired.
HMC product 500 also contains self-repair capability if a vertical TSV bus line fails by allocating redundant TSV bus lines. Additionally, the HMC product 500 performs parity checking on address and command lines to the DRAM partitions 510A, 510B, 510C and 510D, which allows vault controller 508 to retry read and write operations incorrectly received by one of the memory partitions. The vault controller 508 also does diagnostics on the high speed links to either correct any problem or, in the worst case, shut the link down.
FIG. 6 illustrates an abstract view of a High Bandwidth Memory (HBM) product 600 according to the JEDEC High Bandwidth Memory (HBM) DRAM Standard JESD235A, November 2015. HBM product 600 comprises a base logic IC 602 and four DRAM ICs 604A, 604B, 604C and 604D each comprising memory array 606A, 606B, 606C and 606D respectively in a single package (not shown). According to the Standard, base logic IC 602 is optional and its functionality can be located outside the package elsewhere in the system (not shown). Vertical communication is implemented with vertical bus lines using TSVs (not shown).
While few details are given in the literature, there are a number of enhanced reliability features in HBM product 600. It has self-repair capability, implying a higher percentage of redundant rows and columns in memory arrays 606A, 606B, 606C and 606D than in conventional DRAM 100 in FIG. 1. HBM product 600 supports ECC by providing 16 additional bits per 128 data bits, though the ECC computations are done in the host processor. This number of bits is sufficient to implement two 72/64 Hamming ECCs; alternatively, a more sophisticated ECC scheme operating on all 128 bits, as known in the art, could be used as a function of the software. HBM product 600 also contains self-repair capability if a vertical TSV bus line fails by allocating redundant vertical bus lines.
RAID (redundant array of independent disks) is a venerable technology used to guard against data loss in the event of hard disk failures in high end computers and data centers. The use of RAID-style technology has been mentioned in the literature as an area for investigation to improve the reliability of high density memory products, but no embodiments or methods of use have been disclosed.
RAID actually covers a wide variety of different techniques (some standardized and some proprietary) that provide differing degrees of reliability at different price points. The three most commonly encountered are the standardized RAID 1, RAID 5 and RAID 6.
RAID 1 is often called disk mirroring. Two hard disk drives (HDDs) are controlled in parallel with the same data written to and read from both. If one of the HDDs fails, it can be replaced and then the data can be copied to the new HDD from the surviving HDD. There is a risk of data loss if the surviving HDD also fails before the data is transferred. This is a relatively inexpensive reliability feature, which can typically be found in business PCs and workstations.
RAID 5 requires at least three HDDs to function: two data disks and one parity disk, though additional data HDDs may be added. The parity data is a bit-by-bit XOR of all the data on all the data disks, which is then stored on the parity disk. If any of the disks fails, it can be replaced with the system on and active (a so-called “hot-swap”) and reconstructed without data loss and without stopping or powering down the system. Data loss can occur if a second disk fails before the new disk is reconstructed. RAID 5 is a medium-tier reliability feature typically found in disk arrays used by small-to-medium sized businesses.
FIG. 7 illustrates an abstraction of a RAID 5 disk array 700 as known in the art. This is the simplest case RAID 5 configuration. Disk array 700 comprises two data HDDs 702 and 704 and a parity HDD 706. The three tables 712, 714 and 716 show the bit-by-bit relationship between individual bits on each drive. Initially, the data on parity disk 706 would be created by applying an XOR function bit-by-bit for all the data on data disks 702 and 704. Once the parity data on disk 706 has been created, notice that any data bit on any of the three HDDs 702, 704 and 706 can be reconstructed from the other two disks by performing a bit-by-bit XOR of the good bits and writing the results bit-by-bit to the newly replaced disk. This is due to the parity preservation property of the XOR function, which holds for any number of inputs. Notice that the parity of each of the four rows across the three tables 712, 714 and 716 is even (that is, there is an even number of logic-1 bits present: 0, 2, 2 and 2 from top to bottom). So if there is an error, the correct data for each bit on the replacement disk is the binary value that will make the overall parity even.
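The XOR relationship described above can be demonstrated with a minimal Python sketch. The variable names borrow the reference numerals only for readability, and the bit patterns are hypothetical values chosen so that the four bit positions cover all four input combinations of two data bits; they are not taken from the figure.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of any number of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents: bit pairs (0,0), (0,1), (1,0), (1,1) from MSB to LSB.
data_702 = bytes([0b0011])
data_704 = bytes([0b0101])

# Create the parity disk from the two data disks.
parity_706 = xor_blocks(data_702, data_704)

# Simulate losing either data disk and rebuilding it from the other two disks.
rebuilt_704 = xor_blocks(data_702, parity_706)
rebuilt_702 = xor_blocks(data_704, parity_706)

print(f"parity={parity_706[0]:04b} rebuilt_702={rebuilt_702[0]:04b} "
      f"rebuilt_704={rebuilt_704[0]:04b}")
```

Because `xor_blocks` accepts any number of blocks, the same function covers arrays with more than two data disks: the rebuilt disk is always the XOR of every surviving disk, parity included.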
RAID 6 requires at least four HDDs to function: two data disks and two parity disks, though additional data HDDs may be added. In this double parity scheme, one of the parity disks is created as per RAID 5, while the second parity disk is created using a different parity algorithm. This arrangement allows any two HDDs to fail without losing data or availability. The use case is to allow the system operator to quickly hot-swap a failed disk while still maintaining redundancy should a second disk fail during the recreation of the first new disk. Thus it would take three simultaneous disk failures for the disk array to fail, a highly unlikely event. RAID 6 is a high-end technique typically found in enterprise class disk arrays, in data centers, and applications where data loss or inaccessibility is unacceptable.
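The text does not say which second parity algorithm is used. One widely used choice, assumed here purely for illustration, is a Reed-Solomon style "Q" parity computed in the finite field GF(2^8) alongside the ordinary XOR "P" parity. The Python sketch below covers the minimal two-data-disk case and rebuilds both data disks from the two parity disks alone; all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D               # reduce back into 8 bits
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8), found by brute-force search."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def pq_parity(d0, d1):
    """Return (P, Q) with P = D0 ^ D1 and Q = D0 ^ 2*D1 (product in GF(2^8))."""
    p = bytes(a ^ b for a, b in zip(d0, d1))
    q = bytes(a ^ gf_mul(2, b) for a, b in zip(d0, d1))
    return p, q


def rebuild_both_data(p, q):
    """Recover D0 and D1 after both data disks are lost."""
    # P ^ Q = (1 ^ 2) * D1 = 3 * D1, so D1 = (P ^ Q) / 3 and D0 = P ^ D1.
    inv3 = gf_inv(3)
    d1 = bytes(gf_mul(inv3, a ^ b) for a, b in zip(p, q))
    d0 = bytes(a ^ b for a, b in zip(p, d1))
    return d0, d1


d0, d1 = b"stack", b"DRAM!"          # hypothetical 5-byte disk contents
p, q = pq_parity(d0, d1)
recovered = rebuild_both_data(p, q)
print(recovered)
```

The two parities are independent equations in the two unknown data blocks, which is why any two disks can be lost. The other double-failure cases are simpler: a lost data disk plus a lost parity disk can be recovered from whichever parity survives, and two lost parity disks are simply recomputed from the data.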