1. Field of the Invention
The present invention relates generally to electronic systems for data storage and retrieval. More particularly, the invention is directed toward improved methods and structures for memory devices.
2. Description of the Related Art
In any engineered design there are compromises between cost and performance. The present invention introduces novel methods and structures for reducing the cost of memory devices while minimally compromising their performance. The description of the invention requires a significant amount of background including: application requirements, memory device physical construction, and memory device logical operation.
Memory device application requirements can be most easily understood with respect to memory device operation. FIG. 1 shows the general organization of a memory device. Memory device 101 consists of a core 102 and an interface 103. The core is responsible for storage of the information. The interface is responsible for translating the external signaling used by the interconnect 105 to the internal signaling carried on bus 104. The primitive operations of the core include at least a read operation. Generally, there are other operations required to manage the state of the core 102. For example, a conventional dynamic random access memory (DRAM) has at least write, precharge, and sense operations in addition to the read operation.
For purposes of illustrating the invention, a conventional DRAM core will be described. FIG. 2 is a block diagram of a conventional DRAM core 102. Since the structure and operation of a conventional DRAM core are well known in the art, only a brief overview is presented here.
A conventional DRAM core 202 mainly comprises storage banks 211 and 221, row decoder and control circuitry 210, and column data path circuit comprising column amplifiers 260 and column decoder and control circuitry 230. Each of the storage banks comprises storage arrays 213 and 223 and sense amplifiers 212 and 222.
There may be many banks, rather than just the two illustrated. Physically the row and column decoders may be replicated in order to form the logical decoder shown in FIG. 2. The column i/o lines 245 may be either bidirectional, as shown, or unidirectional, in which case separate column i/o lines are provided for read and write operations.
The operation of a conventional DRAM core is divided between row and column operations. Row operations control the storage array word lines 241 and the sense amplifiers via line 242. These operations control the movement of data from the selected row of the selected storage array to the selected sense amplifier via the bit lines 251 and 252. Column operations control the movement of data from the selected sense amplifiers to and from the external data connections 204d and 204e. 
Device selection is generally accomplished by one of the following choices:
    matching an externally presented device address against an internally stored device address;
    requiring separate operation control lines, such as RAS and CAS, for each set of memory devices that are to be operated in parallel; and
    providing at least one chip select control on the memory device.
FIG. 3 illustrates the timing required to perform the row operations of precharge and sense. In their abstract form these operations can be defined as:
    precharge (device, bank): prepare the selected bank of the selected device for sensing; and
    sense (device, bank, row): sense the selected row of the selected bank of the selected device.
The operations and device selection arguments are presented to the core via the PRECH and SENSE timing signals while the remaining arguments are presented as signals which have setup and hold relationships to the timing signals. Specifically, as shown in FIGS. 2-4, PRECH and PRECHBANK form signals on line 204a in which PRECHBANK presents the “bank” argument of the precharge operation, while SENSE, SENSEBANK and SENSEROW form signals on line 204b in which SENSEBANK and SENSEROW present the “bank” and “row” arguments, respectively, for the sense operation. Each of the key primary row timing parameters, tRP, tRAS,min, and tRCD, can vary significantly between devices using the same design and across different designs using the same architecture.
FIG. 5 and FIG. 6 illustrate the timing requirements of the read and write operations, respectively. These operations can be defined abstractly as:
    data=read (device, bank, column): transfer the data in the subset of the sense amplifiers specified by “column” in the selected “bank” of the selected “device” to the READDATA lines; and
    write (device, bank, column, mask, data): store the data presented on the WRITEDATA lines into the subset of the sense amplifiers specified by “column” in the selected “bank” of the selected “device”; optionally store only a portion of the information as specified by “mask”.
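By way of illustration only, the four abstract operations defined above (precharge, sense, read, and write) can be sketched as a toy software model of a single bank. The class name `Bank`, its constructor parameters, and the mask polarity (True means "store this byte") are assumptions made for this sketch and do not appear in the figures. Device and bank selection are omitted, and the write-back of the sensed row at precharge time is a modeling simplification rather than a description of circuit behavior.

```python
class Bank:
    """Toy model of one bank: a storage array plus one row of sense amplifiers."""

    def __init__(self, rows, columns, bytes_per_column):
        self.bytes_per_column = bytes_per_column
        self.array = [[bytes(bytes_per_column)] * columns for _ in range(rows)]
        self.open_row = None    # None means the bank is precharged (closed)
        self.sense_amps = None  # holds a copy of the sensed row while open

    def precharge(self):
        # Simplification: commit the sensed row to the array, then close the bank.
        if self.open_row is not None:
            self.array[self.open_row] = list(self.sense_amps)
        self.open_row = None
        self.sense_amps = None

    def sense(self, row):
        if self.open_row is not None:
            raise RuntimeError("bank must be precharged before sensing")
        self.open_row = row
        self.sense_amps = list(self.array[row])

    def read(self, column):
        if self.open_row is None:
            raise RuntimeError("bank must be sensed before a column operation")
        return self.sense_amps[column]

    def write(self, column, data, mask=None):
        if self.open_row is None:
            raise RuntimeError("bank must be sensed before a column operation")
        if len(data) != self.bytes_per_column:
            raise ValueError("data width must match the column width")
        if mask is None:
            mask = [True] * self.bytes_per_column  # unmasked: store every byte
        old = self.sense_amps[column]
        # Byte-level merge: take the new byte where enabled, else keep the old one.
        self.sense_amps[column] = bytes(
            new if enabled else prev for new, enabled, prev in zip(data, mask, old)
        )
```

In this sketch a masked write following an unmasked write to the same column leaves the disabled bytes unchanged, which is the merge behavior the "mask" argument is meant to express.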
More recent conventional DRAM cores allow a certain amount of concurrent operation between the functional blocks of the core. For example, it is possible to independently operate the precharge and sense operations or to operate the column path simultaneously with row operations. To take advantage of this concurrency each of the following groups may operate somewhat independently:
    PRECH and PRECHBANK on lines 204a;
    SENSE, SENSEBANK, and SENSEROW on lines 204b;
    COLCYC on line 204f, COLLAT and COLADDR on lines 204g, WRITE and WMASK on lines 204c, READDATA on line 204d, and WRITEDATA on line 204.
There are some restrictions on this independence. For example, as shown in FIG. 3, operations on the same bank observe the timing restrictions of tRP and tRAS,min. If accesses are to different banks, then the restrictions of FIG. 4 for tSS and tPP may have to be observed.
The present invention, while not limited by such values, has been optimized to typical values as shown in Table 1.
TABLE 1
Typical Core Timing Values

Symbol      Value (ns)
tRP         20
tRAS,min    50
tRCD        20
tPP         20
tSS         20
tPC         10
tDAC        7
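As a hedged illustration of how the Table 1 values constrain a schedule, the following sketch checks a list of timed operations against them. The interpretation of each parameter is the conventional one and is an assumption here, since FIG. 3 and FIG. 4 are not reproduced: tRP as precharge-to-sense on the same bank, tRAS,min as sense-to-precharge on the same bank, tRCD as sense-to-column-operation on the same bank, tSS and tPP as sense-to-sense and precharge-to-precharge across different banks, and tPC as the column cycle time. The function name `find_violations` and the schedule format are hypothetical; tDAC is listed but not checked.

```python
# Typical values from Table 1, in nanoseconds.
TIMING_NS = {"tRP": 20, "tRAS_min": 50, "tRCD": 20,
             "tPP": 20, "tSS": 20, "tPC": 10, "tDAC": 7}


def find_violations(schedule, t=TIMING_NS):
    """Return (time, parameter) pairs for every violated spacing.

    schedule: list of (time_ns, op, bank) sorted by time, where op is
    "precharge", "sense", or "column".
    """
    last_on_bank = {}   # (op, bank) -> most recent start time on that bank
    last_anywhere = {}  # op -> (time, bank) of the most recent such op
    problems = []

    def require(name, earlier, now):
        if earlier is not None and now - earlier < t[name]:
            problems.append((now, name))

    for now, op, bank in schedule:
        prev = last_anywhere.get(op)
        if op == "sense":
            require("tRP", last_on_bank.get(("precharge", bank)), now)
            if prev is not None and prev[1] != bank:
                require("tSS", prev[0], now)
        elif op == "precharge":
            require("tRAS_min", last_on_bank.get(("sense", bank)), now)
            if prev is not None and prev[1] != bank:
                require("tPP", prev[0], now)
        elif op == "column":
            require("tRCD", last_on_bank.get(("sense", bank)), now)
            if prev is not None:
                require("tPC", prev[0], now)
        else:
            raise ValueError(f"unknown operation: {op}")
        last_on_bank[(op, bank)] = now
        last_anywhere[op] = (now, bank)
    return problems


# Under these assumptions, the minimum row cycle on one bank is tRAS,min + tRP.
ROW_CYCLE_NS = TIMING_NS["tRAS_min"] + TIMING_NS["tRP"]
```

With the Table 1 values, `ROW_CYCLE_NS` evaluates to 70, so under the stated assumptions a single bank cannot be sensed more often than once every 70 ns.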
FIG. 7 shows the permissible sequence of operations for a single bank of a conventional DRAM core. It shows the precharge 720, sense 721, read 722, and write 723, operations as nodes in a graph. Each directed arc between operations indicates an operation which may follow. For example, arc 701 indicates that a precharge operation may follow a read operation.
The series of memory operations needed to satisfy any application request can be covered by the nominal and transitional operation sequences described in Table 2 and Table 3. These sequences are characterized by the initial and final bank states as shown in FIG. 8.
The sequence of memory operations is relatively limited. In particular, there is a universal sequence:
    precharge,
    sense,
    transfer (read or write), and
    close.
In this sequence, close is an alternative timing of precharge but is otherwise functionally identical. This universal sequence allows any sequence of operations needed by an application to be performed in one pass through it without repeating any step in that sequence. A control mechanism that implements the universal sequence can be said to be conflict free. A conflict free control mechanism permits a new application reference to be started for every minimum data transfer. That is, the control mechanism itself will never introduce a resource restriction that stalls the memory requester. There may be other reasons to stall the memory requester; for example, references to different rows of the same bank may introduce bank contention. A lack of control resources, however, will not be a reason for stalling the memory requester.
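The one-pass property of the universal sequence can be sketched as an ordered-subsequence test. The function name `fits_one_pass` is hypothetical, and "transfer" stands in for a whole series of column operations; this is an illustrative check, not a description of any control mechanism.

```python
UNIVERSAL = ("precharge", "sense", "transfer", "close")


def fits_one_pass(steps):
    """True if `steps` can be performed in one pass through UNIVERSAL,
    i.e. they appear in universal order and no step is repeated."""
    position = 0
    for step in steps:
        # Skip universal steps this request does not need.
        while position < len(UNIVERSAL) and UNIVERSAL[position] != step:
            position += 1
        if position == len(UNIVERSAL):
            return False  # step is out of order, repeated, or unknown
        position += 1
    return True
```

For example, a request that needs only sense, transfer, and close passes the test, whereas a request that needs a transfer followed by a sense does not fit in a single pass.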
TABLE 2
Nominal Transactions

Initial Bank State   Final Bank State   Transaction Type   Operations Performed
closed               closed             empty              sense, series of column operations, precharge
open                 open               miss               precharge, sense, series of column operations
open                 open               hit                series of column operations

TABLE 3
Transitional Transactions

Initial Bank State   Final Bank State   Transaction Type   Operations Performed
closed               open               empty              sense, <series of column operations> (optional)
open                 closed             miss               <precharge, sense, series of column operations> (optional), precharge
open                 closed             hit                <series of column operations> (optional), precharge
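The classification in Table 2 and Table 3 can be sketched as a small planning function. The name `plan_transaction` and its parameters are hypothetical. The sketch assumes at least one column operation is requested, so the optional bracketed portions of Table 3 are always present, and it labels a transaction "nominal" when the initial and final bank states match and "transitional" otherwise.

```python
def plan_transaction(open_row, target_row, leave_open, column_ops):
    """Return (category, transaction_type, operations).

    open_row:   row currently held in the sense amplifiers, or None if closed
    target_row: row the request addresses
    leave_open: True if the bank should be left open afterwards
    column_ops: non-empty list of "read"/"write" operations
    """
    if not column_ops:
        raise ValueError("this sketch assumes at least one column operation")

    operations = []
    if open_row is None:
        kind = "empty"  # bank closed: sense only
        operations.append("sense")
    elif open_row == target_row:
        kind = "hit"    # requested row is already sensed
    else:
        kind = "miss"   # a different row is open: precharge, then sense
        operations += ["precharge", "sense"]

    operations += list(column_ops)
    if not leave_open:
        operations.append("precharge")

    initially_open = open_row is not None
    category = "nominal" if initially_open == leave_open else "transitional"
    return category, kind, operations
```

For instance, a read to a closed bank that is to be closed again yields the nominal empty transaction of Table 2: sense, read, precharge.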
Memory applications may be categorized as follows:
    main memory: references generated by a processor, typically with several levels of caches;
    graphics: references generated by rendering and display refresh engines; and
    unified: combining the reference streams of main memory and graphics.
Applications may also be categorized by their reference stream characteristics. According to the application partition mentioned above, reference streams can be characterized in the following fashion:
    First, main memory traffic can be cached or uncached processor references. Such traffic is latency sensitive since typically a processor will stall when it gets a cache miss or for any other reason needs data fetched from main memory. Addressing granularity requirements are set by the transfer size of the processor cache which connects to main memory. A typical value for the cache transfer size is 32 bytes. Since multiple memory interfaces may run in parallel it is desirable that the memory system perform well for transfer sizes smaller than this. Main memory traffic is generally not masked; that is, the vast bulk of its references are cache replacements which need not be written at any finer granularity than the cache transfer size.
    Another type of reference stream is for graphics memory. Graphics memory traffic tends to be bandwidth sensitive rather than latency sensitive. This is true because the two basic graphics engines, rendering and display refresh, can both be highly pipelined. Latency is still important since longer latency requires larger buffers in the controller and causes other second order problems. The ability to address small quanta of information is important since typical graphics data structures are manipulated according to the size of the triangle being rendered, which can be quite small. If small quanta cannot be accessed then bandwidth will be wasted transferring information which is not actually used. Traditional graphics rendering algorithms benefit substantially from the ability to mask write data; that is, to merge data sent to the memory with data already in the memory. Typically this is done at the byte level, although finer level, e.g. bit level, masking can sometimes be advantageous.
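The bandwidth wasted when small quanta cannot be addressed can be illustrated with simple arithmetic. The function name `transfer_efficiency` and the 8-byte request size used in the example are hypothetical; only the 32-byte cache transfer size comes from the text above.

```python
def transfer_efficiency(useful_bytes, granularity_bytes):
    """Fraction of transferred bytes that the requester actually uses when
    every access is rounded up to a whole number of minimum-size transfers."""
    transfers = -(-useful_bytes // granularity_bytes)  # ceiling division
    return useful_bytes / (transfers * granularity_bytes)
```

Under these assumptions, a graphics reference that needs 8 bytes from a device whose minimum transfer is 32 bytes uses only one quarter of the data bandwidth it consumes.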
As stated above, unified applications combine the characteristics of main memory and graphics memory traffic. As electronic systems achieve higher and higher levels of integration the ability to handle these combined reference streams becomes more and more important.
Although the present invention can be understood in light of the previous application classification, it will be appreciated by those skilled in the art that the invention is not limited to the mentioned applications and combinations but has far wider application. In addition to the specific performance and functionality characteristics mentioned above it is generally important to maximize the effective bandwidth of the memory system and minimize the service time. Maximizing effective bandwidth requires achieving a proper balance between control and data transport bandwidth. The control bandwidth is generally dominated by the addressing information delivered to the memory device. The service time is the amount of time required to satisfy a request once it is presented to the memory system. Latency is the service time of a request when the memory system is otherwise devoid of traffic. Resource conflicts, either for the interconnect between the requester and the memory devices, or for resources internal to the memory devices such as the banks, generally determine the difference between latency and service time. It is desirable to minimize average service time, especially for processor traffic.
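The distinction between latency and service time can be sketched with a single contended resource. The function name `service_times`, the single-resource model, and all numeric values in the example are assumptions chosen for illustration; they are not taken from the figures or tables.

```python
def service_times(arrivals_ns, latency_ns, occupancy_ns):
    """Service time of each request sharing one resource (e.g. a bank).

    arrivals_ns:  request arrival times, sorted ascending
    latency_ns:   service time of a request in an otherwise idle system
    occupancy_ns: how long each request keeps the shared resource busy
    """
    free_at = 0
    result = []
    for arrival in arrivals_ns:
        start = max(arrival, free_at)  # wait if the resource is still busy
        free_at = start + occupancy_ns
        result.append((start - arrival) + latency_ns)
    return result
```

With arrivals at 0, 10, and 100 ns, a latency of 40 ns, and an occupancy of 30 ns, the second request waits 20 ns for the resource, so its service time (60 ns) exceeds the latency, while the first and third requests see the latency alone.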
The previous section introduced the performance aspects of the cost-performance tradeoff that is the subject of the present invention. In this section the cost aspects are discussed. These aspects generally result from the physical construction of a memory device, including the packaging of the device.
FIG. 9 shows the die of a memory device 1601 inside of a package 1620. For typical present day device packages, the bond pads, such as 1610, have a center to center spacing significantly smaller than that of the pins of the device, such as 1640. This requires that there be some fan-in from the external pins to the internal bonding pads. As the number of pads increases the length of the package wiring, such as 1630, grows. Observe that elements 1630 and 1640 are alternately used to designate package wiring.
There are many negative aspects to the increase in the length of the package wiring 1640. The overall size of the package increases, which makes the package cost more to produce and requires more area and volume when the package is installed in the next level of the packaging hierarchy, such as on a printed circuit board. Also, the stub created by the longer package wiring can affect the speed of the interconnect. In addition, mismatch in package wiring lengths due to the fan-in angle can affect the speed of the interconnect due to mismatched parasitics.
The total number of signal pins has effects throughout the packaging hierarchy. For example, the memory device package requires more material; the next level of interconnect, such as a printed circuit board, requires more area; connectors, if used, will be more expensive; and the package and die area of the master device will grow.
In addition to all these cost concerns based on area and volume of the physical construction, another cost concern is power. Each signal pin, especially each high speed signal pin, requires additional power to run the transmitters and receivers in both the memory devices and the master device. Added power translates to added cost since the power must be supplied and then dissipated with heat sinks.
The memory device illustrated in FIG. 10 uses techniques typical of present day memory devices. In this device 1701, a single shared command bus 1710 in conjunction with the single address bus 1720 and mask bus 1730 is used to specify all of the primitive operations comprising precharge, sense, read, and write in addition to any other overhead operations such as power management.
FIG. 11 illustrates the operation of the memory device of FIG. 10. The illustrated reference sequence, when classified according to Table 2 and the universal sequence previously described, comprises:
    write empty: sense 1851, write 1853 with mask 1871, data 1881, close (precharge) 1861;
    write miss: precharge 1852, sense 1854, write 1856 with mask 1872, data 1882;
    read hit: read 1857, tristate control 1873, data 1883; and
    transitional write miss: precharge 1855, sense 1858, write 1859, mask 1874, data 1884, close (precharge) 1862.
In FIG. 11 each box represents the amount of time required to transfer one bit of information across a pin of the device.
In addition to illustrating a specific type of prior art memory device, FIG. 11 can be used to illustrate a number of techniques for specifying data transfers. One prior art technique uses an internal register to specify the number of data packets transferred for each read or write operation. When this register is set to its minimum value and the reference is anything besides a hit, the device has insufficient control bandwidth to specify all the required operations while simultaneously keeping the data pins highly utilized. This is shown in FIG. 11 by the gaps between data transfers. For example, there is a gap between data a, 1881, and data b, 1882. Even if sufficient control bandwidth were provided, some prior art devices would also require modifications to their memory cores in order to support high data pin utilization.
The technique of specifying the burst size in a register makes it difficult to mix transfer sizes unless the burst size is always programmed to be the minimum, which then increases control overhead. The increase in control overhead may be so substantial as to render the minimum burst size impractical in many system designs.
Regardless of the transfer burst size, the technique of a single unified control bus, using various combinations of the command pins 1810, address pins 1820, and mask pins 1830 places limitations on the ability to schedule the primitive operations. A controller which has references in progress that are simultaneously ready to use the control resources must sequentialize them, leading to otherwise unnecessary delay.
Read operations do not require masking information. This leaves the mask pins 1830 available for other functions. Alternately, the mask pins during read operations may specify which bytes should actually be driven across the pins as illustrated by box 1873.
Another technique is an alternative method of specifying that a precharge should occur by linking it to a read or write operation. When this is done the address components of the precharge operation need not be respecified; instead, a single bit can be used to specify that the precharge should occur. One prior art method of coding this bit is to share an address bit not otherwise needed during a read or write operation. This is illustrated by the “A-Prech” boxes, 1861 and 1862.
FIG. 12 shows a sequence of four read references each comprising all the steps of the universal sequence. Although the nominal transactions of Table 2 do not require the multiple precharge steps of the universal sequence, it is useful to examine how well a device handles the universal sequence in order to understand its ability to support mixed empty and miss nominal transactions, as well as the transitional transactions of Table 3. As can be seen, the data pins are poorly utilized. This indicates that control contention will limit the ability of the device to transfer data for various mixes of application references. The utilization of the data pins could be improved by making the burst length longer. However, some applications, such as graphics applications, require short transfers rather than long ones.
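The trade-off between burst length and data pin utilization on a single shared control bus can be approximated with a simple bound. This model is an assumption made for illustration and is not derived from FIG. 12: each reference is taken to occupy the control bus for a fixed number of command slots and the data pins for one slot per data packet, and the two are assumed to overlap perfectly between references. The function name `data_pin_utilization` and its parameters are hypothetical.

```python
def data_pin_utilization(burst_packets, commands_per_reference,
                         cycles_per_command=1, cycles_per_packet=1):
    """Upper bound on data pin utilization when every reference needs
    `commands_per_reference` commands on one shared control bus."""
    control_cycles = commands_per_reference * cycles_per_command
    data_cycles = burst_packets * cycles_per_packet
    # The busier of the two resources sets the rate of new references.
    return data_cycles / max(control_cycles, data_cycles)
```

Under this model, a reference that needs all four steps of the universal sequence but transfers only one data packet can keep the data pins at most 25% busy; raising the burst to four packets removes the control bottleneck, but only for applications that can use the longer transfer.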
Another technique makes the delay from write control information to data transfer different from the delay of read control information to data transfer. When writes and reads are mixed, this leads to difficulties in fully utilizing the data pins.
Thus, current memory devices have inadequate control bandwidth for many application reference sequences. Current memory devices are unable to handle minimum size transfers. Further, current memory devices utilize the available control bandwidth in ways that do not support efficient applications. Current memory devices do not schedule the use of the data pins in an efficient manner. In addition, current memory devices inefficiently assign a bonding pad for every pin of the device.
Like reference numerals refer to corresponding parts throughout the drawings.