1. Field of the Invention
The present invention relates to semiconductor memories, and particularly to the writing of those memories incorporating a write queue.
2. Description of Related Art
Semiconductor random-access memory devices or sub-systems using arrays of dynamic memory cells (e.g., 1-transistor/1-capacitor (1T/1C) cells) have consistently provided greater density and lower cost per bit than those using static memory cells (e.g., 6-transistor (6T) cells, or 4-transistor/2-resistor (4T/2R) cells). However, such dynamic random-access memory arrays have historically also been lower in performance when compared to static random-access memory arrays. Consequently, system designers have typically chosen dynamic memory arrays (e.g., commercially available dynamic random access memories, or DRAMs) when high density and low cost are required, such as for CPU main memory applications. Conversely, designers have typically chosen static memory arrays when the highest possible performance is required, such as for cache memory and high speed buffer applications. Examples of static memory array devices or sub-systems include commercially available static random access memories (SRAMs) and CPU-resident on-board cache memory sub-systems.
The reasons often cited for the lower performance of dynamic memory arrays include the destructive sensing of all memory cells common to the addressed word line (encountered in virtually all dynamic memory arrays) and the consequential need to restore data back into each sensed memory cell during the active cycle, the need to equilibrate bit lines and various other differential nodes and to precharge various circuit nodes between active cycles, and the requirement for periodic refreshing of all dynamic memory cells.
Over the years various capabilities have been included on many circuits incorporating dynamic memory arrays to lessen the difficulty of dealing with the refresh requirements of the dynamic memory cells. On-chip refresh counters are frequently used to store a refresh address, which is used during a refresh cycle (rather than the externally provided address) to access the next row requiring refreshing, after which the refresh address is usually incremented in preparation for the next refresh cycle. These on-chip counters are helpful, even if a refresh cycle is controlled by an external clock signal, because the address path from the system need not include the delay and complexity of a multiplexer to switch between the system memory address and a refresh address. Self-refresh timing control circuits are sometimes included to automatically determine when a refresh cycle should be performed, and to automatically initiate such a cycle if the memory is not already occupied in carrying out an external memory cycle request. At one time, the asynchronous arbitration between an external cycle request and an internal refresh cycle request was worrisome because of potential meta-stability concerns, but more recently, with the increasing popularity of synchronous memories, such control circuits are also synchronous and meta-stability problems in determining what kind of cycle to initiate are largely eliminated.
One problem, however, that remains a concern for system designers is ensuring, over all possible system memory operations and address sequences, that enough time is available for sufficient refresh cycles. That is, even if the refresh control is totally handled on-chip, the memory (or portions of the memory) must be "idle" at least often enough to allow an occasional refresh cycle to execute. When this cannot be assured, the memory frequently must intercede over system accesses and take the necessary time to perform the refresh cycle, thus interrupting or at least delaying (e.g., wait states) normal access to the memory. Such delays degrade system efficiency and performance. Consequently, continued improvements are still desired.
In addition, at ever increasing frequencies of operation, and with more and more portable battery operated equipment in use, power dissipation is becoming ever more important. There is a continuing need to reduce power consumption wherever possible.
In an integrated circuit incorporating a write queue, the address (or a portion thereof) of a given external write cycle may be stored and compared to the address of a subsequent external write cycle. If the selected memory cells to be written in both external write cycles correspond to the same physical word line and the same column within the same array block of the same memory bank, the internal write operation which would otherwise follow from the first external write cycle is delayed, and the data to be written is queued and merged with the data to be written in the subsequent external write cycle. The write queue then "retires" both queued write requests by performing a single internal write operation, simultaneously writing both data words received in the two external write cycles. Such a "merging" of write cycles keeps the ultimately selected memory bank inactive during the "merged" cycle, which allows a hidden refresh cycle to occur in the selected memory bank during the "merged" cycle. Moreover, a significant amount of internal power consumption is saved compared to performing two separate write operations since the selected memory bank is cycled only once (instead of twice) to write the two words. This is particularly attractive when accessing the memory using sequential addresses, as would frequently occur during a burst mode access or when accessing a contiguous block of data, such as a cache line fill operation for a processor. Such sequentially-addressed consecutive write cycles may be merged even if a non-write cycle occurs between the two consecutive write cycles (i.e., the consecutive write cycles need not be consecutive cycles). Moreover, other kinds of memory arrays, particularly static memory arrays, also can benefit greatly from the power saved by merging write cycles and performing one write operation instead of two.
Any write-able memory array already incorporating a write queue, or to which a write queue may be added, can benefit from this invention.
In an exemplary embodiment of the present invention, a dynamic memory array includes an internal data path to and from the array that is twice as wide as the external I/O word width. A 72-bit internal data path conveys two 36-bit words, selected by the least significant address bit. If the internal data path were wider than two 36-bit words, then more than two 36-bit write cycles could be merged into a single internal write operation. For example, if the internal data path were 144-bits wide, then four 36-bit write cycles could be merged into a single internal write operation. Moreover, there is no reason to limit cycle merging at just two consecutive cycles. As an additional example, four sequential external write cycles, each writing a different (or over-writing the same) 9-bit byte within a 36-bit word corresponding to a given address, followed by four more sequential external write cycles, each writing a different (or over-writing the same) 9-bit byte within a 36-bit word at an address which differs from the given address only in the LSB, may be carried out internally as a single internal write operation, simultaneously writing all 72-bits (assuming all 8 bytes were byte-write enabled in at least one of the eight cycles) into the selected memory cells.
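The eight-cycle byte-write example above may be illustrated in software. The following is a minimal behavioral sketch only, not the disclosed circuitry; the names new_entry and merge_write, the dictionary layout, and the address value 0x10 are hypothetical and chosen purely for illustration. It assumes a 72-bit internal line holding two 36-bit words of four 9-bit bytes each, with the least significant address bit selecting the word.

```python
BYTE_BITS = 9     # 9-bit bytes, per the exemplary embodiment
WORD_BYTES = 4    # one 36-bit external word = four 9-bit bytes
LINE_WORDS = 2    # 72-bit internal data path = two external words
LINE_BYTES = WORD_BYTES * LINE_WORDS


def new_entry(line_addr):
    """Create an empty pending entry for one internal 72-bit line (hypothetical layout)."""
    return {"line": line_addr, "data": [0] * LINE_BYTES, "valid": [False] * LINE_BYTES}


def merge_write(entry, addr, word_bytes, byte_enables):
    """Merge one external write cycle into a pending entry.

    Later data supersedes earlier data for any commonly-addressed byte.
    """
    if addr >> 1 != entry["line"]:
        raise ValueError("address does not map to the pending internal line")
    base = (addr & 1) * WORD_BYTES  # the LSB selects which 36-bit word
    for i in range(WORD_BYTES):
        if byte_enables[i]:
            entry["data"][base + i] = word_bytes[i] & ((1 << BYTE_BITS) - 1)
            entry["valid"][base + i] = True
    return entry


# Eight external cycles, each enabling a single byte, collapse into one entry.
entry = new_entry(line_addr=0x10)
for word_sel in (0, 1):
    for b in range(WORD_BYTES):
        enables = [i == b for i in range(WORD_BYTES)]
        data = [0x100 + word_sel * WORD_BYTES + b] * WORD_BYTES
        merge_write(entry, (0x10 << 1) | word_sel, data, enables)

print(all(entry["valid"]))  # True: all 8 bytes may be written in one internal operation
```

In this sketch the valid flags stand in for accumulated byte-write enables, so that only bytes enabled in at least one merged cycle would be driven into the array.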
In a broader embodiment of the present invention, an integrated circuit includes a memory array including a plurality of memory cells, a write queue circuit for storing address information and data for at least one pending internal write operation into the memory array, and a write decision circuit for determining whether a first group of memory cells to be otherwise written by a pending internal write operation and a second group of memory cells to be otherwise written by another internal write operation corresponding to a subsequently-received write cycle request may instead be both written using a single internal write operation. Also included is a write data merging circuit responsive to the write decision circuit for merging, if the first and second groups of memory cells may be both written using a single internal write operation, write data associated with the subsequently-received write cycle request into, and superseding any commonly-addressed data bits of, write data associated with the pending internal write operation. An internal write operation control circuit is included and configured to perform a single internal write operation to write the merged data into the memory array if the first and second groups of memory cells may both be written using a single internal write operation.
The internal write operation control circuit may be further arranged, if the first and second groups of memory cells may both be written using a single internal write operation, to omit the pending internal write operation, and to perform the single internal write operation to write the merged data into the memory array at a time after the pending internal write operation would otherwise have been performed. Alternatively, the internal write operation control circuit may be further arranged, if the first and second groups of memory cells may both be written using a single internal write operation, to perform the single internal write operation to write the merged data into the memory array at a time when the pending internal write operation would otherwise have been performed, and to omit an internal write operation that would have subsequently been performed corresponding to the subsequently-received write cycle request.
The write decision circuit may also be arranged to compare at least a portion of the address information associated with the pending internal write operation to corresponding address information associated with the subsequently-received write cycle request. The subsequently-received write cycle request may include internally generated address information for a subsequent write cycle of a burst. In other embodiments, the subsequently-received write cycle request may include an externally-received address. The address information for a given write cycle request may be a non-decoded address, or may be a partially decoded address. In other embodiments, the write decision circuit may be arranged to determine whether the first and second groups of memory cells may be both written using a single internal write operation by utilizing a signal indicating that the subsequently-received write cycle request corresponds to a subsequent write cycle of a burst.
In another embodiment of the present invention, an integrated circuit includes a memory array including a plurality of memory cells, write queue means for storing at least address information for at least one pending internal write operation, means for determining whether a first group of memory cells to be otherwise written by a pending internal write operation and a second group of memory cells to be otherwise written by another internal write operation corresponding to a subsequently-received write cycle request may instead be both written using a single internal write operation, means for merging write data associated with the subsequently-received write cycle request into, and superseding any commonly-addressed data bits of, write data associated with the pending internal write operation, and means for performing a single internal write operation to write the merged data rather than two separate internal write operations.
In yet another embodiment of the present invention suitable for use in an integrated circuit having a memory array and containing a write queue for storing at least address information associated with at least one pending internal write operation into the memory array, a method of operating the integrated circuit includes determining whether a first group of memory cells to be otherwise written by a pending internal write operation and a second group of memory cells to be otherwise written by another internal write operation corresponding to a subsequently-received write cycle request may instead be both written using a single internal write operation, and if so, then merging write data associated with the subsequently-received write cycle request into, and superseding any commonly-addressed data bits of, write data associated with the pending internal write operation, and performing a single internal write operation to write the merged data into the memory array.
In still another embodiment of the present invention suitable for use in an integrated circuit having a memory array and containing a write queue for storing at least address information associated with at least one pending internal write operation into the memory array, a method of operating the integrated circuit includes comparing at least a portion of the address information associated with a pending internal write operation stored within the write queue to corresponding address information associated with a subsequently-received write cycle request to determine whether a first group of memory cells to be otherwise written by the pending internal write operation and a second group of memory cells to be otherwise written by another internal write operation corresponding to a subsequently-received write cycle request may instead be both written using a single internal write operation. If so, the method includes then skipping the pending internal write operation, merging write data associated with the subsequently-received write cycle request into, and superseding any commonly-addressed data bits of, write data associated with the pending internal write operation, and performing a single internal write operation to write the merged data. If not so, the method includes then performing the pending internal write operation in its normal order, and then performing another internal write operation to write data associated with the subsequently-received write cycle request.
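The compare, skip, merge, and single-write sequence of the method described above may be sketched behaviorally as follows. This is an illustrative model under stated assumptions, not the disclosed implementation: the class WriteQueue, its one-entry depth, the flush method, and the internal_writes log are all hypothetical, and the address comparison is reduced to dividing the word address by the number of external words per internal line.

```python
class WriteQueue:
    """One-entry behavioral write queue that merges writes to the same internal line.

    Hypothetical model: queue depth, names, and retirement policy are assumptions.
    """

    def __init__(self, words_per_line=2):
        self.words_per_line = words_per_line
        self.pending = None          # (line_addr, {word_index: data}) or None
        self.internal_writes = []    # log of internal write operations performed

    def write(self, addr, data):
        line, word = divmod(addr, self.words_per_line)
        if self.pending is not None and self.pending[0] == line:
            # Same internal line: skip the pending write for now and merge,
            # with the newer data superseding any commonly-addressed word.
            self.pending[1][word] = data
            return
        # Different line: perform the pending write in its normal order,
        # then queue the new request.
        self.flush()
        self.pending = (line, {word: data})

    def flush(self):
        """Retire the pending entry as a single internal write operation."""
        if self.pending is not None:
            self.internal_writes.append(self.pending)
            self.pending = None


q = WriteQueue()
q.write(8, "A")    # line 4, word 0
q.write(9, "B")    # line 4, word 1: merged with the pending write
q.write(20, "C")   # line 10: pending line 4 is retired first
q.flush()
print(len(q.internal_writes))  # 2 internal operations for 3 external write cycles
```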
The scope of the present invention in its many embodiments is defined in the appended claims. Nonetheless, the invention and its many features and advantages may be more fully appreciated in the context of exemplary implementations disclosed and described herein which combine one or more embodiments of the invention with other concepts, architectures, circuits, and structures to achieve significantly higher performance than previously achievable. For example, a high performance dynamic memory array architecture is disclosed in several embodiments, along with various embodiments of associated supporting circuitry, which afford performance approaching that usually associated with static memory arrays.
In an exemplary embodiment an 18 MBit memory array includes four banks of arrays, each including thirty-two array blocks. Each array block includes 128 horizontally-arranged row lines (i.e., word lines) and 1152 (1024×9/8) vertically-arranged columns. Most internal circuitry operates using a single positive power supply voltage, VDD, and the reference voltage VSS (i.e., "ground"). Each column is implemented as a complementary folded bit line pair. Four independent row decoders are provided respectively for the four banks, and are physically arranged in two pairs, thus forming two splines, one spline located between the left pair of memory banks, and the other spline located between the right pair of memory banks. Latching input buffers for address and control inputs are located within each of the splines and are connected to respective input pads by horizontally arranged input wires running through the memory banks. Two input buffers are provided for each input pad, one located in each spline. Clock lines used to strobe the various inputs are arranged vertically, running through each spline. An R-C compensation circuit between each input wire and the corresponding latching input buffer located in the particular spline nearest its respective input pad provides a delay to the "upstream" buffer which compensates for the additional wiring delay in reaching the "downstream" buffer, and which allows all of the latching input buffers to be driven by phase-aligned clock signals, and still achieve a very narrow worst case setup and hold time over all such inputs.
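The stated organization may be checked against the 18 MBit capacity with simple arithmetic. The short sketch below uses only the figures given above; the variable names are illustrative.

```python
banks = 4
blocks_per_bank = 32
rows_per_block = 128                 # word lines per array block
columns_per_block = 1024 * 9 // 8    # 1152 columns (9 bits per 8-bit byte)

total_bits = banks * blocks_per_bank * rows_per_block * columns_per_block

print(columns_per_block)       # 1152
print(total_bits)              # 18874368
print(total_bits // 2**20)     # 18 (i.e., 18 MBit)
```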
The use of a separate input buffer in each spline for each address and control input, requiring additional interconnect wire to connect each input pad to its input buffer in the "far" spline (above and beyond the interconnect wire to connect each input pad to its input buffer in the "near" spline), increases the input capacitance of each address and control input to the chip (which input capacitance, of course, must be driven by the source of the external signal). However, the complementary internal outputs for each such input buffer may be buffered immediately by self-resetting buffers, and need only drive decoder and/or control circuitry locally within the same spline. Thus, the total capacitive loading on the complementary outputs of each buffer is advantageously reduced and is more balanced between the various buffers.
The row decoder uses predecoding to reduce the total line capacitance driven during an active cycle. The final stages of the row decoder include an N-channel tree configuration driven by VDD-level (i.e., VSS-to-VDD level) pre-decoded address signals to select and discharge to VSS a particular decode node which was precharged to VPP. Subsequent buffering stages provide a final 1-of-4 decode and drive the selected word line to a VPP voltage that is substantially independent of VDD, rather than driving the selected word line to VDD or to a voltage which is a ratio of VDD. There are no race conditions within the decoder, even though it accomplishes a level shifting from VDD-level signals to VPP-level word lines.
The VPP voltage is internally generated by a charge pump type circuit and its output is a substantially fixed voltage independent of process and environmental corner which is regulated with respect to VSS (i.e., ground). For typical operating voltage, the VPP voltage is somewhat higher than VDD, although at low operating voltage the VPP voltage may be substantially higher than VDD, while at high operating voltage, the VPP voltage may be similar in magnitude to the VDD voltage. Preferably the VPP voltage is chosen to be near the maximum voltage that the field effect transistors (FETs) can safely tolerate. Since the VPP is regulated to be substantially independent of variations in the VDD voltage, the VPP level is advantageously at a higher voltage than would otherwise be safe, and tolerances in the VPP voltage level which would otherwise be necessary to account for variations in the VDD level are unnecessary.
If the semiconductor technology allows, transistors which are exposed to the VPP level (e.g., transistors whose gate terminal is driven at any time to the VPP level while the source or drain terminal might be at ground, such as the memory array access transistors and various array select transistors, or those transistors whose drain or source terminal is driven at any time to the VPP level while the gate terminal might be at ground) are preferably implemented using a thicker gate dielectric than the majority of the other transistors which are never exposed to such a high differential voltage across gate-to-drain or gate-to-source terminals. Moreover, it is also preferable to limit the voltage across any transistor using the thin gate dielectric to no more than VDD. Transistors exposed to any voltage which is greater than the VDD level are preferably implemented with the thick gate dielectric and are limited in voltage to the VPP level, which is a fixed voltage substantially independent of the VDD voltage. Consequently, transistors exposed to such internally "boosted" voltages need only withstand a relatively fixed, predictable voltage level (e.g., by using a bandgap reference in the circuit which regulates the VPP voltage) and do not need to withstand even higher voltages which might otherwise be produced by a "boosted" voltage generator whose output voltage is a ratio of VDD (e.g., 1.5×VDD). The voltage across the memory cell capacitors is limited to less than one-half VDD (e.g., limited to about 1.0 volts for certain embodiments). A third dielectric material, thinner than the "thin" capacitor dielectric required for typical DRAM memory cells (which must normally support a voltage of one-half the maximum allowed VDD voltage) may be advantageously used to fabricate the memory cell capacitors to provide additional storage capacitance per unit area.
Within each memory bank, a row of sense amplifiers is implemented in the holes between each pair of array blocks. Each sense amplifier is shared between two pairs of bit lines: one pair located within the array block above the sense amplifier and the other pair located within the array block below the sense amplifier. The complementary internal nodes within each sense amplifier are respectively connected to the true and complement bit lines above the sense amplifier by a first pair of N-channel array select transistors whose gates are driven to VSS (to isolate the sense amplifier nodes from the bit line pair) or driven to VPP (to connect the sense amplifier nodes to the bit line pair), and are further connected to the pair of bit lines below the sense amplifier by a second pair of array select transistors whose gates are likewise switchable from VSS to VPP. A row of sense amplifiers is implemented above the top array block and another row of sense amplifiers is implemented below the bottom array block of the given memory bank, which serve half of the bit lines within the top and bottom array blocks, respectively. For any particular array block, half of the bit line pairs are served by a sense amplifier located above the array block, and the remaining half are served by a sense amplifier located below the array block. A pair of array select transistors having a gate voltage switchable between VSS and VPP connects any given pair of bit lines to the complementary internal sense amplifier nodes within the corresponding sense amplifier.
An amplifier in the read path is used to develop signal on a generic I/O line before bit line sensing has occurred. Such a generic I/O line may include a global output line, a column line, or an I/O line. This amplifier may be connected to the bit lines, the sense amplifier nodes, a local I/O line serving, for example, a few bit line pairs, or a local output line similarly serving, for example, a few bit line pairs. If the read amplifier inputs are connected directly to the bit line sense amplifier nodes (i.e., one read amplifier per bit line sense amplifier), the column select function may be advantageously used to enable the amplifier for the selected column, while if the read amplifier inputs are connected to local output or I/O lines (i.e., one read amplifier per group of bit line sense amplifiers), the column select function may be used to couple the selected bit line sense amplifier to the local output or I/O lines. If the common mode voltage of the read amplifier input nodes is so low that current flow through the tail of an N-channel differential pair cannot be assured for all voltage or process corners, the amplifier may incorporate a coupling circuit to capacitively couple the tail of the differential pair downward, preferably using a controlled current source, to approximate a constant current source to a negative supply voltage.
In a certain embodiment, each read amplifier's inputs are connected to the internal nodes of a corresponding bit line sense amplifier. The respective outputs of a group of read amplifiers are connected in common to a horizontally-arranged differential pair of local output lines. One such amplifier is enabled at a time by column select circuitry to develop signal on the pair of local output lines. A second stage amplifier then further buffers this signal and drives a pair of vertically-arranged global output lines. The global output lines extend the full height of the memory bank, with half preferably extending beyond the memory bank to I/O circuits above the memory bank, with the remaining half extending beyond the memory bank to I/O circuits below the memory bank. In certain embodiments, the second stage amplifier may also include a multiplexer to choose between two different pairs of local output lines (e.g., a first pair of local output lines serving 8 sense amplifiers located to the left of the second stage amplifier, and a second pair of local output lines serving 8 sense amplifiers located to the right of the second stage amplifier).
The word lines within the array blocks may be implemented in a polysilicon layer and strapped using a later-processed metal layer to reduce word line delays. Such word line straps are preferably implemented using two different layers of metal (preferably the two "lowest" layers, metal-1 and metal-2) in order to match the word line pitch without requiring any distributed buffers or final decode buffers. The read amplifiers used to sense a local output line and subsequently drive a global output line may be advantageously located above word line straps where a break in the memory cell stepping already occurs. This allows the read amplifier block to more readily be laid out in the center of a group of bit line sense amplifier and column select circuits. As such, the bit line sense amplifier pitch may be slightly less than twice the column pitch (recalling that half of the bit line sense amplifiers are above the array block and the remaining half below the array block).
The bit line sense amplifiers each are implemented using a full CMOS cross-coupled latch. To sense the signal on a pair of bit lines, both the cross-coupled N-channel pair of transistors (i.e., the NMOS sense amplifier) and the cross-coupled P-channel pair of transistors (i.e., the PMOS sense amplifier) which form the CMOS sense amplifier are enabled at substantially the same time. The NMOS sense amplifier drives the bit line having a lower voltage toward VSS, while the PMOS sense amplifier drives the bit line having a higher voltage toward VDD. If enabled a sufficiently long time, the lower bit line substantially reaches VSS and the higher bit line would be driven substantially all the way to VDD. However, the PMOS sensing is terminated before the higher bit line substantially reaches the full VDD voltage. This allows the bit line to quickly be driven to a high level without having to wait for the "exponential tail" if it were driven all the way to VDD. The internal sense amplifier nodes and the near end of the bit lines are actually driven above and overshoot the final high bit line "restore" level (e.g., 2.0 volts for a device operating at a VDD of 2.5 volts) before the PMOS sensing is terminated, whereas the far end of the high bit lines has not yet reached the final high bit line "restore" level when the PMOS sensing is terminated. Then, after the PMOS sensing is terminated, charge is shared between the near end and far end of the bit lines, thus speeding up the far end reaching the final high bit line "restore" level because the effective time constant of the resistive bit line is cut in half.
Since the word line and array select lines are left high for some time even after the PMOS sense amplifier is turned off, charge sharing between the sense amplifier nodes, the near and far ends of the bit lines, and the memory cell storage node itself contributes to determining the final high restore level which is "written" back into the selected memory cell. When compared to having a full VDD level on a high bit line, the relatively low final "high" bit line voltage (e.g., 2.0 volts) transfers into the selected memory cell more quickly due to the higher gate-to-source voltage of the memory cell access transistor.
The NMOS sensing is preferably continued, even after the PMOS sensing has stopped, to more adequately drive the bit line having the lower voltage (the "low-going" bit line) to a substantially full VSS level. This ensures that, if the selected memory cell happens to be coupled to the low-going bit line, a substantially full VSS level is restored into the selected memory cell. This also ensures that all the low-going bit lines (not just those having a selected memory cell connected thereto) are fully discharged before, at the end of the cycle, the high and low bit lines share their charge to set the bit line equilibrate voltage. The selected word line (which is driven when active to the VPP level) is then brought low as the NMOS sensing is terminated, after which the array block is automatically taken into precharge.
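The dependence of the bit line equilibrate voltage on the restored high and low levels may be illustrated by a simple charge-conservation estimate. The sketch below assumes ideal charge sharing between one high and one low bit line; the equal, normalized capacitances and the function name equilibrate_voltage are assumptions made for illustration and are not taken from the disclosure.

```python
def equilibrate_voltage(v_high, v_low, c_high=1.0, c_low=1.0):
    """Voltage after two capacitive nodes share charge (ideal, lossless model).

    c_high and c_low are assumed, normalized capacitances of the two bit lines.
    """
    return (c_high * v_high + c_low * v_low) / (c_high + c_low)


# A high bit line restored to about 2.0 volts and a low bit line fully
# discharged to VSS (0 volts), with equal capacitance assumed on each.
print(equilibrate_voltage(2.0, 0.0))  # 1.0
```

Under these assumptions, any residual voltage left on the low-going bit lines would raise the equilibrate voltage, which is consistent with the stated preference for fully discharging them before equilibration.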
Timing circuitry is used to time the simultaneous start of both NMOS and PMOS sensing relative to the timing of the selected word line being driven high, to time the end of PMOS sensing, and to time the simultaneous end of NMOS sensing and the selected word line being brought low. The PMOS sense timing duration may be designed to decrease as the VDD voltage increases to ensure a written high level which is substantially independent of VDD, even over process and temperature corners. For example, the timing may be set to ensure a written high level on the high bit line (and into the selected memory cell) of about 2.0 volts for a device having a VDD voltage range from 2.3 to 2.9 volts. Such a PMOS sense timing generator may be accomplished by using a dummy bit line and sense amplifier structure (activated substantially before the main sense amplifiers are activated), detecting when the PMOS sensing needs to be turned off to achieve a final high voltage of about 2.0 volts on the dummy sense amplifier and bit line structure, then buffering this timing signal to control the turn off time of the PMOS sense enable signals for the regular sense amplifiers within the memory arrays. The PMOS timing may alternatively be accomplished using a string of inverters powered at a voltage a fixed amount below VDD, or by other techniques to achieve a timing which is a combination of several variables, such as power supply voltage VDD, bandgap voltage, transistor threshold voltage and transconductance, temperature, or others.
In a preferred embodiment, the sense amplifier timing circuitry produces three main timing signals. The first timing signal is used to control, relative to the timing of the selected word line being driven high, the simultaneous start of both the NMOS and PMOS sensing. A second timing signal is used to control, relative to the simultaneous start of NMOS and PMOS sensing, the duration of the PMOS sensing, and a third timing signal is used to control, relative to the end of the PMOS sensing, when to simultaneously end the NMOS sensing and bring the selected word line back low. Each of these timing signals is independently generated, although the circuitry used for each may share portions with another. These three timing signals define three timing intervals. The timing interval "t1" begins with the selected word line being driven high and ends with the simultaneous start of both the NMOS and PMOS sensing (i.e., the timing interval "t1" is the amount of time the selected word line is high before sensing). The timing interval "t2" extends from the simultaneous start of NMOS and PMOS sensing to the end of PMOS sensing (i.e., the timing interval "t2" is the duration of the PMOS sensing). The timing interval "t3" extends from the end of the PMOS sensing to the simultaneous end of the NMOS sensing and discharge of the selected word line (i.e., the timing interval "t3" is the amount of time the word line remains high after the end of PMOS sensing).
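The relationship between the three timing intervals and the resulting sequence of events may be summarized in a short sketch. The function name sense_schedule and the interval values used in the example are hypothetical, arbitrary time units chosen only to show the ordering; no actual durations are given in this description.

```python
def sense_schedule(t1, t2, t3, word_line_high_at=0.0):
    """Return event times implied by the intervals t1, t2, and t3.

    t1: word line high until NMOS and PMOS sensing start together
    t2: duration of PMOS sensing
    t3: end of PMOS sensing until NMOS sensing ends and the word line goes low
    """
    sense_start = word_line_high_at + t1
    pmos_end = sense_start + t2
    nmos_end_and_word_line_low = pmos_end + t3
    return {
        "word_line_high": word_line_high_at,
        "nmos_pmos_sense_start": sense_start,
        "pmos_sense_end": pmos_end,
        "nmos_sense_end_word_line_low": nmos_end_and_word_line_low,
    }


# Hypothetical interval values in arbitrary time units.
schedule = sense_schedule(t1=1.0, t2=1.5, t3=2.0)
for event, time in schedule.items():
    print(event, time)
```

In this model the word line is high for t1 + t2 + t3 in total, while NMOS sensing lasts t2 + t3 and PMOS sensing lasts only t2.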
The timing interval t1 essentially controls how much signal from the memory cell reaches the sense amplifier before starting the NMOS and PMOS sensing. A short t1 may not provide enough time for all the charge in a selected memory cell to fully share with the charge on the bit line and sense amplifier nodes, and consequently the sense amplifier begins to sense with less signal than would be developed if, alternatively, a longer t1 were configured. A longer t1 increases operating margins at the expense of increased cycle time. Similarly, the timing interval t2 essentially controls how much charge is driven onto the high-going sense amplifier node, bit line, and memory cell during sensing. Increasing t2 increases the voltage stored into the memory cell, but also increases the bit line equilibrate voltage when charge is later shared between true and complement bit lines (and sense amplifier nodes). A short t2 may not provide enough charge to develop the desired restored high level (e.g., 2.0 volts) on the bit line and into a selected memory cell. Conversely, an excessively long t2 timing may not increase the stored high level in the memory cell as much as it increases the bit line equilibrate voltage, and thus may decrease the high level signal available for sensing, particularly at high VDD. The timing interval t3 essentially controls how much charge is shared between the sense amplifier node, the near end and far end of a high-going bit line (which typically is moderately resistive), and the memory cell. The resistance of the NMOS memory cell access transistor is much higher when restoring a high level (due to its lower gate-to-source voltage) than when restoring a low level. The t3 timing is constrained by the time needed to write a high voltage into the selected memory cell through the resistive bit line and further through the relatively high-resistance memory cell access transistor. 
A short t3 may result in a worst case memory cell (one located at the "far" end of a bit line, furthest from its bit line sense amplifier) being written to a restored high level which is too low, for a given amount of "Q" transferred into the sense amplifiers (i.e., for the bit line equilibration voltage which results from the given amount of "Q").
These timing intervals t1, t2, and t3 may be collectively optimized on a chip-by-chip basis. In a preferred embodiment, there may be sixteen different timing settings, each specifying a particular combination of the t1, t2, and t3 timing intervals, ranging from very aggressive for highest performance, to very relaxed for highest yield. For example, the timing setting "1" may provide for the most aggressive (i.e., shortest) t1 timing interval, the most aggressive (i.e., shortest) t2 timing interval, and the most aggressive (i.e., shortest) t3 timing interval. The timing setting "16" may provide for the most relaxed t1 timing interval, the most relaxed t2 timing interval, and the most relaxed t3 timing interval. Each incremental timing setting between "1" and "16" is preferably optimized to incrementally increase, by a similar amount, the signal available at the bit line sense amplifier just before sensing. To accomplish this, the timing setting "2" may increase the t1 interval by 200 ps compared to the "most aggressive" t1 value of timing setting "1," while keeping t2 and t3 unchanged (a 200 ps increase may be easily achieved by adding two inverters to the logic path setting the time interval). The timing setting "3" may increase t3 by 200 ps while keeping the same value of the t1 and t2 intervals as in timing setting "1." Each successive low-numbered timing setting preferably increases the value of one of the three timing intervals t1, t2, and t3 relative to their values in the previous timing setting, while keeping the remaining two timing intervals unchanged. 
Higher numbered timing settings may increase a given timing interval by increasingly larger amounts to maintain a similar increase in the signal available at the bit line sense amplifier just before sensing, or may increase more than one of the three timing intervals. For example, the timing setting "15" may increase t1 and t3 each by 400 ps relative to the respective intervals in timing setting "14" (compared to a 200 ps increase in only t3 between timing setting "2" and "3").
The timing setting "8" is preferably optimized to provide a "nominal" value for each of the three timing intervals t1, t2, and t3 which is expected to be an appropriate setting for a typical device having typical transistor characteristics, typical sense amplifier offset voltage, typical bit line resistance, etc. Note that these "nominal" values of the timing intervals t1, t2, and t3 are a function of the process corner. Higher bit line resistance, higher access transistor threshold voltage, or lower VPP, for example, raise the nominal value of each of the t1, t2, and t3 timing intervals which are called for by timing setting "8." For the preferred embodiment, the various timing settings provide a variety of t1 intervals, some shorter than nominal and others longer than nominal, and provide a variety of t3 intervals, both shorter and longer than nominal. But since the duration of the PMOS sensing is so short for the nominal case, for some embodiments the shortest t2 interval provided is the "nominal" value, and more relaxed t2 intervals are provided for in the timing settings numbered above "8."
During manufacture, this timing setting "8" is configured as the default setting. During a special test mode (for example, at wafer sort) the timing setting may be temporarily made more or less aggressive to determine the window of operation for each chip. Some of the memory devices are found to function correctly with very aggressive timing, while others require more relaxed timing. Then, during the fuse blowing sequence for redundancy, timing fuses may also be blown to permanently modify the default strobe timing. The timing setting is preferably set as aggressively as possible to enhance device performance, while maintaining adequate sense amplifier signal margins for reliability. For example, if a timing setting of "4" is the most aggressive timing for which a given device functions without error, then the device may be advantageously fuse programmed to a timing setting of "6" to ensure some additional operating margin (the signal to the bit line sense amplifiers increasing as the timing setting increases). At a later test, such as at final test of a packaged device, the test mode may still be entered, and the timing setting advanced from its then fuse programmed setting to a more aggressive setting, in order to further verify adequate sense amplifier margins on a chip-by-chip basis, independent of which actual timing setting was fuse programmed into the device.
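The selection of a fuse programmed timing setting from test-mode results may be sketched, purely as an illustration, as follows. The function name is hypothetical, and the two-setting guard band is an assumption inferred from the single "4" to "6" example above rather than a stated rule; the sketch also assumes that a lower setting number is more aggressive and that "16" is the most relaxed setting.

```python
def choose_fuse_setting(passing_settings, margin_steps=2, most_relaxed=16):
    """Pick the setting to fuse program from the settings that passed in test mode.

    margin_steps is an assumed guard band (two settings, per the "4" -> "6" example).
    """
    if not passing_settings:
        raise ValueError("device did not pass at any timing setting")
    most_aggressive_pass = min(passing_settings)  # lowest number = most aggressive
    # Back off toward more relaxed timing, without exceeding the last setting.
    return min(most_aggressive_pass + margin_steps, most_relaxed)


# Hypothetical wafer-sort result: the device passes at settings 4 through 16.
fuse_setting = choose_fuse_setting(range(4, 17))
print(fuse_setting)  # prints 6
```

A device which passes only at the most relaxed settings is simply clamped to setting "16" in this sketch; whether such a device would be accepted is outside its scope.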
A two-dimensional grid of power buses is preferably implemented within each memory bank, with large VDD and VSS buses arranged parallel to the bit lines and implemented in a higher layer of metal (e.g., the top layer), vertically passing above the bit lines. Filter capacitors are located at the ends of each array block as well as at the top and bottom of each memory bank to help provide additional bypass capacitance to withstand the large current spikes which occur during sensing. These filter capacitors, as well as other filter capacitors implemented elsewhere within the device, are preferably implemented using multiple, independent capacitors which are individually de-coupled and automatically switched out of the circuit if, at any time, more than a predetermined leakage current is detected automatically by the memory device as flowing through a given capacitor (i.e., a "shorted" capacitor). The large metal buses allow this stored charge to reach the two selected rows of sense amplifiers (i.e., located in the holes above and below the selected array block) with very little voltage drop, and allow the sense amplifiers to latch quickly and provide a good VSS low level.
The bit lines are equilibrated together to achieve an equilibration voltage on the bit lines, for a preferred embodiment, of approximately 1.0 volts. The bit lines are preferably equilibrated at both ends to reduce the required equilibrate time. The bit line equilibration voltage is coupled from all bit line pairs to a common node which may be sampled just after equilibration and buffered (using a sample-and-hold amplifier) to drive the memory cell plate. Since the bit line equilibration voltage is approximately one-half the written high level, the bit line equilibration voltage may also be sampled, compared to a reference voltage (for example, a 1.0 volt reference), and any voltage difference used to adjust the PMOS timing (and thereby adjust the final written high level).
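The statement that the bit line equilibration voltage is approximately one-half the written high level follows from charge conservation under an idealized assumption: the true and complement bit lines have equal capacitance C_BL, the low bit line sits at VSS (taken as 0 volts) before equilibration, and the sense amplifier node and memory cell capacitances are neglected. Under those assumptions, with V_H denoting the written high level:

```latex
V_{EQ} = \frac{C_{BL}\,V_{H} + C_{BL}\cdot 0}{C_{BL} + C_{BL}} = \frac{V_{H}}{2},
\qquad
V_{H} \approx 2.0\ \text{V} \;\Rightarrow\; V_{EQ} \approx 1.0\ \text{V}
```

Because V_EQ tracks V_H in this way, a sampled equilibration voltage above or below the reference indicates a written high level which is correspondingly too high or too low.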
As stated above, the exemplary memory array is automatically taken back into precharge without waiting for a control signal. In other words, one edge of a clock causes the memory array to execute a useful cycle, then to automatically reset itself in preparation for a new cycle. This precharge timing is relative to the beginning of the active cycle. Of significance, this limits the amount of potential sub-threshold leakage through memory cell access transistors by limiting the time that any bit lines are at VSS. The precharging/equilibration is accomplished by using two sets of signals: one is an automatically timed pulse, while the other stays on until the start of the next cycle. For example, the bit line sense amplifiers are preferably equilibrated using two different equilibrate signals. Both turn on automatically at the same time after NMOS sensing is complete and the selected word line is brought low. One equilibrate signal is turned off by a timed pulse just when the bit line equilibration is substantially complete (i.e., at the end of the active cycle), while the other equilibrate signal is turned off by the start of the subsequent cycle. The pulsed equilibrate signal drives much larger internal capacitive loads, such as large equilibration devices, while the non-pulsed equilibrate signal drives fewer and/or much smaller devices which indeed assist the larger pulsed equilibrate devices in equilibrating the various nodes. However, the smaller devices are largely included as "keepers" to maintain the equilibration until the next active cycle. As such, the total capacitance of the various equilibration signal lines which must be discharged (i.e., brought low) at the start of a new cycle is greatly reduced and can be accomplished with less delay after the initiating control signal, and the performance is enhanced. 
For relaxed clock cycle times, the pulsed equilibrate signal falls automatically at the end of a cycle, while the non-pulsed equilibrate signal stays high until the next cycle selecting this array block is initiated. However, for a clock cycle time which approaches the fastest possible cycle time for a given device, the non-pulsed equilibrate signal for the newly selected array block may be discharged by the initiation of the next cycle at substantially the same time as the pulsed equilibrate signal for the previously selected array block is discharged automatically at the end of the previous cycle. To save power, the non-pulsed equilibrate signal for only the selected array block and supporting circuitry is brought to VSS at the start of an active cycle, and all others remain inactive at VDD throughout the active cycle. Similarly, the pulsed equilibrate signal for only the selected array block and supporting circuitry is actually pulsed at the end of an active cycle, while all others remain inactive at VSS.
During an internal write operation, the exemplary device contains write circuitry that supplies a small differential voltage to the sense amplifier before bit line sensing, the polarity of the voltage depending on the data to be written. The circuitry furthermore "swallows" the voltage otherwise developed in the sense amplifier by the selected memory cell. Then, during their normal latching, the bit line sense amplifiers "write" the level into the memory cell. Because of an internal write queue, the data to be written is already available when the actual internal write operation is started. In preparation for the current write operation, this data is preferably driven onto the global input lines late in the previous write operation, and then coupled to the selected sense amplifier by column select circuitry fairly early in the current write operation, before latching the bit line sense amplifiers. The magnitude of the write signal coupled onto the sense amplifier nodes is kept small to reduce power consumption and to reduce disturbance to the neighboring bit lines and sense amplifiers which are not being written. Preferably, the magnitude of the write signal imparted onto any given sense amplifier node is no higher than that normally developed during a read operation, so that coupling to the neighboring bit lines and sense amplifiers is no worse than during a read operation. The global input lines serving the next word to be written are equilibrated after each write operation, preferably to the bit line equilibration voltage, and driven to the new data state for the next write operation, even if the next write operation is not the next cycle. 
Moreover, the differential voltage on the global input lines serving the next word to be written is equilibrated away (in a write cycle) after bit line sensing has started and the column select lines are inactive (i.e., during the later stages of bit line sensing), and then driven to reflect the new write data for the following write cycle before the bit lines have finished equilibrating, rather than driving these data input signals during the early part of bit line sensing when such movement could disturb the bit line sensing. The global input lines then dynamically float until needed by the next write operation. To handle the possibility that the next write operation may be many cycles later, the global input lines may be refreshed periodically (e.g., every 256 external clock cycles, before any leakage current can substantially modify their voltage) by re-equilibrating and re-driving to ensure the proper magnitude of the write data signal for as long as necessary until the next write operation occurs.
By writing a dynamic memory array by "fooling" the sense amplifier and letting it actually restore the voltage levels onto the bit lines in accordance with the data to be written, rather than in accordance with the data previously in the selected memory cell, a write cycle takes the same very short time as a read cycle, rather than the longer time that would be required by first sensing old data, then modifying it. In addition, a significant amount of power is saved by not having to over-power many sense amplifiers after they have already been latched.
During power-up, all the memory cells are initialized to a low voltage under automatic internal control. Provision is made to allow every word line to simultaneously go high, to force the node to which the bit lines are equilibrated to VSS, and to ensure that the bit line equilibration and array select transistors are on. Since each sense amplifier is then coupled to a common node at VSS by precharge signals, each bit line (both true and complement) is driven to VSS and all memory cells are likewise forced to VSS, even if the word lines are no higher than a threshold voltage above VSS. At about the same time, the memory cell plate is established at a voltage near the eventual bit line equilibration voltage (preferably around 1.0 volts) by other power-up circuits, being careful to limit the current flow, which charges the cell plate, to an amount less than the output current of the substrate bias charge pump (to prevent the substrate from coupling positively and causing massive latchup from the diffused regions of each memory cell's internal node). Then, when normal cycles begin, the very first operation in the memory array occurs with memory array nodes (bit lines, cell plate) properly established, and all memory cells initialized at one of the two valid states (in this example, at VSS). The first cycles do not have to try to sense memory cells having an initialized voltage near the bit line equilibration voltage, as would likely occur without such a power-up sequence due to coupling from the memory cell plate to the memory cells themselves as the memory cell plate reaches its normal level at the bit line equilibration voltage of, for example, 1.0 volts. This prevents any bit line sense amplifiers which are not being written from spending time in a meta-stable state which, if allowed to occur, would affect the high level restored into the memory cells being written, as well as the equilibrate voltage resulting on the bit lines.
During a read operation, signal developed on the bit lines by the selected memory cell is immediately buffered by the local output line amplifier(s) before bit line sensing starts, and immediately starts to develop signal on the pair of global output lines. For certain embodiments, the differential signal propagates through lines and differential amplifiers to the output buffers, whose first stage is a latching amplifier which is then strobed to detect, amplify, and latch this signal. The timing of the strobe signal for this latching amplifier (which may be known as "t4") may be optimized on a chip-by-chip basis. There may be, for example, eight possible strobe timings, from very aggressive to very relaxed. The device may be initially configured with an intermediate default strobe timing (e.g., having a value of "4," where "1" is the most aggressive and "8" is the most relaxed), and during a special test mode (for example, at wafer sort) the strobe timing may be made more or less aggressive to determine the window of operation for each chip. Then, during the fuse blowing sequence for redundancy, timing fuses may also be blown to modify the default strobe timing. The timing is modified to be as aggressive as possible while maintaining adequate margins for reliability. For example, if in the test mode a t4 timing of "2" is the fastest timing for which a given device functions without error, then the device may be advantageously fuse programmed to a t4 timing of "3" or not altered to remain at "4" to ensure sufficient operating margin. 
At a later test, such as at final test of a packaged device, the test mode may again be entered, and the t4 timing advanced from its then fuse programmed setting to a more aggressive setting (e.g., 1 or 2 settings faster than its new programmed timing setting without needing to know the new programmed timing setting), in order to further verify adequate operating margins on a chip-by-chip basis, independent of which actual timing setting was fuse programmed into the device.
In an alternative embodiment of a memory array having a cycle time which is long compared to its read access time, a latching global output line amplifier may be strobed (at what was time t4 in the earlier embodiment) to detect and amplify the signal on the pair of global output lines, and communicate the sensed data onward through output multiplexer circuitry and ultimately (if the particular global output line is selected) to output buffer circuitry. The timing of the global output line amplifier may be selected to support both a flow-through configuration as well as a pipelined configuration. To support a fast flow-through access time specification, the latching global output amplifier is aggressively strobed as soon as a predetermined amount of signal has developed on the global output lines. In this way, the data propagates to and is available at the outputs as quickly as possible. But with this aggressive timing, some devices may fail. Conversely, when in the pipelined mode of operation, the global output latch timing is relaxed to more closely coincide with the global output signal peak, and the sensed data is provided to the output buffers for driving to the output pins during the next cycle (using a PLL or delay-locked loop). By affording additional time for even more signal to develop on the global output lines, a particular device which may be marginal or may even fail at the fast t4 timing of the flow-through mode may prove to have adequate margin at the more relaxed timing of the pipelined mode, and may be sold for use and guaranteed to operate only in the pipelined mode of operation.
Bit line crossover structures are advantageously used to achieve lower worst case coupling, during both read and write operations, onto a particular bit line pair from neighboring bit lines on either side. Because photolithographic guard cells are used at the edges of each arrayed group of memory cells, there is a layout area penalty in providing crossover structures including the required guard cells on either side of each crossover structure. To reduce this area penalty, a novel crossover arrangement is employed, for certain embodiments, which provides a significant degree of noise (i.e., coupling) reduction while requiring only one crossover. Within each array block, each complementary pair of bit lines runs vertically from the top to the bottom of the array block. The true bit line and complement bit line of a first pair run adjacent to each other from the top to the bottom of the array block without any crossovers. The true bit line and complement bit line of a second pair do not run adjacent to each other, but instead straddle the first pair (i.e., both true and complement bit lines of the first pair lie between the true and complement bit lines of the second pair), with a single crossover half-way down the second bit line pair (vertically in the middle of the array block). This crossover arrangement repeats horizontally throughout each array block in groups of two pairs of bit lines (four physical bit line wires). By using this crossover arrangement, only four groups of guard cells are required in each array block: one each at the top and bottom of the array block, and one each at the top and bottom of the single crossover structure located in the vertical center of the array block.
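The noise reduction afforded by the single crossover may be illustrated with a simple nearest-neighbor coupling model, offered only as a sketch. The wire labels, the coupling coefficient k, and the unit voltage swings are hypothetical, and the model considers only the four wires of one group, ignoring coupling from adjacent groups and from non-adjacent wires.

```python
# Wire order, left to right, within one group of two bit line pairs.
# T1/C1: true/complement of the first (inner, uncrossed) pair.
# T2/C2: true/complement of the second (outer) pair, crossed at mid-block.
TOP_HALF = ["T2", "T1", "C1", "C2"]
BOTTOM_HALF = ["C2", "T1", "C1", "T2"]


def coupled_noise(victim, swings, halves, k=1.0):
    """Noise coupled onto `victim` from its adjacent wires within the group.

    Each half of the array block contributes half of the total coupling;
    `k` is an assumed, dimensionless coupling coefficient.
    """
    noise = 0.0
    for half in halves:
        i = half.index(victim)
        for j in (i - 1, i + 1):
            if 0 <= j < len(half):
                noise += 0.5 * k * swings[half[j]]
    return noise


def differential_noise_on_pair1(swings, halves):
    return coupled_noise("T1", swings, halves) - coupled_noise("C1", swings, halves)


# The second pair swings differentially while the first pair is quiet.
swings = {"T1": 0.0, "C1": 0.0, "T2": +1.0, "C2": -1.0}
with_crossover = differential_noise_on_pair1(swings, (TOP_HALF, BOTTOM_HALF))
without_crossover = differential_noise_on_pair1(swings, (TOP_HALF, TOP_HALF))
print(with_crossover, without_crossover)
```

In this model each wire of the first pair is adjacent to the true line of the second pair for half its length and to the complement line for the other half, so the disturbance appears as common mode rather than as a differential signal on the first pair.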
The address and data for a write cycle are queued to eliminate dead cycles on the system data bus. In the exemplary embodiment operated in the pipelined mode, the address for a read cycle is strobed during one cycle, and the corresponding data read from the selected memory cells is driven onto the external data pins during a subsequent cycle. If an external write cycle follows immediately after an external read cycle, the write address may be presented to the address bus and strobed into the memory device just as for a read cycle, but the external bi-directional data bus is occupied with driving the data out corresponding to an earlier external read cycle (by a number of cycles depending on the pipeline latency for a particular embodiment) and cannot be used to present the corresponding write data. Instead, the data for the external write cycle is driven onto the data bus and presented to the device during the cycle in which output data would have appeared had the cycle been an external read cycle instead of an external write cycle. In this way, the address bus and the data bus are used every cycle, with no wasted cycles for either bus. Both the write address and data are queued, and the actual write operation to physically store the write data into the selected memory cells is postponed until a subsequent write cycle, which, when executed, retires the previously received address and data from the write queue into the memory array. Read bypass circuitry is provided which allows data corresponding to the address of the read cycle to be correctly read from the write queue whenever an earlier queued write directed to that same address has not yet been retired.
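The behavior of the write queue and read bypass described above may be sketched as a simple behavioral model. The class and member names are hypothetical, and the single-entry queue depth is an assumption made for brevity; the actual depth depends on the pipeline latency of a particular embodiment.

```python
class QueuedWriteMemory:
    """Behavioral model: one queued write, retired by the next write cycle."""

    def __init__(self):
        self.array = {}      # models the memory cells: address -> data
        self.pending = None  # queued (address, data) not yet retired

    def write(self, address, data):
        # A new write cycle retires the previously queued write into the array...
        if self.pending is not None:
            queued_address, queued_data = self.pending
            self.array[queued_address] = queued_data
        # ...and the new address and data take its place in the queue.
        self.pending = (address, data)

    def read(self, address):
        # Read bypass: a queued, unretired write to this address supplies the data.
        if self.pending is not None and self.pending[0] == address:
            return self.pending[1]
        return self.array.get(address)


mem = QueuedWriteMemory()
mem.write(0x10, "A")
array_before_second_write = dict(mem.array)  # still empty: "A" is only queued
bypassed_read = mem.read(0x10)               # served from the write queue
mem.write(0x20, "B")                         # retires "A" into the array
```

In the model, a read following the first write returns the queued data even though the array has not yet been written, which is the function performed by the read bypass circuitry.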
In the exemplary embodiment, the internal data path is twice as wide (i.e., a "double word") as the external I/O word width (i.e., the least significant address bit selects one of the two possible 36-bit words), and a significant degree of internal power consumption is saved by merging external write cycles when sequential write addresses occur. The address of a given external write cycle is stored and compared to the address of the next external write cycle. If the selected memory cells to be written in both external write cycles correspond to the same physical word line and the same column within the same array block of the same memory bank (i.e., differ in only the least significant address bit), the internal write operation which would otherwise follow from the first external write cycle is delayed, and the data to be written is queued and merged with the data to be written in the second external write cycle. The write queue then "retires" both queued write requests by performing a single internal write operation, simultaneously writing both data words received in the first and second external write cycles. If the internal data path were wider than 72 bits, then more than two 36-bit write cycles could be merged into a single internal write operation. For example, if the internal data path were 144 bits wide, then four 36-bit write cycles could conceivably be merged into a single internal write operation.
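The merging of two external write cycles into a single internal write operation may likewise be sketched as a behavioral model. All names are hypothetical, the queue is assumed to hold a single pending double word, and the explicit retire() call at the end of each example stands in for whatever later cycle retires the queue in an actual embodiment.

```python
class MergingWriteQueue:
    """One pending double word; merges two writes that differ only in the LSB."""

    def __init__(self):
        self.array = {}           # double-word address -> [word0, word1]
        self.pending = None       # (double-word address, {lsb: 36-bit word})
        self.internal_writes = 0  # count of internal write operations

    def retire(self):
        """Perform one internal write operation for the pending double word."""
        if self.pending is None:
            return
        dw_address, words = self.pending
        row = self.array.setdefault(dw_address, [None, None])
        for lsb, word in words.items():
            row[lsb] = word
        self.internal_writes += 1
        self.pending = None

    def external_write(self, address, word):
        dw_address, lsb = address >> 1, address & 1
        if self.pending is not None:
            pending_dw, pending_words = self.pending
            if pending_dw == dw_address and lsb not in pending_words:
                pending_words[lsb] = word  # merge with the queued write
                return
            self.retire()                  # addresses do not pair up: retire first
        self.pending = (dw_address, {lsb: word})


# Sequential addresses 8 and 9 share a double word: one internal write.
merged = MergingWriteQueue()
merged.external_write(8, "W0")
merged.external_write(9, "W1")
merged.retire()

# Addresses 8 and 10 fall in different double words: two internal writes.
unmerged = MergingWriteQueue()
unmerged.external_write(8, "W0")
unmerged.external_write(10, "W2")
unmerged.retire()
```

The internal_writes counter makes the power saving visible in the model: the sequential pair costs one internal write operation where the non-sequential pair costs two.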
The exemplary embodiment includes a burst mode of operation which provides, during subsequent cycles, read or write access to sequentially addressed memory cells relative to a received (i.e., "load") address, without requiring that such sequential addresses be presented to the device. Using the 72-bit wide (double word) organization of each memory bank, two 36-bit words are retrieved from the memory array in the first cycle. The second word is saved to present to the data outputs after the first word is output. Because the exemplary device is organized into separate memory banks, a burst of four sequential words may transcend the address boundaries between memory banks. Consequently, the exemplary device includes provision for automatically initiating a load cycle in another memory bank during a burst cycle.
In certain embodiments, a dynamic memory array using the architecture and supporting circuits described above achieves random access cycles (each requiring a new random row access) at a sustained rate in excess of 200 MHz operation, even when each new row access is within the same array block of the same memory bank.
The present invention may be better understood, and its numerous objects, features, and advantages made even more apparent to those skilled in the art by referencing the detailed description and accompanying drawings of the embodiments described below.