1. Field of the Invention
The present invention relates to the transfer of data in digital systems. More specifically, the present invention relates to a protocol and apparatus that provide improved interconnect utilization. In particular, a two-step write operation according to the present invention avoids resource conflicts, thus permitting read and write operations to be issued in any order while maintaining continuous data traffic.
2. Description of the Related Art
A computer, such as a computer system 10 shown in FIG. 1A, typically includes a bus 12 which interconnects the system's major subsystems such as a central processing unit (CPU) 14, a main memory 16 (e.g., DRAM), an input/output (I/O) adapter 18, an external device such as a display screen 24 via a display adapter 26, a keyboard 32 and a mouse 34 via I/O adapter 18, a SCSI host adapter 36, and a floppy disk drive 38 operative to receive a floppy disk 40. SCSI host adapter 36 may act as a storage interface to a fixed disk drive 42 or a CD-ROM player 44 operative to receive a CD-ROM 46. Fixed disk drive 42 may be a part of computer system 10 or may be separate and accessed through other interface systems. A network interface 48 may provide a connection to a local area network (LAN) (e.g., a TCP/IP-based LAN) or to the Internet itself. Many other devices or subsystems (not shown) may be connected in a similar manner. Also, it is not necessary for all of the devices shown in FIG. 1A to be present to practice the present invention, as discussed below. The configuration of the devices and subsystems shown in FIG. 1A may vary substantially from one computer to the next.
In today's high-performance computers, the link between the CPU and its associated main memory (e.g., CPU 14 and main memory 16, respectively) is critical. Computer programs currently available place imposing demands on a computer's throughput capabilities. This need for increasingly higher bandwidth will continue.
One method for improving the throughput of this interface is to provide a dedicated bus between CPU 14 and main memory 16. Such a bus is shown in FIG. 1A as a memory bus 50. Memory bus 50 allows CPU 14 to communicate data and control signals directly to and from main memory 16. This improves computational performance by providing a pathway directly to the system's main memory that is not subject to traffic generated by the other subsystems in computer system 10. In such systems, the pathway between main memory 16 and bus 12 may be by way of a direct memory access (DMA) hardware construct for example.
FIG. 1B illustrates a block diagram in which components (e.g., CPU 14 and main memory 16) communicate over an interconnect 60 in order to process data. Interconnect 60 is a generalization of memory bus 50, and allows one or more master units, such as master units 70(1)-(N), and one or more slave units, such as slave units 80(1)-(N), to communicate with one another. (The term “N” is used as a general variable; its use should not imply that the number of master units is identical to the number of slave units.) Components attached to interconnect 60 may contain master and slave memory elements. In the case where interconnect 60 serves as memory bus 50, CPU 14 communicates with main memory 16 over interconnect 60 using pipelined memory operations. These pipelined memory operations allow maximum utilization of interconnect 60, which is accomplished by sending data over interconnect 60 as continuously as is reasonably possible given the throughput capabilities of main memory 16.
The block diagram of FIG. 1B is applicable to intrachip, as well as interchip, communications. It will be understood that one or more of slave units 80(1)-(N) may consist of other components in addition to memory (e.g., a processor of some sort). The block diagram of FIG. 1B can, of course, be simplified to the case of a system having only a single master.
FIG. 1C shows a memory device 100. Memory device 100 might be used in a computer system, for example, as main memory 16 of computer system 10, or in combination with similar devices to form main memory 16. Memory device 100 is capable of being read from and written to by a memory controller (not shown). An interconnect 110 is used to communicate control information over control lines 112 and data over data lines 114 from the memory controller to memory device 100. Interconnect 110 is thus analogous to memory bus 50. To support such communications and the storage of data, memory device 100 typically includes three major functional blocks.
The first of these, a transport block 120, is coupled to interconnect 110. Interconnect 110, which includes control signal lines 112 and data signal lines 114, is used to read from and write to memory device 100. Interconnect 110 provides the proper control signals and data when data is to be written to memory device 100. Transport block 120 receives these signals and takes the actions necessary to transfer this information to the remaining portions of memory device 100. When memory device 100 is read, transport block 120 transmits data on data signal lines 114 in response to control signals received on control signal lines 112. Transport block 120 includes a control transport unit 122, which receives the signals carried on control signal lines 112 and controls a read data transport unit 124 and a write data transport unit 126 to support the communication protocol used in transferring information over interconnect 110 (e.g., transferring information between CPU 14 and main memory 16 over memory bus 50).
In its simplest form, transport block 120 is merely wiring, without any active components whatsoever. In that case, control transport unit 122 would simply be wires, as read data transport unit 124 and write data transport unit 126 would require no control. In effect, transport block 120 is not implemented in such a case. Another possible configuration employs amplifiers to provide the functionality required of transport block 120. In yet another possible configuration, transport block 120 includes serial-to-parallel converters. In this case, control transport unit 122 controls the conversion performed by read data transport unit 124 and write data transport unit 126 (which would be the serial-to-parallel converters). Other equivalent circuits may also be used with equal success.
The second of the major functional blocks is an operations block 130. Operations block 130 receives control information from transport block 120, more specifically from control transport unit 122, which provides the requisite signals to a control operation unit 150.
In FIG. 1C, control operation unit 150 is implemented as an architecture designed to control generic DRAM memory cells. A specific DRAM memory cell architecture (or other architecture), however, may require different control signals, some or all of which may not be provided in the architecture shown in FIG. 1C. Control operation unit 150 includes a sense operation unit 132, a precharge operation unit 134, and a core transfer operation unit 136.
Data being read is transferred from the third functional block, a memory core 180, via data I/O bus 185 to a read data operation unit 160. From read data operation unit 160, the data being read is transferred to read data transport unit 124 (and subsequently, onto data signal lines 114) in response to control signals from control operation unit 150. Read data operation unit 160 may consist of, for example, data buffers (not shown) that buffer the outgoing data signals to drive read data transport unit 124.
Data to be written is transferred from write data transport unit 126 to a write data operation unit 170 in response to control signals from control transport unit 122 (if used) and control operation unit 150. Write data operation unit 170 receives write data from write data transport unit 126 and passes it on to memory core 180 via data I/O bus 185. As shown, write data operation unit 170 may be controlled by core transfer operation unit 136. Write data operation unit 170 may consist of, for example, data buffers (not shown) that buffer the incoming data signals.
Write data operation unit 170 may also contain mask buffers that buffer mask information received from write data transport unit 126. As with data buffering, these actions may be taken under the control of core transfer operation unit 136. The mask information is then passed to memory core 180 via data I/O bus 185, as well. The mask information is used by the memory core to selectively write parts of the data within the memory core. Alternatively, no mask is employed, with the result that all the data is written unconditionally.
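The selective-write behavior described above can be illustrated with a short sketch. It assumes byte-granular masking and a convention in which a set flag means "write this byte"; neither the granularity nor the polarity is specified here, and the name masked_write is hypothetical.

```python
def masked_write(stored: bytes, incoming: bytes, write_enable: list) -> bytes:
    """Return the stored data after a masked write.

    Assumption (not from the text): one mask flag per byte, and a True flag
    means the corresponding incoming byte replaces the stored byte.
    """
    if not (len(stored) == len(incoming) == len(write_enable)):
        raise ValueError("data and mask must have the same length")
    # Keep the old byte wherever the mask inhibits the write.
    return bytes(new if enable else old
                 for old, new, enable in zip(stored, incoming, write_enable))
```

Passing a mask whose flags are all set corresponds to the unmasked case, in which all the data is written unconditionally.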
The circuitry of control operation unit 150 may take any number of appropriate configurations, depending in part on the architecture of the memory core employed. For example, the memory cells of memory core 180 may be static random access memory (SRAM) cells, read-only memory (ROM) cells (which can, of course, only be read), dynamic RAM (DRAM) cells, or another type of memory cell. The type of memory cell employed in memory core 180 affects the architecture of control operation unit 150, as different memory cells often require different control signals for their operation.
Operations block 130 thus contains control operation unit 150, read data operation unit 160, and write data operation unit 170. Again, in the simplest configuration of transport block 120, the subsystems of transport block 120 are merely wires. Moreover, the functionality provided by the subsystems of transport block 120 is merely one of transferring data and control information.
Assuming that the memory core employs DRAM-type memory cells, operations which may be performed on memory core 180 (referred to herein as core operations) may be generalized into four primary categories:
        1) Precharge;
        2) Sense;
        3) Read; and
        4) Write.
While these generalized operations are dealt with in detail later in this section, they are introduced here to illustrate the following effects on the block diagram of FIG. 1C. Given the generalized operations to be performed, the circuitry of control operation unit 150 may be logically divided into the three subsystems mentioned previously: sense operation unit 132, precharge operation unit 134, and core transfer operation unit 136. Core transfer operation unit 136 controls read data operation unit 160 and write data operation unit 170 when transferring data from and to memory core 180, respectively (i.e., read and write operations). Core transfer operation unit 136 also controls memory core 180, causing memory core 180 to store write data and output read data. Precharge operation unit 134 controls memory core precharge operations, which precharge the selected banks in memory core 180. Sense operation unit 132 is provided for the control of memory core sense operations.
The subsystems of operations block 130 use the control information received to coordinate movement of control and data information to and from memory core 180. Read data operation unit 160 and write data operation unit 170 contain circuitry specific to the functions which read and write data from and to memory core 180, respectively. Control operation unit 150 contains circuitry used to control memory core 180, including circuitry for the control of read and write operations. Core interface signals 190 are provided to control memory core 180.
FIG. 2 illustrates a memory core 200, which can serve as memory core 180 in FIG. 1C. Memory core 200 typically includes several basic functional blocks. Memory core 200 is illustrated as including multiple memory banks, memory banks 205(1)-(N). Alternatively, memory core 200 can be implemented using only a single memory bank (e.g., memory bank 205(1)). Included in each of memory banks 205(1)-(N) are a storage array, exemplified by storage arrays 210(1)-(N), and a set of sense amplifiers, exemplified by sense amplifiers 215(1)-(N). Storage arrays 210(1)-(N) are central to the function of memory core 200, actually holding the data to be stored. Storage arrays 210(1)-(N) are connected to sense amplifiers 215(1)-(N) by bit lines 220(1)-(N), respectively. Such storage arrays are normally organized into rows and columns of storage cells, each of which typically stores one bit of information, although configurations for storing multiple bits are known in the art.
Also included in memory core 200 are a row decoder 225 and a column decoder 230. A row address 235 is provided to row decoder 225, along with row control signals 240, which cause row decoder 225 to latch a row address thus presented. In turn, row decoder 225 presents this address information to memory banks 205(1)-(N) via row select lines 245. Similarly, a column address 250 is provided to column decoder 230, along with column control signals 255, which cause column decoder 230 to latch a column address thus presented. In turn, column decoder 230 presents this address information to memory banks 205(1)-(N) via column select lines 260 to select which sense amplifiers are connected to the column amplifiers. The column control signals 255 may include mask bit signals to selectively mask individual sense amplifiers in accordance with a predetermined masking scheme.
Column control signals 255 are also provided to column amplifiers 265. Column amplifiers 265 are coupled to sense amplifiers 215(1)-(N) by column I/O lines 266, and amplify the data signals input to and output from sense amplifiers 215(1)-(N). Column amplifiers 265 are also coupled to data I/O bus 185 (from FIG. 1C), permitting the communication of control signals from operations block 130 to the various control structures within memory core 200. The signals aggregated as core interface signals 190 (as illustrated in FIG. 1C) thus include row address 235, row control signals 240, column address 250, and column control signals 255. Thus, the interface to a memory core generally consists of a row address, a column address, a datapath, and various control signals, including mask signals.
As shown in FIG. 2, memory cores can have multiple banks, which allows simultaneous row operations within a given core. The use of multiple banks improves memory performance through increased concurrency and a reduction of conflicts. Each bank has its own storage array and can have its own set of sense amplifiers to allow for independent row operation. The column decoder and datapath are typically shared between banks in order to reduce cost and area requirements, as previously described.
FIG. 3 illustrates a generic storage array 300, in which data is stored in storage cells 305(1,1)-(N,N). Thus, storage array 300 is capable of storing N² bits, using a common storage cell implementation. As shown, each one of word lines 310(1)-(N) accesses a row of storage cells 305(1,1)-(N,N) (e.g., storage cells 305(1,1)-(1,N)), which in turn transfers the stored data onto internal bit lines 320(1)-(N). Internal bit lines 320(1)-(N) emerge from storage array 300 as bit lines 220 (i.e., an aggregate of bit lines 220(1)-(N), which are connected to sense amplifiers 215(1)-(N)).
Accessing the information in a storage array (i.e., reading data stored in storage arrays 210(1)-(N)) is typically a two-step process. First, data is transferred between storage array 300 and a corresponding set of sense amplifiers 215(1)-(N). Next, the data is transferred between the sense amplifiers involved and the column amplifiers 265. Certain memory core architectures do away with the column amplifiers, transferring the data from the sense amplifiers directly to the data I/O bus (i.e., data I/O bus 185).
The first major step, transferring information between storage arrays 210(1)-(N) and sense amplifiers 215(1)-(N), is known as a “row access” and is broken down into the minor steps of precharge and sense. The precharge step prepares the sense amplifiers and bit lines for sensing, typically by equilibrating them to a midpoint reference voltage. During the sense operation, the row address is decoded, a single word line is asserted, the contents of the storage cells in the selected row are placed on the bit lines, and the sense amplifiers amplify each value to full rail (i.e., a full digital high or low level), completing the movement of the information from the storage array to the sense amplifiers. Of note is the fact that the sense amplifiers can also serve as a local cache which stores a “page” of data which can be more quickly accessed with column read or write accesses. The second major step, transferring information between the sense amplifiers and the interface, is called a “column access” and is typically performed in one step. However, variations are possible in which this major step is broken up into two minor steps, e.g., by putting a pipeline stage at the output of the column decoder. In this case the pipeline timing should be adjusted to account for the extra time involved.
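The row access and column access steps just described can be sketched as a toy software model of a single bank. This is an illustration only, not a description of any actual circuit: the class name Bank and its methods are hypothetical, and the write path is simplified so that the storage cell in the open row is updated at the same moment as the sense amplifier.

```python
class Bank:
    """Toy model of one memory bank: a storage array plus sense amplifiers."""

    def __init__(self, rows: int, cols: int):
        self.array = [[0] * cols for _ in range(rows)]
        self.open_row = None      # None means the bank is precharged
        self.sense_amps = None    # holds the sensed "page" while a row is open

    def precharge(self) -> None:
        # Prepare bit lines and sense amplifiers for the next sense operation.
        self.open_row = None
        self.sense_amps = None

    def sense(self, row: int) -> None:
        # Row access: move one row from the storage array into the sense amps.
        if self.open_row is not None:
            raise RuntimeError("precharge required before sensing another row")
        self.open_row = row
        self.sense_amps = list(self.array[row])

    def read(self, col: int) -> int:
        # Column access: served from the sense amplifiers, not the array.
        if self.open_row is None:
            raise RuntimeError("no row has been sensed")
        return self.sense_amps[col]

    def write(self, col: int, value: int) -> None:
        # Simplifying assumption: the open row's cell follows the sense amp.
        if self.open_row is None:
            raise RuntimeError("no row has been sensed")
        self.sense_amps[col] = value
        self.array[self.open_row][col] = value
```

In this model, repeated read and write calls against an open row need no further sense operation, which mirrors the use of the sense amplifiers as a cache for a page of data.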
These two steps give rise to the four basic memory operations mentioned previously: precharge, sense, read, and write. A typical memory core can be expected to support these four operations (or some subset thereof). However, certain memory types may require additional operations to support architecture-specific features. The general memory core described provides the basic framework for memory core structure and operations. However, a variety of memory core types, each with slight differences in their structure and function, exist. The three major memory core types are:
        Dynamic Random-Access Memory (DRAM)
        Static Random-Access Memory (SRAM)
        Read-Only Memory (ROM)
The structure of a conventional DRAM core is similar to the generic memory core in FIG. 2. Like memory core 200, the conventional DRAM structure has a row and column storage array organization and uses sense amplifiers to perform row access. As a result, the four primary memory operations (sense, precharge, read and write) are supported. Memory core 200 includes an additional column amplifier block and column amplifiers 265, which are commonly used to speed column access in DRAM (and other memory core types, as well). Also illustrated by FIG. 2 is the use of multiple banks, a common configuration for conventional DRAM cores. As before, the row decoder, column decoder, and column amplifiers are shared among the banks. An alternative configuration replicates these elements for each bank. However, replication typically requires larger die area and thus incurs greater cost.
Inexpensive core designs with multiple banks typically share row decoders, column decoders, and column datapaths between banks to minimize die area, and therefore cost.
Conventional DRAM cores use a single transistor cell, known as a 1T cell. The single transistor accesses a data value stored on a capacitor. The 1T cell is one of the storage cell architectures that employs a single bit line, as referred to previously. This simple storage cell achieves high storage density, and hence a low cost per bit. However, designs employing such storage cells are subject to two limitations. First, such storage cell architectures exhibit slower access times than certain other storage cells, such as SRAM storage cells. Since the passive storage capacitor can only store a limited amount of charge, row sensing for conventional DRAM storage cells (i.e., 1T cells) takes longer than for other memory types with actively-driven cells (e.g., SRAM storage cells). Hence, the use of a 1T storage cell architecture generally results in relatively slow row access and cycle times.
Second, such storage cell architectures require that the data held in each cell be refreshed periodically. Because the bit value is stored on a passive capacitor, the leakage current in the capacitor and access transistor results in degradation of the stored value. As a result, the cell value must be “refreshed” periodically. The refresh operation consists of reading the cell value and re-writing the value back to the cell. These two additional memory operations are named refresh sense and refresh precharge, respectively. In traditional cores, refresh sense and refresh precharge were the same as regular sense and precharge operations. However, with multiple bank cores, special refresh operations may be advantageous to enable dedicated refresh circuits and logic to support multibank refresh.
To perform a row access in a conventional DRAM having a single bank, bit lines 220(1)-(N) and sense amplifiers 215(1)-(N) must first be precharged, typically to one-half of the supply voltage (Vdd/2). The row precharge time, tRP, is the time required to precharge the row to be sensed. To perform a sense operation, row decoder 225 drives a single word line (e.g., one of word lines 310(1)-(N)) to turn on each of the memory cells' access transistors (not shown) in the row being sensed. The charge on each of the memory cells' storage capacitors (also not shown) transfers to its respective bit line, slightly changing the corresponding bit line's voltage. The sense amplifier detects this small voltage change and drives the bit lines to either Vdd or ground, depending on the voltage change produced by the capacitor's charge. The word line must be held high for a minimum time period of tRAS,MIN to complete the sensing operation. At some time before the bit lines reach their final value, a column read or write access can begin. The time between the start of the sense operation and the earliest allowable column access time is tRCD (the row-to-column access delay). The total time to perform both precharge and sense is tRC, the row cycle time, and is a primary metric for core performance.
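As a worked illustration of how tRP and tRCD combine, the following sketch computes when the sense operation and the first column access may begin, given the time at which precharge starts. The function name and the numeric values in the usage note are hypothetical; actual timing values depend on the particular device.

```python
def row_access_times(precharge_start, t_rp, t_rcd):
    """Return (sense_start, earliest_column_access) for one row access.

    All arguments share one time unit (e.g., ns). The sense operation is
    assumed to start as soon as the precharge time t_rp has elapsed.
    """
    sense_start = precharge_start + t_rp          # precharge must finish first
    earliest_column_access = sense_start + t_rcd  # row-to-column access delay
    return sense_start, earliest_column_access
```

For example, with assumed values of 20 ns for tRP and 20 ns for tRCD, a precharge begun at time 0 allows sensing to begin at 20 ns and the first column access at 40 ns.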
Row access timing for DRAMs with multiple banks, such as that illustrated in FIG. 2, differs slightly from the preceding example. The delay tPP specifies the minimum delay between precharge operations to different banks. This indicates that the precharge circuitry is able to precharge the next row (which may be the same row originally precharged) after a period of tPP. Typically, tPP is approximately equal to (or even less than) tRP, assuming the same memory core and device architecture are employed. Similarly, tSS specifies the minimum delay between performing sense operations on different banks. As before, the sensing on different banks can be carried out more quickly than repeated sensing on the same bank. These parameters indicate that, while the precharge circuitry can precharge a row every tPP seconds and sense circuitry can sense every tSS seconds (both of which are usually measured in ns), a single bank's storage array can only be precharged (or sensed) every tRC seconds (measured in ns). Thus, a memory core employing multiple banks can be read from and written to more quickly in situations where different banks are being accessed.
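The interaction between tSS and tRC can be made concrete with a small scheduling sketch. It assumes only the two constraints stated above (successive sense operations are at least tSS apart, and sense operations to the same bank are at least tRC apart) and ignores every other timing parameter; the function name and the numeric values are hypothetical.

```python
def schedule_senses(banks, t_ss, t_rc):
    """Return the earliest issue time of each sense operation, in order.

    banks: sequence of bank numbers, one per requested sense operation.
    Only the t_ss (any two senses) and t_rc (same bank) limits are modeled.
    """
    last_any = None       # issue time of the most recent sense to any bank
    last_by_bank = {}     # issue time of the most recent sense per bank
    times = []
    for bank in banks:
        t = 0
        if last_any is not None:
            t = last_any + t_ss
        if bank in last_by_bank:
            t = max(t, last_by_bank[bank] + t_rc)
        times.append(t)
        last_any = t
        last_by_bank[bank] = t
    return times
```

With assumed values of tSS = 20 ns and tRC = 80 ns, four sense operations to four different banks can issue at 0, 20, 40, and 60 ns, whereas four sense operations to the same bank issue at 0, 80, 160, and 240 ns.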
Typical column cycle times and access times greatly depend on the type of sense amplifier circuit employed. This is because the sense amplifiers drive the selected data onto the column data I/O wires, and must be able to drive the capacitance that those wires represent (i.e., the amplifier must be able to charge that capacitance in the requisite time). Increased speeds can be achieved by improving the sense amplifier's drive capability, thus charging the column data I/O wires' capacitance more quickly. This could be done by using more or larger transistors in the sense amplifier circuit. However, such modifications greatly increase die area, and so cost, especially because the sense amplifier circuit is so heavily replicated. Thus, the desire to minimize the die area of commodity DRAMs limits the further reduction of column access times by this technique.
In a conventional DRAM, the column decoder's output drives a single column select line, which selects some or all of the outputs from the sense amplifiers. The column decoder's output may be placed in a register for pipelined designs. The selected sense amplifiers then drive their respective data onto the column I/O wires. To speed column access time, the column I/O lines are typically differential and sensed using differential column amplifiers (e.g., column amplifiers 265 in FIG. 2), which amplify small voltage differences on the column I/O wires and drive data I/O bus 185. The width of the column I/O bus determines the data granularity of each column access (also known as CAS block granularity).
Unfortunately, the preceding DRAM timing parameters (and others) can vary widely due to variations in manufacturing processes, supply voltage, operating temperature, and process generations, among other factors. In order for a memory architecture to operate properly given such variations, it is important for a DRAM protocol to be able to support these varied row and column timings.
In a conventional DRAM, column control signals 255 of FIG. 2 typically include a column latch signal, a column cycle signal, and write mask signals. The column latch signal precedes the column cycle signal, and causes column decoder 230 to latch the column address (column address 250). In this type of architecture, the column cycle signal indicates the actual beginning of the column access process, and therefore is required to wait for the column address to be latched. Some DRAM memory cores also include the ability to mask write data. With masking, a write operation is performed such that some bits or bytes of the datapath are not actually written to the storage array depending on the mask pattern. Typically, the mask pattern is delivered to the column amplifier write circuit, which inhibits the write data in an appropriate manner. Moreover, data I/O bus 185 and/or column I/O lines 266 can be either bidirectional, in which case write and read data are multiplexed on the same bus, or unidirectional, in which case separate write and read datapaths are provided. While FIG. 2 illustrates data I/O bus 185 as a bidirectional bus, the use of a unidirectional bus can easily be envisioned.
FIG. 2 may also be used to illustrate a memory core employing an SRAM storage cell architecture. The typical SRAM memory core architecture shares the core structure and functionality of the conventional DRAM memory architecture discussed previously. Moreover, accesses are performed in a two-step process similar to that used in accessing data held in a DRAM memory core. First, during the sense operation, the information is transferred between the storage array and the sense amplifiers. Second, in the column access operation, the information is transferred between the sense amplifiers and the interface. Another similarity to DRAM is the need to precharge the bit lines prior to sensing operations, although the typical precharge value is the supply voltage, not the one-half of the supply voltage normally used in conventional DRAM architectures.
SRAM memory cores differ markedly from DRAM memory cores in the architecture of the storage cells used in each. In an SRAM memory architecture, data is stored statically, typically using a circuit of several transistors. A typical SRAM storage cell uses cross-coupled CMOS inverters to store a single data bit, and employs the bit line pairs as illustrated in FIG. 3 (internal bit lines 320(1)-(N), e.g., differential bit lines). A word line (one of word lines 310(1)-(N)) turns on access transistors within the selected SRAM storage cells (e.g., storage cells 305(1,1)-(1,N)), which connect each cell in the row to the differential bit lines (internal bit lines 320(1)-(N)). Unlike a DRAM cell, however, each SRAM storage cell actively drives the stored value onto its respective bit line pair. This results in faster access times. The static nature of the SRAM cell also eliminates the need for refresh operations. However, the static cell uses more transistors and therefore requires more area than a DRAM cell. As with the DRAM, the four primitive operations of an SRAM are sense, precharge, read, and write. However, because an SRAM storage cell operates so quickly, precharge and sense may be performed for each read (even within a page). This is in contrast to DRAM devices (known as page-mode DRAM), which save time by storing a page of data in the device's sense amplifiers, as noted previously.
Read-only memory (ROM) cores store information according to an electrical connection at each cell site which joins rows to columns. Typically, a single transistor forms the electrical connection at each cell site. There are a variety of ROM cell types, including erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash ROM, and mask-programmable ROM. Their differences lie in the type of transistor used in each architecture's storage cell. However, ROMs share the storage array architecture illustrated in FIG. 2, which requires a row and column decode of the address for each data access.
Unlike SRAM and DRAM devices, not all ROM devices include sense amplifier circuits (e.g., sense amplifiers 215(1)-(N)). Sense amplifiers are only used in certain ROM architectures which require fast access times. For such ROM devices, the primitive operations are sense, precharge, and read. For slower ROM devices that do not use sense amplifiers, the selected data values are driven directly from the storage cell circuitry to output amplifiers, which in turn drive the data I/O bus. For these ROMs, the single primitive operation is read.
A significant limitation on the effective bandwidth of memory bus 50 (i.e., interconnect 110) can arise as the result of the issuance of certain combinations of read and write operations, which may intrinsically introduce inefficiencies in the utilization of interconnect 110. For example, a delay (also known as a data bubble) may occur when a write operation is followed by a read operation. Because the write data is immediately present on interconnect 110 and the read data is not present until a later time (determined by the access time of the device being read), a data bubble between the write data and read data naturally occurs. This data bubble impairs the efficient utilization of interconnect 110 and the column I/O datapath.
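The size of the data bubble in the write-then-read case can be estimated with a simple cycle-count sketch. It assumes that control information for the two operations is issued back to back, that each data transfer occupies the data lines for a fixed number of cycles, and that the latencies are expressed in whole cycles; the function name, parameters, and values are hypothetical.

```python
def data_bubble(read_latency, write_latency=0, burst=1):
    """Idle cycles on the data lines between write data and read data.

    The write is issued at cycle 0 and the read in the next control slot,
    `burst` cycles later. Latencies count cycles from issue to first data.
    """
    write_end = 0 + write_latency + burst      # last write data cycle + 1
    read_start = burst + read_latency          # first read data cycle
    return max(0, read_start - write_end)
```

Under these assumptions the bubble equals the difference between the read and write latencies: with write data present immediately (latency 0), an assumed read access time of 6 cycles, and 4-cycle transfers, the data lines sit idle for 6 cycles.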
Moreover, because it is preferable to share certain resources of interconnect 110, certain combinations of read and write operations are not allowable. These combinations result in data bubbles between the data transferred by certain of the read and write operations within them. Such data bubbles are of particular importance in systems which are configured to maintain full or almost full utilization of interconnect 110 by constantly (or nearly constantly) transferring data to and from components attached thereto (e.g., CPU 14 and main memory 16), and within the memory devices which make up main memory 16.
In a conventional memory of the design shown in FIGS. 2 and 3, the resource ordering for read and write operations differs slightly. A read operation uses resources in the order:
        control signal lines 112
        column I/O datapath (including data I/O bus 185 and column I/O lines 266)
        data signal lines 114
while a write operation uses them in the order:
        control signal lines 112
        data signal lines 114
        column I/O datapath (including data I/O bus 185 and column I/O lines 266)
These differences in the ordering of resource usage give rise to resource conflicts when read and write operations are issued because control signals issued over control signal lines 112 cause data to be transferred immediately, in relative terms. Thus, if data signal lines 114 and the column I/O datapath are bidirectional (as is desirable), conflicts can occur between read data and write data because each transfer requires the use of these resources.
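The effect of the two resource orderings can be illustrated with a small occupancy sketch. It assumes, purely for illustration, that one operation is issued per cycle and that each operation holds each of its three resources for exactly one cycle, in order; the names READ_ORDER, WRITE_ORDER, and find_conflicts are hypothetical.

```python
# Resource usage order for each operation type, one cycle per stage (assumed).
READ_ORDER = ("control", "column_io", "data")
WRITE_ORDER = ("control", "data", "column_io")

def find_conflicts(ops):
    """Return sorted (resource, cycle) pairs claimed by more than one operation.

    ops: sequence of "R" or "W"; the operation at index i is issued at cycle i.
    """
    orders = {"R": READ_ORDER, "W": WRITE_ORDER}
    claims = {}
    for issue_cycle, op in enumerate(ops):
        for stage, resource in enumerate(orders[op]):
            key = (resource, issue_cycle + stage)
            claims.setdefault(key, []).append(issue_cycle)
    return sorted(key for key, users in claims.items() if len(users) > 1)
```

Under these assumptions, two reads or two writes issued back to back never collide, while a read followed by a write collides on the data lines and a write followed by a read collides on the column I/O datapath.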
What is therefore desirable is a protocol and apparatus that provide improved interconnect utilization. In particular, the protocol should permit read and write operations to be issued in any order without the need to delay one or more of the operations because of resource conflicts. Moreover, the apparatus should be configured to perform this function in the case of bidirectional interconnect and column I/O datapaths.