1. Field of the Invention
The present invention relates to the field of semiconductor memory devices and, more particularly, to a very wide, very fast distributed memory and a method of constructing the same.
2. Description of the Related Art
In certain processor-based applications there is a need for a very wide memory having relatively few memory locations that are distributed throughout a single chip. For example, in a single instruction, multiple data (SIMD) massively parallel processor (MPP) array, a very large number of processing elements (PEs) each typically contain a register file. The register file typically contains only a few memory words (e.g., sixty-four) organized as a single column of one-word rows. Because all of the PEs execute the same instruction at a time, they all use the same address to access their respective register files. In addition, each register file is read from or written to at essentially the same time. In effect, the distributed register files act as a single, very wide memory device.
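The following is a minimal sketch of the behavior described above, in which every register file receives the same address so that the set of files behaves like one very wide memory. The names RegisterFile, wide_read and wide_write, and the 8-bit word width, are assumptions made for illustration only; only the sixty-four word depth comes from the text.

```python
WORDS_PER_FILE = 64   # "sixty-four" words, as in the example in the text
WORD_BITS = 8         # assumed word width, chosen only for illustration


class RegisterFile:
    """Hypothetical model of one PE's register file: a single column of one-word rows."""

    def __init__(self, words=WORDS_PER_FILE):
        self.rows = [0] * words

    def read(self, address):
        return self.rows[address]

    def write(self, address, value):
        # Truncate the value to the assumed word width.
        self.rows[address] = value & ((1 << WORD_BITS) - 1)


def wide_read(register_files, address):
    """Read the same address from every register file, as in a SIMD array."""
    return [rf.read(address) for rf in register_files]


def wide_write(register_files, address, values):
    """Write one word per register file, all at the same address."""
    for rf, value in zip(register_files, values):
        rf.write(address, value)


files = [RegisterFile() for _ in range(4)]   # four PEs stand in for thousands
wide_write(files, 5, [1, 2, 3, 4])
wide_word = wide_read(files, 5)
```

Taken together, the words returned by wide_read form one row of the conceptual wide memory, with one slice contributed by each PE.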
It is impractical to implement this very wide memory as a single random access memory (RAM) array core. Such a large memory would be very slow and the routing difficulties associated with connecting thousands of data lines through the chip would be formidable. Therefore, several smaller memory cores are needed, with each core serving a small group of PEs. The use of several smaller memory cores, however, is not without its shortcomings. For instance, the address decoding logic responsible for decoding an address and selecting the appropriate word to be accessed from the memory array has to be repeated for every core, which takes up precious space on the chip.
A normal memory core 10 is illustrated in FIG. 1. A decode circuit 12 is positioned to one side of the memory bit array 20 and sense amplifiers and other select logic 30 are positioned beneath the array 20. Note that the address lines 14 are driven in vertically, along the length of the decode circuit 12, to the decode logic 16 within the decode circuit 12. The address lines 14 are decoded by the decode circuit 12 and converted into a word line number/address corresponding to one of the word lines 18 in the core 10. A word select signal is then driven across the word line 18 and through the memory array 20 to activate the appropriate word or row of memory within the array 20.
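The decode step described above can be sketched as a function that converts a binary address into a one-hot set of word lines, exactly one of which is active. The function names decode_address and read_row, and the small array size, are hypothetical; the sketch models only the logical behavior, not the circuit of FIG. 1.

```python
def decode_address(address, num_word_lines=64):
    """Convert an address into one-hot word lines (1 = selected row)."""
    if not 0 <= address < num_word_lines:
        raise ValueError("address out of range")
    return [1 if line == address else 0 for line in range(num_word_lines)]


def read_row(memory_array, word_lines):
    """Return the row of the array activated by the single asserted word line."""
    if sum(word_lines) != 1:
        raise ValueError("exactly one word line must be active")
    return memory_array[word_lines.index(1)]


# Assumed 8-row array for illustration; each row holds one word.
array = [0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17]
lines = decode_address(3, num_word_lines=8)
word = read_row(array, lines)
```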
For a read operation, the activated row couples all of the memory cells corresponding to the word line 18 to respective bit lines 22, which typically define the columns of the array 20. It should be noted that a register file typically consists of a single column and that column address decoding is typically not required. For a dynamic random access memory (DRAM), when a particular row is activated, the sense amplifiers 30 connected to the bit lines 22 detect and amplify the data bits transferred from the array 20 by measuring the potential difference between the activated bit lines 22 and a reference line (which may be an inactive bit line). As is known in the art, for a static random access memory (SRAM), the sense amplifier circuitry 30 would not be required. The read operation is completed by outputting the accessed data bits over input/output (I/O) lines 32.
Since the typical memory core 10 contains the decode circuit 12 and performs the address decode operation as part of the memory access operation (e.g., data read or write), the core 10 has a relatively long access time. FIG. 2 illustrates an example of a timing diagram for the conventional memory core 10 illustrated in FIG. 1. For this example it is presumed that the memory core 10 is a SRAM device. The core 10 is driven by a clock signal CLOCK, and the read operation begins at time t0 and ends at time t1. The typical access time taccess for the conventional memory core 10 includes the time required for the memory core circuitry to properly latch the address signals thold (often referred to as the "hold time"), the time required to decode the address lines tadec, the time required to drive the corresponding word line(s) twrd, the time required to drive the bit lines tbit, and the time required by the output logic to output the accessed information top. Thus, for the conventional memory core 10 (FIG. 1), the access time taccess is calculated as follows:
taccess = thold + tadec + twrd + tbit + top.    (1)
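As a worked illustration of equation (1), the sketch below sums the five delay components. The delay values are hypothetical figures in picoseconds chosen only to make the arithmetic concrete; they do not come from the source or from FIG. 2.

```python
def access_time(t_hold, t_adec, t_wrd, t_bit, t_op):
    """Equation (1): the conventional core's access time is the sum of all five delays."""
    return t_hold + t_adec + t_wrd + t_bit + t_op


# Hypothetical delays in picoseconds (assumed values, not measured data).
delays_ps = {
    "t_hold": 100,  # latching the address signals
    "t_adec": 400,  # decoding the address lines
    "t_wrd": 300,   # driving the word line(s)
    "t_bit": 500,   # driving the bit lines
    "t_op": 200,    # output logic
}

t_access_ps = access_time(**delays_ps)
print(t_access_ps)  # prints 1500
```

With these assumed figures, the decode term accounts for 400 of the 1500 picoseconds, which gives a sense of how much each component contributes to the total.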
It is desirable to reduce the access time taccess of the memory core so that the core could be used in a very wide, very fast, distributed memory device, such as a very wide, very fast, distributed register file in a SIMD MPP device.
Accordingly, there is a desire and need for a memory core having a substantially reduced access time so that the core can be implemented in a very wide, very fast, distributed memory device.
The present invention provides a memory core having a substantially reduced access time.
The present invention also provides a very wide, very fast, distributed memory device.
The present invention also provides a very wide, very fast, distributed register file in a SIMD MPP device.
The above and other features and advantages of the invention are achieved by providing a memory core with an access time that does not include a delay associated with decoding address information. Address decode logic is removed from the memory core and the address decode operation is performed in an addressing pipeline stage that occurs during a clock cycle prior to a clock cycle associated with a memory access operation for the decoded address. After decoding the address in a first pipeline stage, the external decode logic drives word lines connected to the memory core in a subsequent pipeline stage. Since the core is being driven by word lines, the appropriate memory locations are accessed without decoding the address information within the core. Thus, the delay associated with decoding the address information is removed from the access time of the memory core.
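The pipelined arrangement summarized above can be sketched as a two-stage loop: in one clock cycle the external logic decodes an address, and in the next cycle the resulting word lines drive a core that contains no decoder. The class and function names (DecodelessCore, external_decode, run_pipeline) and the single pipeline register are assumptions made for illustration, not the circuit of the invention.

```python
class DecodelessCore:
    """Hypothetical memory core that is driven by word lines and does no address decoding."""

    def __init__(self, num_words):
        self.rows = [0] * num_words

    def access(self, word_lines):
        # The asserted word line selects the row directly.
        return self.rows[word_lines.index(1)]


def external_decode(address, num_words):
    """Addressing pipeline stage: decode performed outside the core."""
    return [int(line == address) for line in range(num_words)]


def run_pipeline(core, addresses):
    """Decode address n in one cycle; access the core with it in the next cycle."""
    num_words = len(core.rows)
    latched_word_lines = None          # pipeline register between the two stages
    results = []
    for address in list(addresses) + [None]:   # one extra cycle drains the pipeline
        # Access stage: uses word lines decoded during the previous cycle.
        if latched_word_lines is not None:
            results.append(core.access(latched_word_lines))
        # Decode stage: prepares word lines for the next cycle.
        if address is not None:
            latched_word_lines = external_decode(address, num_words)
        else:
            latched_word_lines = None
    return results


core = DecodelessCore(8)
core.rows = [10 * row for row in range(8)]
data = run_pipeline(core, [3, 0, 7])
```

In this sketch the access stage never calls external_decode, which mirrors the point that the decode delay falls in the preceding cycle rather than within the core's access time. The cost is one cycle of latency before the first result, after which one access can complete per cycle.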