1. Field of the Invention
The present invention relates to the field of computers, and particularly to a method and apparatus for selectively enabling individual sets of registers in a row of a register array.
2. Description of the Related Art
Historically, simple microprocessors have processed instructions one after another. For instance, in one instruction processing pipeline, simple microprocessors often sequentially (1) fetch an instruction, (2) fetch an operand for that instruction, (3) execute the instruction, and (4) write back the result of the execution to a register. In addition, to further increase their instruction processing speed, a number of prior art single pipeline processors overlap their pipeline stages so that they can operate on several instructions simultaneously.
Furthermore, in recent years, an increasing number of processors have further increased their instruction processing speed by simultaneously processing several instructions in several parallel instruction processing pipelines. These processors are referred to as superscalar processors, in which N parallel instruction pipelines provide the ability to execute N instructions simultaneously (where N is an integer representing the number of parallel pipelines). In order for superscalar processors to operate efficiently, these processors have an instruction fetch unit which, during the instruction fetch stage, can provide its N instruction pipelines with N instruction words, by (1) supplying an instruction pointer identifying the starting address of the N instruction words to an instruction memory (such as an instruction cache), and (2) retrieving the N instruction words from the instruction memory.
Unfortunately, many prior art superscalar processors do not fully take advantage of their increased instruction processing speed (i.e., do not take advantage of their parallel processing capability) because they have slow instruction retrieval speeds as they use instruction memory arrays that have slow instruction outputting speeds. Prior art instruction memory arrays have slow instruction outputting speeds because they cannot simultaneously enable in one clock cycle all N sets of registers that store the requested N instruction words (i.e., cannot in one clock cycle cause all the registers in the N sets of registers to output their data on their differential output bit line pairs).
FIG. 1 presents one prior art instruction memory array used in prior art superscalar processors. As shown in this figure, prior art instruction memory array 100 is a MOS cache that has a one word width (i.e., it is a cache which stores one and only one instruction word per row). Memory array 100 has a slow instruction outputting speed, because only one word can be enabled per clock cycle by row decoder 105 (i.e., only one set of registers can be forced to output their contents in a clock cycle).
FIGS. 2A and 2B present another prior art instruction memory used in prior art superscalar processors. As shown in these figures, instruction memory array 200 is a MOS cache which has a width of four words (i.e., each row of cache 200 contains four words). This prior art instruction memory array has a faster instruction outputting speed than instruction memory 100 of FIG. 1, because (unlike memory 100 which only enables one word per clock cycle) row decoder 205 enables four words per clock cycle by providing an enable signal on one of the row enable lines X.
However, even this prior art memory array does not provide the fastest instruction accessing speed because row decoder 205 requires two clock cycles to enable a set of data words on two different rows. In other words, instruction memory array 200 does not have an optimal instruction outputting speed because it cannot enable non-overlapping data word sets on two different rows in one clock cycle. Consequently, prior art superscalar processors that utilize either cache 100 or cache 200 do not fully take advantage of their increased instruction processing speed because they have slow instruction retrieval speeds as they use instruction memory arrays that have slow instruction outputting speeds.
In the prior art implementation of FIG. 2A, maximum throughput for delivering a set of N instruction words occurs only when the instruction pointer (address) of the set is "aligned" to the physical address boundary; that is, all instruction words of the set are located on the same row (e.g., all N instruction words are located on row X0).
When the instruction pointer is not physically aligned (i.e., is "unaligned"), the set of N instruction words spans two physical rows (e.g., X0 and X1). In the instruction memory array of FIG. 2A, the selector block can select only one physical row per access cycle; hence, accessing an "unaligned" data set takes two access cycles, thereby halving the access speed.
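The aligned and unaligned cases described above can be illustrated with a minimal sketch. It assumes a simplified model in which words are numbered consecutively, each row holds a fixed number of words, and the row decoder enables exactly one row per access cycle. The function name `access_cycles` and its parameters are hypothetical and do not appear in the figures.

```python
def access_cycles(start_word, n_words, row_width):
    """Count the access cycles needed to read n_words consecutive words.

    Assumed model (not taken from the figures): words are numbered
    0, 1, 2, ... and word w is stored on row w // row_width. One row
    can be enabled per access cycle, so the cycle count equals the
    number of rows the requested set touches.
    """
    first_row = start_word // row_width
    last_row = (start_word + n_words - 1) // row_width
    return last_row - first_row + 1


if __name__ == "__main__":
    # Four-word-wide array, fetching sets of four words.
    for start in range(4):
        print("start word", start, "->", access_cycles(start, 4, 4), "cycle(s)")
    # One-word-wide array, fetching a set of four words.
    print("one-word rows ->", access_cycles(0, 4, 1), "cycle(s)")
```

Under these assumptions, a four-word set needs one cycle in the four-word-wide array only when its starting word falls on a row boundary, two cycles otherwise, and four cycles in a one-word-wide array.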
Therefore, there is a need for an optimal data dispatch method and apparatus for allowing an N-word wide memory array to simultaneously enable a non-overlapping set of registers that store N words on unaligned addresses. In addition, there is a more specific need for a method and an apparatus for reducing a superscalar processor's instruction retrieval time (i.e., the time needed to obtain N requested instruction words from an instruction memory) and thereby taking advantage of the parallel pipeline processing capability of superscalar processors. In other words, to improve further the instruction processing efficiency of superscalar processors, there is a need for an N-wide register array (i.e., a register array that has N sets of registers in a row of registers) that can quickly enable N non-overlapping sets of registers when it receives an instruction pointer identifying the N instruction words, regardless of the address alignment (e.g., unaligned address). Consequently, a method and apparatus that can issue, in one clock cycle, the instruction word pointed to by the instruction pointer and the N-1 instruction words that follow it is desirable.
FIGS. 3A and 3B depict an embodiment of an SRAM cache memory that enables unaligned addressing of four-word data sets. This unaligned data fetching is enabled by employing four word lines along a physical row in the data array and selecting the appropriate word lines across two contiguous rows to select the unaligned four-word set. This embodiment bears the overhead of requiring four separate physical word lines (WL0-WL3) to pass through each row in the array, as well as circuitry in the select blocks for separately selecting the appropriate word lines among one or two contiguous rows.
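The word line selection described above can be sketched as follows. This is an illustrative model only: it assumes that word w is stored on row w // 4 and is enabled by word line w % 4 of that row, and the function name `select_word_lines` is hypothetical rather than taken from the figures.

```python
def select_word_lines(start_word, n=4):
    """Return the (row, word_line) pairs to enable for n consecutive words.

    Assumed mapping (not taken from the figures): word w lives on row
    w // n and is driven by word line w % n of that row. divmod returns
    exactly that (row, word_line) pair.
    """
    return [divmod(start_word + k, n) for k in range(n)]


if __name__ == "__main__":
    # Unaligned set starting at word 2: two word lines on row 0 and
    # two word lines on row 1.
    for row, word_line in select_word_lines(2):
        print("row", row, "word line WL%d" % word_line)
```

In this model, any n consecutive words map to n distinct word line indices spread over at most two contiguous rows, so the selected word sets never overlap on the output bit lines.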
Another method to enable the unaligned access function would be to use a single physical word line, but in conjunction with N times as many selector blocks (i.e., x-decoder blocks). In the memory array of FIG. 3A, four selector blocks are required. The penalty to be paid in this case would be the large area required to implement the multiple row selector blocks.
FIGS. 3A and 3B illustrate two prior art circuits for accessing data sets that fall along an unaligned address. A problem with the first circuit, shown in FIG. 3A, is the need for four word lines running through the memory cell. Accordingly, in almost every conceivable implementation, this prior art approach increases the size of the memory cell, and correspondingly increases the cost to manufacture this cell. Another problem with this approach is the slightly longer memory access time due to the higher parasitic loading stemming from the larger memory cell size.
The second circuit, illustrated in FIG. 3B, employs four independent selector blocks in conjunction with a typical single word line memory cell. This approach also suffers from the massive area overhead incurred by using multiple selector blocks. Accordingly, the increased layout area increases the cost of manufacture and decreases memory access performance.