This invention relates generally to computer organization and more particularly to a register file and pipeline organization in a computer architecture having a large number of registers.
A typical multiported register file 10 is shown in FIG. 1. The register file 10 includes N registers each having M read ports and at least one write port. Coupled to the register file 10 are instruction decoders 12 which decode instructions held in a number L of instruction registers 14. Typically there are two read ports for each instruction register, i.e., M = 2 × L, to allow both source operands to be fetched simultaneously. The plurality of instruction registers 14 includes L registers, with each register being associated with a corresponding functional unit (not shown). This organization is typical for a superscalar architecture or a very long instruction word (VLIW) architecture, wherein each instruction register 14 is associated with a corresponding functional unit. The decoders 12 decode the register fields of the instruction registers 14 and select the corresponding register in the register file 10. Also coupled to the register file 10 are a plurality of registers 16. Each of the registers 16 is coupled to a respective one of the output ports or read ports of the register file 10.
A detailed schematic of an individual register cell 18 of the register file 10 is shown in FIG. 2. The cell 18 includes two inverters I1 and I2 connected in a circular fashion to form the basis of the register cell. The register cell of FIG. 2 includes two read ports (P1 and P2) and a single write port (W). The write port includes pass transistor 20 connected between a write bit line Bit Line W and an input of the register cell. The first read port includes transistors 22 and 26 and the second read port includes transistors 24 and 25, each port being connected in a conventional manner, as is known in the art. It is apparent that with the addition of each read port the size of the register cell increases. This increased size of the register cell increases the access time of the overall register file due to the increase in capacitance and resistance of the individual cells. This problem is exacerbated if the number of registers in the register file is relatively large as well because of the increased capacitance and resistance of the bit lines in the register file. In fact, it can be shown that the access time is a quadratic function of the number of functional units and the number of registers.
Simulations of the register file of FIGS. 1 and 2 demonstrate the relationship of the access time of the register file as a function of the number of functional units and the number of registers in the register file. The results of these simulations are shown in FIG. 3. In FIG. 3, the access time of the register file (T_ACCESS) is plotted as a function of the number of functional units for a number of different sized register files. The access time as a function of the number of functional units for register files having 32, 64, 128, 192 and 256 registers is shown in plots 32, 34, 36, 38 and 40, respectively, in FIG. 3.
In a microprocessor having a pipelined architecture, the cycle time, i.e., the time allocated to the execution of each pipestage, is determined by the single-stage operation that requires the longest time interval. Because each cycle of the microprocessor typically has the same time duration, the cycle time cannot be less than the duration of the operation having the longest time interval. The operational path in a microprocessor associated with the longest execution time interval is therefore referred to as the critical path of the microprocessor.
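The constraint described above can be illustrated with a minimal sketch. The stage names and delay values are hypothetical, chosen only for illustration; they are not taken from the simulations of FIG. 3.

```python
def cycle_time(stage_delays_ns):
    """Return (minimum cycle time, critical stage) for a pipeline.

    stage_delays_ns maps a stage name to the delay of that stage in
    nanoseconds. Every cycle has the same duration, so the cycle time
    is bounded below by the slowest stage.
    """
    critical_stage = max(stage_delays_ns, key=stage_delays_ns.get)
    return stage_delays_ns[critical_stage], critical_stage


# Hypothetical per-stage delays (illustrative assumptions only).
stages = {"fetch": 1.2, "reg_access": 2.4, "execute": 2.0}
t_cycle, critical = cycle_time(stages)
print(f"cycle time >= {t_cycle} ns, critical path: {critical}")
```

With these assumed values the register access stage, rather than the execute stage, sets the cycle time.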
In the past, the critical path in a microprocessor has been associated with a functional unit in the processor, such as an arithmetic logic unit (ALU) which may require a relatively long period of time to perform a complex operation upon data. However, as the number of functional units and/or number of registers within microprocessors increases, the access time of the register file T_ACCESS can become the critical path of the microprocessor. For example, assuming a critical path of two nanoseconds, for a register file having 128 registers, a computer architecture having over four functional units will result in the register file becoming the critical path in the computer. This relationship is shown in plot 36 of FIG. 3. Superscalar or VLIW architectures are capable of supporting significantly more functional units than four. As a result, the access time for the register file in superscalar or VLIW architectures can become a significant obstacle to achieving very fast cycle times.
One approach to alleviating the time required for accessing the register file has been to divide the register file. FIG. 4 illustrates an example of a divided register file in a pipelined microprocessor architecture. Instruction registers 42 and 72 each receive an instruction for execution which can include an access to a register in either cell array 50 or cell array 60. Row decoders 44 and 74 decode the instruction in registers 42 and 72. A successful decode in either row decoder 44 or 74 will result in a register word line output to word line driver 46 or 76, respectively. Word line drivers 46 and 76, in turn, drive the word line corresponding to the selected register in the corresponding cell arrays 50 or 60 in order to access a cell 52 or 62 within arrays 50 and 60, respectively. Only one of cell array 50 or cell array 60 will typically be accessed in a given pipeline cycle. Enable logic 92, which receives the word line outputs (or register selection signals) from row decoders 44 and 74, enables one of sense amplifiers 54 and 64 to drive GLOBAL BIT LINE.
When cell 52 or 62 is activated responsive to the word line output from drivers 46 and 76, the activated cell drives its data onto the corresponding LOCAL BIT LINE, which is typically relatively long, resulting in a high capacitance and slow response, and into the corresponding sense amplifier 54 or 64, respectively. The sense amplifiers 54 and 64, only one of which is active at a given time, drive the data from cell 52 or 62 onto the GLOBAL BIT LINE which is input to bypass multiplexor (MUX) 80. The GLOBAL BIT LINE is also typically long, since the bypass MUX can be located at a significant distance from the register file, and therefore has a high capacitance and relatively slow response. Finally, the output of bypass MUX 80 is captured by pipestage register 90 for output, during the next pipestage, to a functional unit for execution of an operation upon the register data captured by the pipestage register.
The instruction registers 42 and 72 and the pipestage register 90 are each clocked as part of the instruction pipeline and represent pipestages in an instruction pipeline. Therefore, the pipestage delay for a register file access in the circuit of FIG. 4, using the path from instruction register 42 to pipestage register 90 for example, is composed of the accumulated delays of row decoder 44, word line driver 46, cell 52, the LOCAL BIT LINE for cell 52, sense amplifier 54, the GLOBAL BIT LINE and bypass MUX 80, plus the set-up time for pipestage register 90.
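The accumulated delay along this path can be sketched as a simple sum. The component list follows the path named above; the delay values, given in picoseconds, are hypothetical placeholders and do not come from any measurement or simulation of the circuit of FIG. 4.

```python
# Components on the path from instruction register 42 to pipestage
# register 90. All delay values are assumed for illustration only.
FIG4_PATH_PS = {
    "row decoder 44": 300,
    "word line driver 46": 200,
    "cell 52": 150,
    "LOCAL BIT LINE": 400,
    "sense amplifier 54": 250,
    "GLOBAL BIT LINE": 450,
    "bypass MUX 80": 150,
    "set-up time, pipestage register 90": 100,
}


def reg_access_delay_ps(path):
    """Sum the component delays that must fit within one pipestage."""
    return sum(path.values())


total_ps = reg_access_delay_ps(FIG4_PATH_PS)
print(f"register access pipestage delay: {total_ps} ps")
```

Under these assumed values the two bit lines contribute the largest share of the total, which is consistent with the description of both lines as long and highly capacitive.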
A simplified example of a succession of pipestages in the register organization of FIG. 4 is shown in FIG. 6A. Each register access stage REG ACCESS produces the data required for a subsequent execution stage EXECUTE. Once the microprocessor pipeline is full, the REG ACCESS stage for the next instruction takes place concurrently with the EXECUTE stage for the current instruction, as demonstrated in the time intervals from T1 to T2, T2 to T3, and T3 to T4, which correspond to cycle times of the microprocessor pipeline. In the interval from T1 to T2, the REG ACCESS pipestage for a second instruction in an execution sequence INSTR2 takes place at the same time that the EXECUTE pipestage is performed for a preceding instruction INSTR1 in the execution sequence.
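The overlap of pipestages can be sketched as a small schedule. This is an illustrative model only: the function name and the zero-based cycle index are assumptions, and cycle index 1 plays the role of the interval from T1 to T2 only if the pipeline is assumed to begin filling one cycle earlier.

```python
def schedule(instructions, stages=("REG ACCESS", "EXECUTE")):
    """Map each cycle index to the (instruction, stage) pairs active in it.

    Instruction i enters the first stage in cycle i and advances one
    stage per cycle, so adjacent instructions overlap.
    """
    timeline = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            timeline.setdefault(i + s, []).append((instr, stage))
    return timeline


for cycle, active in sorted(schedule(["INSTR1", "INSTR2", "INSTR3"]).items()):
    print(cycle, active)
```

In cycle index 1 of this model, INSTR1 occupies EXECUTE while INSTR2 occupies REG ACCESS, matching the overlap described for the two instructions above.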
When the register access is the critical path in the execution pipeline, then the cycle time can be no less than the time required for a register access stage which includes the delays of all the components in the path through the register file, as discussed above. FIG. 6A is a simplified representation of a pipestage sequence. There are typically other pipestages for other operations, such as an instruction fetch pipestage.
There are a variety of ways of constructing a divided register file which may omit or add certain elements or combine the elements somewhat differently. Commonly assigned U.S. Pat. No. 5,513,363 illustrates another example of a divided register file solution in a pipelined architecture. However, despite the access time reductions obtained through subdivision of register files, register files continue to grow in size and, accordingly, continue to represent a limitation on the minimum cycle time in microprocessors.
Accordingly, a need remains for lowering the cycle time in a pipelined computer architecture having a register file and multiple functional units.