1. Technical Field of the Invention
The present invention relates to bypass networks in microprocessors and, more particularly, to bypass networks including multiported bypass caches.
2. Description of Prior Art
In a superscalar microprocessor, pieces of data are stored in a register file to be available for use by execution units which are associated with pipelines. It can take four or more clock cycles for a piece of data produced by an execution unit to be written to the register file and then read from the register file to be available for the same or another execution unit. The delay in the availability of the data is referred to as latency. However, the same or another execution unit may need the piece of data before it is available, perhaps even at the very next cycle. If the required piece of data is not available, the execution unit may be idle or otherwise inefficiently used while waiting for the piece of data. The problem is exacerbated by an increased number of pipeline stages associated with increased clock frequencies and sizes of internal memory.
As a partial solution to the problem, bypass networks have been employed to hold the piece of data for several clock cycles. The contents of the bypass network are more immediately available to the execution unit than are the contents of the register file, thereby reducing waiting by the execution unit. For example, referring to FIG. 1, a prior art bypass unit 10 includes an array of shift register data latches DL1, DL2, DL3, and DL4 that receive pieces of data from an execution unit. There are corresponding shift register address latches AL1, AL2, AL3, and AL4. Each piece of data is assigned an address in the register file. The address in address latch AL1 is the address assigned to the data in data latch DL1. Likewise, the addresses in address latches AL2, AL3, and AL4 are the addresses assigned to the data in data latches DL2, DL3, and DL4, respectively. The addresses in address latches AL1, AL2, AL3, and AL4 are referred to as destination addresses.
Just prior to a piece of data being written into data latch DL1, the data in DL3 is shifted into DL4, the data in DL2 is shifted into DL3, and the data in DL1 is shifted into DL2. Likewise, the address in AL3 is shifted into AL4, the address in AL2 is shifted into AL3, and the address in AL1 is shifted into AL2. The address assigned to the data written into DL1 is written into AL1. Shifting (from AL3 to AL4, AL2 to AL3, and AL1 to AL2) may occur with each clock cycle.
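The shifting behavior described above may be illustrated by a minimal software sketch. It is a behavioral model only, not a description of the circuit of FIG. 1. The class name ShiftBypass, the method name write, and the sample addresses and data values are assumptions introduced solely for illustration.

```python
class ShiftBypass:
    """Toy behavioral model of a four-entry shift-register bypass unit."""

    def __init__(self, depth=4):
        # Index 0 models DL1/AL1; index depth-1 models DL4/AL4.
        self.data = [None] * depth
        self.addr = [None] * depth

    def write(self, address, value):
        # Shift every entry one position toward the last latch, dropping
        # the contents of the last latch, then load the new datum and its
        # destination address into the first position.
        self.data = [value] + self.data[:-1]
        self.addr = [address] + self.addr[:-1]


# Hypothetical sequence of three results produced on successive cycles.
bypass = ShiftBypass()
for address, value in [(0b000001, "A"), (0b000110, "X"), (0b000011, "B")]:
    bypass.write(address, value)
```

In this sketch every write moves every stored entry, which mirrors the per-cycle shifting of both data and addresses in the latches.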
The pieces of data in data latches DL1, DL2, DL3, and DL4 are selectively supplied to a multiplexer (MUX) 14 through a group of conductors 16, 18, 20, and 22. It will be appreciated that each of conductors 16, 18, 20, and 22 comprises numerous parallel conductors. The particular piece of data that is passed by MUX 14 to conductors 26 is controlled by the state of signals on conductors 30, 32, 34, and 36. The state of the signals on conductors 30, 32, 34, and 36 is controlled by comparators 40, 42, 44, and 46.
For example, assume that a piece of data X is contained in data latch DL2 and that data X is assigned an address 000110. Accordingly, 000110 will be stored in address latch AL2. Because each piece of data is assigned a different address in the register file, address latches AL1, AL3, and AL4 will not contain 000110. If the microprocessor scheduler determines that data X is to be provided to conductors 26, the value 000110 is written as a source address to a conductor 48. The value 000110 is passed to each of comparators 40, 42, 44, and 46, where it is compared with the addresses in address latches AL1, AL2, AL3, and AL4, respectively. Because the contents of address latch AL2 matches the value on conductor 48, a signal on conductor 32 is asserted, while the states of conductors 30, 34, and 36 remain deasserted. Accordingly, MUX 14 passes data X from data latch DL2 on conductors 18 to conductors 26.
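The comparison and selection in the foregoing example may be sketched in software as follows. The function name bypass_lookup and the contents assumed for latches other than AL2 and DL2 are hypothetical. The list of booleans stands in for the signals on conductors 30, 32, 34, and 36.

```python
def bypass_lookup(addr_latches, data_latches, source_address):
    # One comparator per address latch: each comparison yields one
    # select signal, asserted (True) only on a match.
    select = [latched == source_address for latched in addr_latches]
    # The MUX passes the data whose select signal is asserted, if any.
    for asserted, value in zip(select, data_latches):
        if asserted:
            return select, value
    return select, None


# Assumed latch contents: AL2 holds 000110 and DL2 holds data X.
addr_latches = [0b000011, 0b000110, 0b000001, 0b001000]
data_latches = ["B", "X", "A", "C"]
select, out = bypass_lookup(addr_latches, data_latches, 0b000110)
```

With these assumed contents only the second select signal is asserted, so data X is passed to the output. A source address that matches no latch leaves all select signals deasserted.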
There are, however, significant problems with the use of bypass units such as bypass unit 10. First, with each clock cycle, data and addresses are shifted. Over time, this consumes an appreciable amount of power.
Second, such bypass units take up a relatively large amount of microprocessor real estate. The fan-in on MUX 14 is at least as great as the product of the number of data latches and the number of bits per piece of data. Typically, the number of data latches in a bypass unit is at least equal to the number of cycles of the write-read latency. Further, bypass unit 10 holds pieces of data for only a single execution unit. The total real estate increases with the number of execution units.
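The scaling noted above may be made concrete with a short calculation. The figures used (four data latches, 64 bits per piece of data, six execution units) and the premise of one bypass unit per execution unit are assumptions chosen only for illustration. The function names are likewise hypothetical.

```python
def mux_fan_in(num_data_latches, bits_per_datum):
    # Lower bound on MUX fan-in: one input per bit of every data latch.
    return num_data_latches * bits_per_datum


def total_mux_inputs(num_execution_units, num_data_latches, bits_per_datum):
    # Assumes one bypass unit, and hence one MUX, per execution unit.
    return num_execution_units * mux_fan_in(num_data_latches, bits_per_datum)


fan_in = mux_fan_in(4, 64)           # 4 latches x 64 bits = 256 inputs
total = total_mux_inputs(6, 4, 64)   # 6 units x 256 inputs = 1536 inputs
```

Under these assumptions the input count grows linearly in each of the three quantities, so deeper write-read latency, wider data, and additional execution units each add to the area consumed.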
Accordingly, there is a need for a bypass network that efficiently uses power and microprocessor real estate, yet provides execution units with ready access to pieces of data.