1. Field of the Invention
The present invention relates to a method of and an apparatus for supplying instructions to a processor.
2. Description of the Prior Art
Recent high-performance processors employ many functional blocks to improve the performance thereof. These functional blocks must efficiently operate to increase the operation efficiency and performance of the processors.
To efficiently operate the functional blocks, it is necessary to let them simultaneously execute as many instructions as possible. The degree of simultaneous execution of instructions is called ILP (instruction level parallelism). The higher the ILP of functional blocks, the higher the performance of the processors that employ the functional blocks.
The ILP is greatly dependent on the size of a basic block of instructions. A basic block is a string of instructions that includes no branch instruction or branch target instruction. A program to be executed by a processor is divided into basic blocks according to branch and branch target instructions. Generally, a larger basic block results in a higher ILP.
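The division of a program into basic blocks can be illustrated with a short sketch. It is not taken from the disclosure: the instruction encoding (a name paired with an optional branch target index) and the function name split_basic_blocks are assumptions, and the sketch follows the conventional rule in which a branch instruction ends a block and a branch target instruction starts one.

```python
def split_basic_blocks(program):
    """Split a program into basic blocks.

    `program` is a list of (name, target) pairs. `target` is the index of
    the branch target instruction, or None for a non-branch instruction.
    This encoding is hypothetical and used only for illustration.
    """
    n = len(program)
    leaders = {0} if n else set()
    for i, (_, target) in enumerate(program):
        if target is not None:
            leaders.add(target)        # a branch target starts a new block
            if i + 1 < n:
                leaders.add(i + 1)     # the instruction after a branch starts a new block
    starts = sorted(leaders)
    blocks = []
    for k, start in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else n
        blocks.append([name for name, _ in program[start:end]])
    return blocks


# One branch at index 2 that jumps to index 5 yields three basic blocks.
example = [("i0", None), ("i1", None), ("br", 5),
           ("i3", None), ("i4", None), ("i5", None), ("i6", None)]
blocks = split_basic_blocks(example)
```

In this example every additional branch would cut the blocks further, which is why branch-heavy programs tend to have small basic blocks.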
FIG. 13 is a block diagram showing a RISC (reduced instruction set computer) processor according to a prior art.
This processor consists of an instruction cache 1 for temporarily storing instructions, a data cache 3 for temporarily storing operand data, a main memory 5 for storing programs and data, a program counter 7 for latching the address of an instruction to execute, an instruction decoder 9 for decoding an instruction and generating various control signals, functional units 11 for carrying out various operations, and a register file 13 for temporarily storing data used by the functional units 11.
The processor uses an address in the program counter 7 to access the instruction cache 1 and fetch instructions therefrom. The fetched instructions are sent to the instruction decoder 9, which decodes the instructions. According to the decoded instructions, the functional units 11 carry out operations with the use of data stored in the register file 13. When needing data stored in the main memory 5, the processor accesses the data cache 3 and transfers the needed data from the data cache 3 to the register file 13.
Most recent programs, except numerical calculation programs, involve many branch instructions, which reduce the size of each basic block of instructions. It is difficult, therefore, to increase the ILP.
The instruction cache 1 contains cache lines so that instructions in one cache line may simultaneously be read out of the instruction cache 1. It is impossible to simultaneously read two instruction strings belonging to two basic blocks that are stored in two different cache lines in the instruction cache 1. In this case, the two instruction strings must separately be read, rearranged, and combined.
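The cache line limitation can be made concrete with a minimal sketch. The line size of four instructions, the use of plain integer instruction addresses, and the names LINE_SIZE, line_index, and accesses_needed are all assumptions for illustration; the sketch only counts how many separate cache line reads a set of instruction addresses requires when one read returns one line.

```python
LINE_SIZE = 4  # assumed number of instructions per cache line


def line_index(addr, line_size=LINE_SIZE):
    """Return the index of the cache line that holds instruction address `addr`."""
    return addr // line_size


def accesses_needed(addrs, line_size=LINE_SIZE):
    """Count the distinct cache lines touched, i.e. the separate reads required."""
    return len({line_index(a, line_size) for a in addrs})


same_line = accesses_needed([2, 3])            # both addresses fall in line 0
two_lines = accesses_needed([2, 3, 9, 10])     # addresses fall in lines 0 and 2
```

Under these assumptions, an instruction string that continues at a branch target in another line always costs a second read, after which the two strings still have to be rearranged and combined.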
Accordingly, the conventional processor is very low in instruction supply efficiency, and therefore, is incapable of realizing high ILP and high performance.
To solve this problem, T. M. Conte et al. have proposed a collapsing buffer in "Optimization of Instruction Fetch Mechanism for High Issue Rate," Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 333-344, 1995. This technique simultaneously supplies an instruction string made of several basic blocks to a processor. The prior art of FIG. 13 cannot achieve this.
If one cache line includes a basic block serving as a branch source and another includes a basic block serving as a branch target, the collapsing buffer simultaneously accesses the two cache lines, removes instructions that are not executed from the instruction strings in the cache lines, and provides an instruction string made of the two basic blocks that are continuous to each other.
Techniques similar to the above disclosure have been proposed by S. Dutta et al. in "Block-Level Prediction for Wide Issue Superscalar Processors," Proceedings of 1st International Conference on Algorithms and Architectures for Parallel Processing, pp. 143-152, 1995, by A. Seznec et al. in "Multiple-Block Ahead Branch Predictors," Proceedings of Architectural Support for Programming Languages and Operating Systems, 1996, and by S. Wallace in "Instruction Fetching Mechanism for Superscalar Microprocessors," Proceedings of Euro-Par, 1996.
Each of these techniques predicts the address of a branch target instruction, simultaneously accesses cache lines that store a branch source instruction string and a branch target instruction string, and provides an instruction string made of two basic blocks.
FIG. 14 is a block diagram showing a processor having the collapsing buffer, according to a prior art. The same parts as those of FIG. 13 are represented with the same reference marks and are not explained again.
The collapsing buffer 15 is arranged between an instruction cache 1 and an instruction decoder 9. A branch target buffer 17 receives an address from a program counter 7, and if an instruction specified by the address is a branch instruction, predicts the address of a branch target instruction. The branch target buffer 17 then provides the predicted address.
The instruction cache 1 receives the address from the program counter 7 as well as the predicted address from the branch target buffer 17, and the collapsing buffer 15 fetches two instruction strings from the addresses and rearranges and combines them.
The operation of rearranging and combining instruction strings by the collapsing buffer 15 will be explained in detail.
FIG. 15 shows the collapsing buffer 15 and the periphery thereof. The instruction cache 1 is divided into banks 19 and 21 each having access ports so that instructions having continuous addresses over two banks may simultaneously be accessed. Namely, two continuous basic blocks may be accessed at the same time.
The collapsing buffer 15 consists of an interchange switch 23, a first instruction buffer 25, and a second instruction buffer 27. The interchange switch 23 receives two instruction strings from the banks 19 and 21 and rearranges them in order of execution. The first instruction buffer 25 specifies actually executed instructions in the rearranged instruction strings. The second instruction buffer 27 receives a string of the specified instructions and transfers it to the instruction decoder 9.
If the program counter 7 specifies an address where an instruction "c" is stored, a cache line containing a string of instructions a, b, c, and d including the instruction c in question is supplied from the bank 19 to the interchange switch 23. If the instruction "d" in this instruction string is a branch instruction, the branch target buffer 17 receives the address of the instruction d and determines a branch target address jumped from the branch instruction d.
If a branch target instruction at the branch target address is an instruction "f," a cache line containing a string of instructions e, f, g, and h including the instruction f in question is supplied from the bank 21 to the interchange switch 23. The interchange switch 23 rearranges the instruction string of a, b, c, and d and the instruction string of e, f, g, and h. Even if the addresses of the instructions a, b, c, and d are larger than those of the instructions e, f, g, and h, the interchange switch 23 rearranges them in order of a, b, c, d, e, f, g, and h because this is their execution order.
Then, the interchange switch 23 sends the instruction string of a to h to the first instruction buffer 25. The first instruction buffer 25 determines instructions to be executed actually. The actually executed instructions are c, d, f, and g, which are transferred to the second instruction buffer 27. The second instruction buffer 27 supplies these instructions to the instruction decoder 9.
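The rearranging and combining operation described above can be sketched in software as follows. This is only a behavioral illustration, not the hardware structure of FIG. 15: the function names interchange and collapse and the index parameters are assumptions. In particular, the text does not state why the instruction h is not executed, so the sketch takes the last executed position in the branch target line as an input rather than deriving it.

```python
def interchange(source_line, target_line):
    """Model of the interchange switch 23: put the branch source line
    ahead of the branch target line, i.e. in execution order."""
    return list(source_line) + list(target_line)


def collapse(source_line, target_line, start, branch, target, end):
    """Model of the instruction buffers 25 and 27.

    start  : position in source_line addressed by the program counter
    branch : position in source_line of the branch instruction
    target : position in target_line of the branch target instruction
    end    : position in target_line of the last executed instruction
             (supplied by the caller; an assumption of this sketch)
    """
    merged = interchange(source_line, target_line)
    # Mark the instructions that are actually executed.
    mask = [start <= i <= branch for i in range(len(source_line))]
    mask += [target <= j <= end for j in range(len(target_line))]
    # Remove the unmarked instructions and close the gaps.
    return [ins for ins, keep in zip(merged, mask) if keep]


# Cache line a, b, c, d from bank 19 and cache line e, f, g, h from bank 21.
issued = collapse(list("abcd"), list("efgh"), start=2, branch=3, target=1, end=2)
```

With the program counter at c, the branch at d, the branch target at f, and the run ending at g, the sketch yields the string c, d, f, g given in the description.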
The branch target buffer 17 receives an address from the program counter 7, and if the address corresponds to a branch instruction, predicts the address of a branch target instruction and issues the branch target address.
FIG. 16 shows the structure of the branch target buffer 17.
The branch target buffer 17 consists of an instruction address tag unit 29 for comparing an address in the program counter 7 with tag addresses, and a branch target address unit 31 for storing branch target addresses corresponding to the tag addresses.
The branch target buffer 17 compares an address in the program counter 7 with each of the tag addresses stored in the instruction address tag unit 29, and if the address in the program counter 7 agrees with any one of the tag addresses, provides a branch target address corresponding to the agreed tag address from the branch target address unit 31.
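The lookup performed by the branch target buffer 17 can be sketched as below. The class name BranchTargetBuffer, its methods, and the example addresses are hypothetical. A Python dictionary stands in for the instruction address tag unit 29 and the branch target address unit 31; an actual buffer holds a limited number of entries and compares tags in hardware, which this sketch does not model.

```python
class BranchTargetBuffer:
    """Behavioral stand-in for the branch target buffer 17."""

    def __init__(self):
        # Maps a tag address (address of a branch instruction)
        # to the corresponding branch target address.
        self._entries = {}

    def record(self, branch_addr, target_addr):
        """Store a branch target address under the tag of its branch instruction."""
        self._entries[branch_addr] = target_addr

    def predict(self, pc):
        """Return the branch target address if `pc` agrees with a stored tag,
        otherwise None (the address does not correspond to a known branch)."""
        return self._entries.get(pc)


btb = BranchTargetBuffer()
btb.record(0x103, 0x205)       # hypothetical addresses of a branch and its target
hit = btb.predict(0x103)       # tag agrees: the stored target address is provided
miss = btb.predict(0x100)      # no agreeing tag: nothing is provided
```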
The instruction supply technique using the collapsing buffer has the problem of lowering the speed of a processor. This is because implementing this technique needs a plurality of branch target address prediction mechanisms, such as branch target buffers, and requires the instruction cache to be divided into banks, which increases the quantity of hardware. In addition, instruction arranging mechanisms and instruction string combining mechanisms, such as the collapsing buffer, are necessary.
These hardware components have very intricate structures and therefore lengthen the instruction supply time. Since the speed of a processor is dependent on the instruction supply time, a long instruction supply time slows down the processor.
The slowdown in the processor speed may be avoided by increasing the number of pipeline stages. This, however, degrades the performance of the processor itself.
In the processor of FIG. 14, picking up a branch target instruction out of the instruction cache 1 according to a branch target address provided by the branch target buffer 17 takes a certain time. This time is substantially equal to the usual time of picking up an instruction out of the instruction cache 1 according to an address provided by the program counter 7. If the address in the program counter 7 represents a branch instruction, about twice the usual time is therefore needed to fetch a branch target instruction from the instruction cache 1. Namely, the route passing through the program counter 7, branch target buffer 17, and instruction cache 1 lowers the processor speed.
Further, inserting the collapsing buffer 15 between the instruction cache 1 and the instruction decoder 9 slows down the processor speed.
In this way, the instruction supply technique of the prior art degrades the operation speed and performance of a processor.