The present invention concerns predecoding and steering instructions executed in a superscalar processor.
Most modern computer systems include a central processing unit (CPU) and a main memory. The speed at which the CPU can decode and execute instructions and operands depends upon the rate at which the instructions and operands can be transferred from main memory to the CPU. In an attempt to reduce the time required for the CPU to obtain instructions and operands from main memory, many computer systems include a cache memory between the CPU and main memory.
A cache memory is a small, high-speed buffer memory used to hold temporarily those portions of the contents of main memory that are expected to be used by the CPU in the near future. The main purpose of a cache memory is to shorten the time necessary to perform memory accesses, for either data or instruction fetches. The information located in cache memory may be accessed in much less time than information located in main memory. Thus, a CPU with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored.
A cache memory is made up of many blocks of one or more words of data. Each block has associated with it an address tag that uniquely identifies which block of main memory it is a copy of. Each time the processor makes a memory reference, an address tag comparison is made to see if a copy of the requested data resides in the cache memory. If the desired memory block is not in the cache memory, the block is retrieved from the main memory, stored in the cache memory and supplied to the processor. A cache memory used to store instructions is generally referred to as an instruction cache. A program counter is used to determine which instructions are to be fetched for execution.
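By way of illustration only, the address tag comparison described above may be sketched as follows. The sketch assumes a direct-mapped organization, in which the tag together with the block's position in the cache identifies the main memory block; the class name `DirectMappedCache`, its parameters, and the hit and miss counters are hypothetical and are not taken from any particular system.

```python
class DirectMappedCache:
    """Illustrative direct-mapped cache (an assumed organization)."""

    def __init__(self, num_blocks, words_per_block, main_memory):
        self.num_blocks = num_blocks
        self.words_per_block = words_per_block
        self.main_memory = main_memory      # modeled as a list of words
        self.tags = [None] * num_blocks     # one address tag per cache block
        self.blocks = [None] * num_blocks   # cached copies of memory blocks
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Split the word address into block number and offset in the block.
        block_number = address // self.words_per_block
        offset = address % self.words_per_block
        # The low part of the block number selects a cache block;
        # the remaining high part is the address tag.
        index = block_number % self.num_blocks
        tag = block_number // self.num_blocks

        if self.tags[index] == tag:
            # Tag match: a copy of the requested block is in the cache.
            self.hits += 1
        else:
            # Miss: retrieve the block from main memory and store it.
            self.misses += 1
            start = block_number * self.words_per_block
            end = start + self.words_per_block
            self.blocks[index] = self.main_memory[start:end]
            self.tags[index] = tag
        # In either case the requested word is supplied to the processor.
        return self.blocks[index][offset]
```

In this sketch, two addresses that map to the same cache block but carry different tags evict one another, so each reference after the other results in a miss.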
In some computer systems, parallel execution of instructions (called "bundling" of instructions) may be utilized to speed up computer operation. Processors which provide for parallel execution of instructions can be referred to as superscalar processors. Superscalar computers generally utilize more than one execution unit to provide for bundling of instructions. An execution unit is, for example, an arithmetic logic unit (ALU) or a floating point unit (FPU).
Even with multiple execution units, there are still limitations on which instructions may be bundled. For example, some instructions may have conflicts with other instructions. The type of conflict can take various forms. A resource conflict occurs when two instructions both use the same, limited processor resource. This may occur, for example, when both instructions require use of the same execution unit. Alternatively, a data dependency may result in a conflict. That is, when one instruction produces a result to be used by a next instruction, the two instructions cannot be bundled. Also, a procedural dependency may result in a conflict. For example, an instruction which follows a branch instruction cannot be bundled with the branch instruction, since execution of the instruction depends on whether the branch is taken. In order to determine whether two or more given instructions can be bundled, it is generally necessary to first decode the instructions. This may be done, for example, by an instruction decode unit.
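By way of illustration only, the three kinds of conflict described above may be checked for a pair of already-decoded instructions as sketched below. The sketch assumes a processor with exactly one ALU and one FPU, so that two instructions needing the same unit always conflict; the `Instr` record, its fields, and the function `bundling_conflict` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction used only for this sketch."""
    name: str
    unit: str                      # execution unit required: "ALU" or "FPU"
    dest: Optional[str] = None     # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read
    is_branch: bool = False


def bundling_conflict(first, second):
    """Return the kind of conflict that prevents bundling, or None.

    `first` precedes `second` in program order. Assumes one execution
    unit of each kind.
    """
    # Resource conflict: both instructions need the same execution unit.
    if first.unit == second.unit:
        return "resource"
    # Data dependency: the second instruction uses the first one's result.
    if first.dest is not None and first.dest in second.srcs:
        return "data"
    # Procedural dependency: the second instruction follows a branch.
    if first.is_branch:
        return "procedural"
    return None
```

A return value of `None` indicates that, under the stated assumptions, the two instructions may be bundled.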
Various methods have been advanced for minimizing the performance penalty for decoding and steering instructions to the proper execution unit. For example, compiler techniques may be used to assist the instruction decode unit to determine whether two or more instructions can be bundled. That is, during compile time, the compiler can encode one or more bits in the actual instruction operational code (op-code) to be utilized by the instruction decode/steering hardware. These bits can provide information to the decode hardware as to how the instruction may be bundled with other instructions. The predecode information, in effect, is employed as part of the instruction set architecture. However, the information needed by the decode hardware is processor dependent; therefore, such an encoding of bits can limit the flexibility of different processors to optimally execute op-code without a code recompile.
In one system, a dedicated predecoded bit is stored in the instruction cache and is used by decode hardware to steer instructions to either an integer arithmetic logic unit (ALU) or a floating point unit (FPU). See, for example, E. DeLano, W. Walker, J. Yetter, M. Forsyth, "A High Speed Superscalar PA-RISC Processor", IEEE, 1992, pp. 116-121.
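By way of illustration only, the use of a single predecoded steering bit may be sketched as follows. The instruction format (a 6-bit op-code in the upper bits of a 32-bit word), the set `FP_OPCODES`, and the function names are assumptions made for this example; they do not reflect the PA-RISC encoding or the implementation described in the cited reference.

```python
# Hypothetical op-codes treated as floating point in this sketch.
FP_OPCODES = frozenset({0x0C, 0x0E})


def predecode(word):
    """Compute the steering bit for one 32-bit instruction word."""
    opcode = (word >> 26) & 0x3F        # assumed 6-bit op-code field
    return 1 if opcode in FP_OPCODES else 0


def fill_line(words):
    """Model a cache fill: store each word together with its predecoded bit."""
    return [(word, predecode(word)) for word in words]


def steer(entry):
    """Steer a cached instruction using only its stored predecoded bit."""
    _word, fp_bit = entry
    return "FPU" if fp_bit else "ALU"
```

In this sketch the op-code is examined once, when the instruction is written into the cache, so that the steering decision at fetch time reads only the stored bit.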