1. Field of the Invention
The present invention relates generally to a system and method for predecoding information from instructions for execution in a superscalar computer. More particularly, the present invention relates to a system and method of predecoding instructions for superscalar-dependency information, as instructions are copied into cache, to increase the operating frequency of a superscalar processor.
2. Related Art
Processors used in conventional computer systems typically execute program instructions one at a time, in sequential order. The process of executing a single instruction involves several sequential steps. The first step generally involves fetching the instruction from a memory device. The second step generally involves decoding the instruction, and assembling any operands. The third step generally involves executing the instruction, and storing the results. Some processors are designed to perform each step in a single cycle of the processor clock. Alternatively, the processor may be designed so that the number of processor clock cycles per step depends on the particular instruction.
Modern computer systems commonly use an instruction cache to temporarily store blocks of instructions. Cache memories are buffers that hold information from main memory so that a processor can reach it quickly. The processor fetches instructions from the instruction cache. Caches work on the principle that a processor is likely, in the near future, to need information in the same vicinity as the information it is working on at present. If the processor finds the data it needs in cache (a cache hit), the access completes sooner, since cache memories tend to be faster than main memory. However, if the processor does not find what it needs in cache (a cache miss), then the block containing the missing data must be brought in from main memory.
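The hit and miss behavior described above can be illustrated with a minimal sketch of a direct-mapped cache. The class name, the line count, and the block size are assumptions chosen for illustration only, and the model tracks tags rather than actual instruction bytes.

```python
class DirectMappedCache:
    """Toy direct-mapped cache model; the sizes are illustrative assumptions."""

    def __init__(self, num_lines=4, block_size=16):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines  # one tag per line; None means empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a miss (the block is then filled)."""
        block = address // self.block_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: model bringing the whole block in from main memory.
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache()
# Sequential 4-byte fetches: the first access to each 16-byte block misses,
# and the remaining accesses to that block hit.
results = [cache.access(address) for address in range(0, 32, 4)]
```

Because neighboring addresses share a block, sequential fetches benefit from the locality principle: only the first access to each block pays the main-memory penalty.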
Fetching instructions from cache or memory is normally controlled by a program counter. The contents of the program counter typically indicate the starting address in cache from which the next instruction or instructions are to be fetched. Depending on the design of the processor, each instruction may have a fixed length, or a variable length. For example, a processor might be designed such that all instructions have a fixed length of 32 bits (4 bytes, or a "longword"). Fixed length instruction formats tend to simplify the instruction decode process.
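As a minimal sketch of fetching under control of a program counter, the following assumes fixed-length 32-bit instructions stored big-endian in a byte array. The function names and the byte order are assumptions made for illustration and are not taken from any particular processor.

```python
INSTR_BYTES = 4  # fixed 32-bit (4-byte) instruction length


def fetch(memory, pc):
    """Return the 32-bit instruction word starting at address pc."""
    if pc % INSTR_BYTES != 0:
        raise ValueError("program counter must be aligned to the instruction size")
    if pc + INSTR_BYTES > len(memory):
        raise IndexError("fetch past the end of memory")
    return int.from_bytes(memory[pc:pc + INSTR_BYTES], "big")  # byte order assumed


def fetch_sequence(memory, start_pc, count):
    """Fetch `count` sequential instructions and return them with the next PC."""
    pc = start_pc
    words = []
    for _ in range(count):
        words.append(fetch(memory, pc))
        pc += INSTR_BYTES  # with a fixed length, the next address is trivial
    return words, pc


memory = bytes.fromhex("00000001" "00000002" "00000003")
words, next_pc = fetch_sequence(memory, 0, 3)
```

With a fixed instruction length the next fetch address is known without inspecting the instruction, which is one reason such formats tend to simplify decoding.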
Modern computers commonly use a technique known as pipelining to improve performance. Pipelining involves the overlapping of the sequential steps of the execution process. For example, while the processor is performing the execution step for one instruction, it might simultaneously perform the decode step for a second instruction, and perform a fetch of a third instruction. Pipelining can thus decrease the execution time for a sequence of instructions. Superpipelined processors attempt to further improve performance by overlapping the sub-steps of the three sequential steps discussed above.
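The benefit of overlapping the steps can be estimated with a simple cycle count. The sketch below assumes an idealized pipeline in which every step takes exactly one clock cycle and no stalls occur; real processors will generally do worse. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages=3):
    """Each instruction completes all of its steps before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=3):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions, n_stages=3):
    """Ratio of unpipelined to pipelined cycle counts under the same assumptions."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )
```

For ten instructions and the three steps of fetch, decode, and execute, the unpipelined count is 30 cycles and the idealized pipelined count is 12 cycles.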
Another technique for improving performance involves executing two or more instructions in parallel. Processors which utilize this technique are generally referred to as superscalar processors. The ability of a superscalar processor to execute two or more instructions simultaneously depends upon the particular instructions being executed. For example, two instructions which both require use of the same, limited processor resource (such as the floating point unit) cannot be executed simultaneously. This type of conflict is known as a resource conflict. Such instructions cannot be combined or "bundled" with each other for simultaneous execution, but must be executed alone, or bundled with other instructions. Additionally, an instruction which depends on the result produced by execution of a previous instruction cannot be bundled with that previous instruction. The instruction which depends on the result of the previous instruction is said to have a data dependency on the previous instruction. Similarly, an instruction may have a procedural dependency on a previous instruction, which prevents the two instructions from being bundled. For example, an instruction which follows a branch instruction cannot be bundled with the branch instruction, since its execution depends on whether the branch is taken.
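The three bundling restrictions described above can be sketched as a pairwise check. The `Instr` record, the `can_bundle` function, the register names, and the assumption that only the floating point unit is a single shared resource are all hypothetical. Only the dependency on a previous result is modeled as a data dependency.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str                    # functional unit required, e.g. "alu" or "fpu"
    dest: Optional[str]          # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read
    is_branch: bool = False


def can_bundle(first, second, single_units=frozenset({"fpu"})):
    """Return (ok, reason) for issuing `second` together with `first`."""
    # Resource conflict: both need a unit of which only one copy is assumed.
    if first.unit == second.unit and first.unit in single_units:
        return False, "resource conflict"
    # Data dependency: the second instruction reads the first one's result.
    if first.dest is not None and first.dest in second.srcs:
        return False, "data dependency"
    # Procedural dependency: the second instruction follows a branch.
    if first.is_branch:
        return False, "procedural dependency"
    return True, "ok"


fadd = Instr("fadd", "fpu", "f2", ("f0", "f1"))
fmul = Instr("fmul", "fpu", "f5", ("f3", "f4"))
add = Instr("add", "alu", "r3", ("r1", "r2"))
sub = Instr("sub", "alu", "r5", ("r3", "r4"))
beq = Instr("beq", "alu", None, ("r1", "r2"), is_branch=True)
mov = Instr("mov", "alu", "r6", ("r7",))
```

In this sketch `fadd` and `fmul` collide on the floating point unit, `sub` reads the `r3` value produced by `add`, and `mov` cannot issue with the branch `beq`, while `add` and `mov` are free to be bundled.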
The superscalar processor must therefore be able to determine whether two or more given instructions can be bundled. Since this determination cannot be made without first decoding the instructions, the determination is commonly made by the instruction decode unit as instructions are fetched from cache. Advanced compiler techniques are used to assist the instruction decode unit in determining "on the fly" (that is, as instructions are fetched from cache) whether two or more instructions can be executed in parallel by the execution unit.
As the processor decodes instructions from cache, many possible penalties can be incurred. One such penalty occurs during an instruction cache miss. A cache miss delays execution significantly, since instructions must be fetched from main memory (which is much slower than cache) and then decoded. Additionally, decoding "on the fly" significantly slows the execution unit, since the execution unit must wait for the instruction decode unit (with the aid of compilers and software in some systems) to decide whether there are any data dependencies, procedural dependencies and/or resource conflicts before instructions can be dispatched for optimal simultaneous execution by the execution unit.
To speed up execution, some compiler systems attempt to gather information regarding the feasibility of grouping instructions for simultaneous dispatch to the execution unit before the instructions are fetched from the instruction cache. This can simplify the instruction decode hardware and increase its speed.
To gather information before instructions are fetched from cache, some conventional superscalar processor system architectures rely on software compilers. When generating machine instructions from source code, the compiler determines, in advance of fetching from cache, whether groups of instructions can be dispatched simultaneously to the processor functional units. These conventional systems then encode one or more bits in the actual instruction operational code (op-code), where the encoded information can be utilized by the instruction decode hardware.
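A minimal sketch of such compiler-side encoding follows. It assumes a 32-bit instruction word in which the most significant bit is reserved as a one-bit hint meaning "may be dispatched together with the next instruction". The bit position, the meaning of the hint, and the function names are hypothetical and do not describe any particular instruction set.

```python
WORD_MASK = 0xFFFFFFFF
HINT_BIT = 1 << 31  # hypothetical: top bit of the 32-bit word carries the hint


def encode_hint(word, can_issue_with_next):
    """Compiler side: set or clear the hint bit in an instruction word."""
    word &= WORD_MASK & ~HINT_BIT  # clear any previous hint
    if can_issue_with_next:
        word |= HINT_BIT
    return word


def decode_hint(word):
    """Decode-hardware side: read the hint without analyzing the instruction."""
    return bool(word & HINT_BIT)


plain = 0x12345678
hinted = encode_hint(plain, True)
```

In this sketch the decode side needs only a single bit test, but the hint bit is no longer available for any other purpose in the instruction encoding.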
There are a number of disadvantages associated with the aforementioned "predecode" techniques. First, the predecode information becomes part of the instruction set architecture. This means that every possible processor implementation of this architecture must interpret the information identically to have old code perform optimally on new machines (in other words, to maintain object-code compatibility). Therefore, each processor implementation sacrifices the flexibility to optimize the number and encoding of the predecode bits in the op-code.
Second, performance improvements in superscalar instruction execution can only be realized on code which was generated with compilers that have been modified to predecode instructions and encode the op-code bits correctly. This is a disadvantage in cases where existing object code does not utilize the feature. Additionally, this adds complexity to compiler software.
Third, the aforementioned predecode techniques require using bits in the actual instruction op-code. This reduces the amount of information that can otherwise be encoded, severely restricting how many bits of predecoded information can practically be used by the system.
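The cost of reserving op-code bits can be made concrete with a short calculation. The sketch below assumes a fixed-length instruction word and hypothetical function names. Each bit given over to predecode information halves the number of distinct encodings left for everything else.

```python
def remaining_encodings(word_bits, hint_bits):
    """Number of distinct encodings left after reserving hint_bits of the word."""
    if not 0 <= hint_bits <= word_bits:
        raise ValueError("hint_bits must be between 0 and word_bits")
    return 2 ** (word_bits - hint_bits)


def fraction_remaining(word_bits, hint_bits):
    """Fraction of the original encoding space that is still available."""
    return remaining_encodings(word_bits, hint_bits) / 2 ** word_bits
```

Under these assumptions, reserving two bits of a 32-bit word leaves one quarter of the original encoding space, and reserving four bits leaves one sixteenth.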