1. Field of the Invention
The present invention relates generally to a system and method for fetching instructions from a memory system for execution in a computer. More particularly, the present invention relates to a system and method for selecting and buffering pairs of instructions from cache for simultaneous execution by a central processing unit.
2. Related Art
Processors used in conventional computer systems typically execute program instructions one at a time, in sequential order. The process of executing a single instruction involves several sequential steps. The first step generally involves fetching the instruction from a memory device. The second step generally involves decoding the instruction, and assembling any operands. The third step generally involves executing the instruction, and storing the results. Some processors are designed to perform each step in a single cycle of the processor clock. Alternatively, the processor may be designed so that the number of processor clock cycles per step depends on the particular instruction.
Modern computer systems commonly use an instruction cache to temporarily store blocks of instructions before execution. Instructions are then fetched from the instruction cache by the processor. The fetching process is normally controlled by a program counter. The contents of the program counter typically indicate the starting address in cache or memory from which the next instruction or instructions are to be fetched. Depending on the design of the processor, each instruction may have a fixed length or a variable length. For example, a processor might be designed such that all instructions have a fixed length of 32 bits (4 bytes, or a "longword"). Fixed-length instruction formats tend to simplify the instruction decode process.
Computer systems are commonly designed such that the processor can only fetch instructions from memory or cache in blocks which fall on certain memory boundaries. For example, a computer system might be designed such that all instruction fetches consist of 32-bit reads from longword-aligned locations in memory. Memory for such a system can be thought of as being divided into fixed, 32-bit blocks, which can only be accessed by the processor one at a time. Thus, the location within memory where an instruction resides can affect the time required to fetch the instruction. In the example system above, if a 32-bit instruction does not fall on a longword boundary (and is thus "misaligned"), the processor must fetch two longwords in order to obtain the instruction.
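The fetch cost described for the example system can be illustrated with a short sketch. It assumes byte addressing and 4-byte longword fetch blocks; the function names and the sample addresses are hypothetical and are chosen only for illustration.

```python
LONGWORD_BYTES = 4  # one 32-bit fetch block, as in the example system


def is_longword_aligned(address: int) -> bool:
    """Return True if the byte address falls on a longword boundary."""
    return address % LONGWORD_BYTES == 0


def fetches_required(address: int, length: int = LONGWORD_BYTES) -> int:
    """Count the aligned longword reads needed to obtain `length` bytes at `address`."""
    first_block = address // LONGWORD_BYTES
    last_block = (address + length - 1) // LONGWORD_BYTES
    return last_block - first_block + 1


if __name__ == "__main__":
    # 0x100 lies on a longword boundary; 0x102 straddles two longwords.
    for addr in (0x100, 0x102):
        print(hex(addr), is_longword_aligned(addr), fetches_required(addr))
```

Under these assumptions, an aligned 32-bit instruction is obtained with one read, while the same instruction placed two bytes past a boundary requires two reads.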
Modern computers commonly use a technique known as pipelining to improve performance. Pipelining involves the overlapping of the sequential steps of the execution process. For example, while the processor is performing the execution step for one instruction, it might simultaneously perform the decode step for a second instruction, and perform a fetch of a third instruction. Pipelining can thus decrease the execution time for a sequence of instructions. Superpipelined processors attempt to further improve performance by overlapping the sub-steps of the three sequential steps discussed above.
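The benefit of overlapping the three steps can be sketched with an idealized cycle count. The sketch assumes that each step takes exactly one clock cycle and that the pipeline never stalls, which is a simplification; the function names are hypothetical.

```python
STAGES = ("fetch", "decode", "execute")


def sequential_cycles(n_instructions: int) -> int:
    """Cycles when each instruction finishes all steps before the next one starts."""
    return n_instructions * len(STAGES)


def pipelined_cycles(n_instructions: int) -> int:
    """Cycles when a new instruction enters the pipeline every cycle (no stalls)."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + (n_instructions - 1)


def pipeline_timeline(n_instructions: int) -> list:
    """For each cycle, list which instruction occupies which stage."""
    timeline = []
    for cycle in range(pipelined_cycles(n_instructions)):
        active = []
        for stage_index, stage in enumerate(STAGES):
            instr = cycle - stage_index  # instruction that entered `stage_index` cycles ago
            if 0 <= instr < n_instructions:
                active.append(f"I{instr}:{stage}")
        timeline.append(active)
    return timeline


if __name__ == "__main__":
    for cycle, active in enumerate(pipeline_timeline(3)):
        print(cycle, active)
```

In this idealized model, three instructions take nine cycles without overlap and five cycles with overlap; in the third cycle, all three steps are active at once on different instructions.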
Another technique for improving performance involves executing two or more instructions in parallel. Processors which utilize this technique are generally referred to as superscalar processors. The ability of a superscalar processor to execute two or more instructions simultaneously depends upon the particular instructions being executed. For example, two instructions which both require use of the same, limited processor resource (such as the floating point unit) cannot be executed simultaneously. This type of conflict is known as a resource conflict. Such instructions cannot be combined or "bundled" with each other for simultaneous execution, but must be executed alone, or bundled with other instructions. Additionally, an instruction which depends on the result produced by execution of a previous instruction cannot be bundled with that previous instruction. The dependent instruction is said to have a data dependency on the previous instruction. Similarly, an instruction may have a procedural dependency on a previous instruction, which prevents the two instructions from being bundled. For example, an instruction which follows a branch instruction cannot be bundled with the branch instruction, since its execution depends on whether the branch is taken.
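The three bundling restrictions described above can be sketched as a simple check on a pair of decoded instructions. This is an illustrative model only, not the decode logic of any particular processor: the `Instruction` fields, the `SINGLE_COPY_UNITS` set, and the register and unit names are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: functional units of which the processor has only one copy.
SINGLE_COPY_UNITS = {"fpu"}


@dataclass(frozen=True)
class Instruction:
    name: str
    unit: str                      # functional unit the instruction needs
    dest: Optional[str] = None     # register written, if any
    sources: Tuple[str, ...] = ()  # registers read
    is_branch: bool = False


def bundle_conflict(first: Instruction, second: Instruction) -> Optional[str]:
    """Return the reason `second` cannot be bundled with `first`, or None."""
    if first.is_branch:
        # Whether `second` executes depends on the branch outcome.
        return "procedural dependency"
    if first.dest is not None and first.dest in second.sources:
        # `second` needs the result that `first` has not yet produced.
        return "data dependency"
    if first.unit == second.unit and first.unit in SINGLE_COPY_UNITS:
        # Both need the same limited resource in the same cycle.
        return "resource conflict"
    return None


def can_bundle(first: Instruction, second: Instruction) -> bool:
    return bundle_conflict(first, second) is None


if __name__ == "__main__":
    fadd = Instruction("fadd", "fpu", "f2", ("f0", "f1"))
    fmul = Instruction("fmul", "fpu", "f5", ("f3", "f4"))
    add = Instruction("add", "alu", "r3", ("r1", "r2"))
    print(bundle_conflict(fadd, fmul), can_bundle(add, fmul))
```

The check requires the register and unit fields of both instructions, which is consistent with the determination being made only after the instructions have been decoded.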
The superscalar processor must therefore be able to determine whether two or more given instructions can be bundled. Since this determination cannot be made without first decoding the instructions, the determination is commonly made by the instruction decode unit of the processor.
Computer systems that are capable of simultaneous execution of a bundle of instructions are especially vulnerable to instruction misalignment. Even if two instructions can otherwise be bundled for simultaneous execution, if the two instructions do not fall on the necessary boundary within cache or memory, the two instructions cannot be fetched simultaneously, and cannot be executed simultaneously. Thus, misalignment of bundles of instructions can prevent the performance benefits of a superscalar processor from being achieved.
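The effect of bundle misalignment can be sketched as follows. The sketch assumes fixed 32-bit instructions and a fetch unit that reads one aligned 64-bit block per access; both the block size and the function name are assumptions made for illustration, since the required boundary depends on the particular system.

```python
INSTRUCTION_BYTES = 4   # fixed 32-bit instructions
FETCH_BLOCK_BYTES = 8   # assumption: one aligned 64-bit block per fetch


def pair_in_one_fetch(first_address: int) -> bool:
    """Return True if two consecutive instructions lie in the same aligned fetch block."""
    second_address = first_address + INSTRUCTION_BYTES
    return first_address // FETCH_BLOCK_BYTES == second_address // FETCH_BLOCK_BYTES


if __name__ == "__main__":
    # A pair starting at 0x100 shares a block; a pair starting at 0x104 spans two blocks.
    for addr in (0x100, 0x104):
        print(hex(addr), pair_in_one_fetch(addr))
```

Under these assumptions, a pair that begins on the block boundary is available after a single access, while a pair that begins one instruction later spans two blocks and cannot be fetched simultaneously, even if the two instructions are otherwise free of conflicts.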
Some superscalar systems are designed to allow execution of aligned as well as misaligned instruction bundles. Typically, these conventional systems incur penalty cycles when switching from aligned bundles to misaligned bundles and vice versa. Additionally, these computer systems suffer penalty cycles on transitions from the execution of a single instruction to the execution of a bundle of instructions. This reduces the speed and overall performance of the processor.
Therefore, what is needed is a computer system and method that provides flexibility in switching between execution of single instructions and execution of bundles of instructions, and which incurs no penalty cycle when switching between single instructions and bundles of instructions, or between aligned bundles and misaligned bundles.
One other area that needs improvement in selecting pairs of instructions for simultaneous execution in a processor pertains to instruction buffering. Processors commonly use buffers to receive and temporarily store instructions fetched from cache for execution. Currently, computer system instruction pre-fetch buffers tend to be large and complicated, commonly requiring storage space for more than two 64-bit (doublelongword) instruction entries. Such large buffering designs are currently needed to alleviate the problems associated with executing bundles which fall across alignment boundaries.
Therefore, in order to solve this problem, what is needed is a computer pre-fetch buffering system that requires a minimum amount of space, less than or equal to the maximum instruction length.