1. Field of the Invention
This invention relates to computer architectures and more particularly to identifying data dependencies among instructions.
2. Description of the Related Art
To improve performance, processors have relied on parallelism. In particular, processor architectures have developed in which multiple instructions are executed in parallel on multiple execution units, such as integer units, floating point units, etc.
Typically, instructions direct computer operations by causing an operation to occur on data. The operation may be, e.g., an arithmetic operation, a load/store operation, or a logical operation. The instruction specifies the operation as well as the operand(s) affected by the operation. The instruction specifies an operand by describing its location in the computer. An operand may be located in a register, in which case a register within the processor contains the data on which the instruction operates. An operand may also be located in memory. An operand may also be immediate, in which case the data is contained in the instruction itself. A source operand value is a value upon which the instruction operates, and a destination operand is a location in which the results of the instruction are stored.
One problem with executing instructions in parallel is that operands required to complete the operation specified by an instruction may not be available. For example, assume an instruction B uses an operand whose value is determined by a previous instruction A, and instruction A has not yet completed. In that circumstance, instruction B has to wait for the operand value to be determined by instruction A and therefore cannot be executed in parallel with instruction A. For a set of three instructions (A, B, C), the following patterns of dependency are possible: no dependency; B depends on A's result; C depends on A's result; both B and C depend on A's result; C depends on B's result; C depends on A's and B's results; C depends on B's result, which depends on A's result, i.e., serial dependency.
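The dependency patterns among three instructions can be illustrated with a short sketch. The following Python fragment is illustrative only and is not part of any described system; the names `EDGES`, `dependency_patterns`, and `describe` are hypothetical. It assumes program order A, B, C, so that a later instruction may depend on an earlier one but not the reverse. Enumerating every combination of the three possible direct dependencies yields eight combinations: the seven patterns named above, plus the combination in which all three direct dependencies are present at once.

```python
from itertools import combinations

# Possible direct dependencies as (producer, consumer) pairs,
# assuming program order A, B, C (an illustrative assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C")]


def dependency_patterns():
    """Return every subset of the three possible direct dependencies."""
    patterns = []
    for size in range(len(EDGES) + 1):
        for subset in combinations(EDGES, size):
            patterns.append(frozenset(subset))
    return patterns


def describe(pattern):
    """Render one pattern as text, e.g. 'B depends on A; C depends on B'."""
    if not pattern:
        return "no dependency"
    return "; ".join(
        f"{consumer} depends on {producer}"
        for producer, consumer in sorted(pattern)
    )


if __name__ == "__main__":
    for pattern in dependency_patterns():
        print(describe(pattern))
```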
In a computer system that executes multiple operations per machine cycle, either software (i.e., the compiler) or hardware control logic determines those functional operations that may be executed in parallel. In the example above for instructions A, B, and C, if there are no dependencies among A, B, and C, all three instructions may be executed in parallel (assuming there are three execution units available). If, e.g., B depends on A's result, then A and C may be executed in parallel, with B executed afterward.
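The grouping of instructions into sets that may execute together can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular compiler or control logic: the function `schedule` and its parameters are hypothetical, every instruction is assumed to complete in one cycle, and `units` stands for the number of available execution units.

```python
def schedule(instructions, depends_on, units=3):
    """Group instructions into cycles.

    instructions: names in program order.
    depends_on:   maps a name to the set of names whose results it needs.
    units:        assumed number of execution units available per cycle.
    An instruction is issued only after everything it depends on has
    completed in an earlier cycle (one-cycle latency is assumed).
    """
    done = set()
    remaining = list(instructions)
    cycles = []
    while remaining:
        ready = [
            name for name in remaining
            if depends_on.get(name, set()) <= done
        ][:units]
        if not ready:
            raise ValueError("no instruction is ready (cyclic dependency or no units)")
        cycles.append(ready)
        done.update(ready)
        remaining = [name for name in remaining if name not in ready]
    return cycles


if __name__ == "__main__":
    # No dependencies: all three instructions share one cycle.
    print(schedule(["A", "B", "C"], {}))
    # B depends on A: A and C execute together, B in the next cycle.
    print(schedule(["A", "B", "C"], {"B": {"A"}}))
```

With a serial dependency (B on A, and C on B), the same sketch places each instruction in its own cycle, so no parallelism is obtained.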
In Very Long Instruction Word (VLIW) computer architectures, a compiler (i.e., software) determines those operations that can be executed in parallel when translating a high-level source language such as C++ into machine instructions suitable for execution. The compiler accounts for the data dependencies in the compiled code. When the executable code is presented to the VLIW processor, the VLIW processor executes the code without regard to data dependencies. Thus, one advantage of a VLIW architecture is that the hardware does not have to check for data dependencies among instructions.
Another way to account for data dependencies in prior art systems was to simply execute the code in order without parallelism. In that way, data dependency problems are eliminated, but so are the advantages of parallel execution.
In addition to executing operations in parallel, another way that computer architectures improve performance is to overlap the execution steps of different instructions using pipelining. In pipelining, the various steps of instruction execution are performed by independent units called pipeline stages. Pipeline stages are generally separated by clocked registers, and the steps of different instructions are executed independently in different pipeline stages. Thus, one instruction may be fetched, another decoded, and a third instruction executed, all at the same time in a pipelined architecture. Overlapping various stages of instruction execution reduces the average number of cycles required per instruction, i.e., it increases throughput, but it does not reduce the total amount of time required to execute any single instruction.
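The effect of pipelining on cycle counts can be illustrated with simple arithmetic. The sketch below is a simplified model and rests on assumptions not stated above: every stage takes exactly one cycle, there are no stalls, and the three stages are fetch, decode, and execute. The function names are hypothetical.

```python
def unpipelined_cycles(n, stages):
    """Cycles to run n instructions when each finishes before the next starts."""
    return n * stages


def pipelined_cycles(n, stages):
    """Cycles to run n instructions in an ideal pipeline with no stalls.

    The first instruction needs `stages` cycles; each later instruction
    completes one cycle after its predecessor.
    """
    if n == 0:
        return 0
    return stages + (n - 1)


if __name__ == "__main__":
    stages = 3  # fetch, decode, execute (assumed)
    for n in (1, 3, 100):
        total = pipelined_cycles(n, stages)
        print(n, unpipelined_cycles(n, stages), total, total / n)
```

In this model, each individual instruction still occupies the pipeline for three cycles, while the average number of cycles per instruction approaches one as the number of instructions grows.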
Superscalar processor architectures also provide greater efficiencies by concurrently executing multiple instructions. The term “superscalar” describes a computer architecture that includes concurrent execution of scalar instructions. Scalar instructions are the type of instructions typically found in general purpose microprocessors. However, unlike VLIW architectures, the compiler program for a superscalar processor translates source code into an executable file but does not need to identify and resolve data dependencies. Instead, control logic determines if there are data dependencies which constrain parallel execution of instructions. Conceptually, for a given window of instructions, e.g., 8 instructions, hardware detects data dependencies by checking to see if any operand depends on an output of a previous instruction. Note that although instructions may be executed out of order, instructions are retired in program order.
Typical superscalar computer architectures hold the execution of an instruction that needs data that is not yet available, either because the data has not been fetched or because the data is the result of a previous instruction that has not finished executing. If the processor cannot find an instruction to execute that has no dependencies (or if it has run out of resources to track dependencies), the processor stalls until the data arrives, thus creating a pipeline “bubble”.
Superscalar processors generally devote a significant processor area to circuitry used to identify data dependencies among a set of instructions so that the processor can appropriately execute instructions. Such dependency hardware is rather complex since there are multiple data dependencies possible between any two instructions. A typical reduced instruction set computer (RISC) instruction commonly used in superscalar implementations has two input operands and one output value. The number of possible dependencies among the instructions in an instruction window grows roughly quadratically with the number of instructions, since each additional instruction has to be compared with every other instruction in the group. Complexity is also determined by the number of instructions that the processor attempts to decode, issue, and complete at the same time (e.g., in a single cycle). In one approach, dependency is checked by comparing the addresses of the source registers of each instruction to the addresses of the destination registers of each previous instruction in the group. For example, if instruction B reads a value from a register that is written to by a previous instruction A, then instruction B is dependent upon instruction A and instruction B cannot start until instruction A has finished.
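The register comparison approach, and the way its cost grows with window size, can be sketched in software. The fragment below is a simplified illustration, not a description of actual dependency hardware: the names `Instr` and `find_dependencies`, the register names, and the instruction names I1 through I3 are hypothetical. Each instruction is assumed to have one destination register and two source registers. The sketch flags every earlier writer of a source register, whereas a real design would also have to consider which writer is the most recent and other kinds of hazards.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def find_dependencies(window):
    """Compare each source register with the destination register of
    every previous instruction in the window.

    Returns (dependencies, comparisons), where each dependency is a
    (later, earlier, register) triple.
    """
    deps = []
    comparisons = 0
    for j, later in enumerate(window):
        for earlier in window[:j]:
            for src in later.srcs:
                comparisons += 1
                if src == earlier.dest:
                    deps.append((later.name, earlier.name, src))
    return deps, comparisons


# Hypothetical three-instruction window.
window = [
    Instr("I1", "r1", ("r2", "r3")),
    Instr("I2", "r4", ("r1", "r5")),
    Instr("I3", "r6", ("r1", "r4")),
]

if __name__ == "__main__":
    deps, comparisons = find_dependencies(window)
    print(deps)
    print(comparisons)
```

Under these assumptions, a window of n instructions requires n(n-1) register comparisons: 6 for the three-instruction window above and 56 for a window of 8 instructions.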
It would be advantageous to execute instructions without paying the overhead required to check for data dependencies in hardware and without having to provide compiled code in which data dependencies have already been accounted for, as in the VLIW approach.