1. Field of Invention
The present invention relates generally to superscalar microprocessors. More particularly, the present invention relates to a method and system for dynamic dependency monitor and control.
2. Related Art
In order to achieve high performance, superscalar microprocessors may execute multiple instructions per clock cycle. Moreover, storage devices such as registers or arrays capture their values according to the clock cycle. In an exemplary embodiment, a storage device captures a value on the rising or falling edge of a clock signal that defines the clock cycle, and the storage device then stores the value until the following rising or falling edge of the clock signal.
Although instructions may be processed in any number of stages, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the executed results in a destination specified in the instruction. Furthermore, each instruction may be processed in a pipelined fashion in logic circuits herein referred to as “instruction processing pipelines”.
A superscalar microprocessor receives instructions in order, and although a compiler may reorder the instructions of a program, the dependencies among the instructions must still be maintained. Whereas in-order instruction execution guarantees the integrity of the original program, out-of-order execution may alter the intended functionality of the original program. For example, a dependency problem may occur if the instructions shown below were executed out of order:
    add r0, s1, s2    Instruction 1
    mul O, s3, r0     Instruction 2

wherein the first instruction adds the values stored in a first source operand s1 and a second source operand s2 and stores the sum in a destination temporary register r0, and the second instruction multiplies the values stored in a third source operand s3 and the temporary register r0 and stores the product in an output register O. As referred to herein, a source operand is a value operated upon by the instruction and a destination operand is the result of the instruction. In the example shown above, the second instruction requires a source operand (r0) whose value is determined by the first instruction; therefore, the second instruction is said to have a dependency on the first and cannot be executed until the first instruction is fully executed. In the example above, assuming a pipeline latency of five cycles, the microprocessor cannot begin executing the second instruction until five cycles after the first instruction is launched.
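The dependency in the two-instruction example can be illustrated with a minimal Python sketch. It is not the method of the present invention; it assumes a hypothetical single pipeline that issues at most one instruction per cycle, in order, with a fixed latency, and the names `Instr` and `earliest_issue_cycles` are introduced here for illustration only.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source operands."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def earliest_issue_cycles(program: List[Instr], latency: int) -> List[int]:
    """Return the cycle at which each instruction can issue.

    Assumes in-order issue, one instruction per cycle, and that a result
    becomes readable `latency` cycles after its producer issues.
    """
    ready_at: Dict[str, int] = {}  # register -> cycle its value is available
    issue_cycles: List[int] = []
    next_free = 0  # earliest cycle the issue slot is free
    for ins in program:
        # An instruction waits for the issue slot and for every source
        # operand written by an earlier instruction (read-after-write).
        start = max([next_free] + [ready_at.get(s, 0) for s in ins.srcs])
        issue_cycles.append(start)
        ready_at[ins.dest] = start + latency
        next_free = start + 1
    return issue_cycles


program = [
    Instr("add", "r0", ("s1", "s2")),  # Instruction 1
    Instr("mul", "O", ("s3", "r0")),   # Instruction 2 reads r0
]
print(earliest_issue_cycles(program, latency=5))  # [0, 5]
```

Under these assumptions the multiply issues at cycle 5, five cycles after the add, leaving four issue slots idle in between.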
One conventional method employed to solve the dependency problem illustrated above is to execute the instructions with a multi-thread method. In an exemplary embodiment, a number of single instruction, multiple data (SIMD) processors are employed wherein each SIMD processor processes a distinct data stream of the same instruction. The example program shown below illustrates an SIMD approach using six threads to process Instruction 1 and Instruction 2 shown above, assuming a five-cycle arithmetic logic unit (ALU) latency:
    add str0.r0, str0.s1, str0.s2    Data Stream 1
    add str1.r0, str1.s1, str1.s2    Data Stream 2
    add str2.r0, str2.s1, str2.s2    Data Stream 3
    add str3.r0, str3.s1, str3.s2    Data Stream 4
    add str4.r0, str4.s1, str4.s2    Data Stream 5
    add str5.r0, str5.s1, str5.s2    Data Stream 6
    mul str0.O, str0.s3, str0.r0     Data Stream 1
    mul str1.O, str1.s3, str1.r0     Data Stream 2
    mul str2.O, str2.s3, str2.r0     Data Stream 3
    mul str3.O, str3.s3, str3.r0     Data Stream 4
    mul str4.O, str4.s3, str4.r0     Data Stream 5
    mul str5.O, str5.s3, str5.r0     Data Stream 6

In the example shown directly above, six data streams are used to process Instruction 1 and Instruction 2. Instruction 2 depends on Instruction 1 due to its use of register r0, and therefore, within each data stream, Instruction 2 must wait at least five cycles after Instruction 1 begins before proceeding to execution. As shown in the example above, dependency problems do not arise if the number of threads exceeds the number of latency cycles. However, ALU latency may be significant in various systems, and increasing the number of threads is costly because each thread requires additional hardware to incorporate components such as input buffers and temporary registers.
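The effect of the thread count on stalls can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not a description of any particular hardware: one instruction issues per cycle, the add of every data stream issues before any multiply, and a multiply may issue no earlier than `latency` cycles after the add of the same stream. The function name `interleaved_schedule` is hypothetical.

```python
from typing import Dict, Tuple


def interleaved_schedule(num_threads: int, latency: int) -> Tuple[Dict[int, int], int]:
    """Model issuing 'add' for every stream, then 'mul' for every stream.

    Returns (issue cycle of each stream's mul, total stall cycles).
    Assumes one issue slot per cycle and a fixed ALU latency.
    """
    cycle = 0
    add_issue: Dict[int, int] = {}
    for t in range(num_threads):
        add_issue[t] = cycle  # one add per cycle, stream after stream
        cycle += 1

    stalls = 0
    mul_issue: Dict[int, int] = {}
    for t in range(num_threads):
        ready = add_issue[t] + latency  # when this stream's r0 is available
        if ready > cycle:
            stalls += ready - cycle  # idle cycles waiting on r0
            cycle = ready
        mul_issue[t] = cycle
        cycle += 1
    return mul_issue, stalls


_, stalls_six = interleaved_schedule(num_threads=6, latency=5)
_, stalls_two = interleaved_schedule(num_threads=2, latency=5)
print(stalls_six)  # 0
print(stalls_two)  # 3
```

In this model, six streams keep the issue slot busy long enough to hide the five-cycle latency, whereas two streams leave three idle cycles before the first multiply can issue.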