Current processor technology allows different execution units (pipes) to operate in parallel. As the number of gates on a silicon chip increases, a major problem that processor architects face is the limited parallelism within the instruction stream. Such parallelism is limited by data dependencies as well as by control dependencies.
For a better understanding of the terms data dependency and control dependency, consider the following exemplary computer program segment, which performs a certain computation on non-overlapping arrays of variables:
______________________________________
loop:   A[j + 1] := A[j] + j
        B[j] := 4*B[j]
        j := j + 1
        i := 1 + j
        while (j != 1000) go to loop
        k := k + 1
        ...
______________________________________
From the processor's point of view, the program can be viewed as a stream of instructions (architectural instructions or micro-machine instructions) that can be divided into three main categories: (1) arithmetic instructions, which aim to modify values held in the processor's storage areas, such as registers; (2) control instructions, which aim to determine which instruction of the program's area will be executed next; and (3) load/store instructions, which aim to move data between the processor's local storage area (such as registers) and the memory. The goal of all modern processors is to perform as many operations as possible in parallel while preserving the correctness of the execution; i.e., the user should not be able to distinguish between a program running on such a parallel machine and the same program running on a machine that performs the operations sequentially.
Modern processors analyze the stream of instructions the processor should execute and determine which operations the processor can perform in parallel. To this end, the processor needs to examine, for each operation, whether all of the inputs needed for the operation are available at that time. An input may be unavailable for different reasons, such as that the value has not yet been calculated (by another instruction) or that the fetching of the data into the processor's area has not yet been completed. All the operations that have been found to be ready can be executed in parallel (see Mike Johnson, Superscalar Microprocessor Design). The average number of operations the processor, or a system, can execute in parallel is termed instruction-level parallelism (ILP). As for the above simple example, intuitively, the instruction B[j]:=4*B[j], albeit placed after the instruction A[j+1]:=A[j]+j in the source code, can be marked "ready" at the same time as the latter, since the outcome of A[j+1]:=A[j]+j (i.e. modifying the value of a certain cell in array A) does not affect B[j]:=4*B[j] (i.e. modifying the value of a certain cell in array B). Thus, the processor can execute them in parallel. In other words, this scenario illustrates the lack of data dependency.
Conversely, the processor must execute the instructions j:=j+1 and i:=1+j exactly in the order in which they appear (with respect to each other), since the value of j, as determined by the former instruction, affects the value of i in the latter instruction. Put differently, this scenario illustrates data dependency. As will be explained in greater detail below, data dependencies are divided into three main categories, i.e. output data dependency, anti data dependency and true data dependency (Mike Johnson, Superscalar Microprocessor Design).
In order to take advantage of independent operations, and in order to speed up the execution rate, current processor technology integrates mechanisms for exploiting parallelism into the multi-stage (pipeline) architecture, as illustrated e.g. in FIG. 1.
In the example of FIG. 1, several instructions are fetched in parallel into an instruction buffer (in the fetch stage); they are then decoded, the dependencies are explored, and the instructions are placed in a buffer termed the reorder buffer. The issue stage sends to the execution stage those instructions which are ready, i.e., those whose input data are available and which do not depend on the outcome of other operation(s); thus, independent instructions can be executed in parallel. The outcome of the execution can update register(s), memory location(s), hardware flag(s), the program counter, and/or operate on the reorder buffer so that instructions marked as stalled are moved into the ready-to-be-executed state.
The amount of parallelism that exists in the execution stream depends on the size of the instruction window, i.e., the number of instructions that are classified by the system as ready to be executed. Control instructions and hardware complexity limit the size of the instruction window significantly. It was reported (Hennessy and Patterson, Computer Architecture: A Quantitative Approach) that as many as 20% of the instructions the processor executes can be control instructions, bringing the average size of the instruction window available between control instructions to about four instructions. It is clear that without enlarging the size of that window, the amount of parallelism the architecture can benefit from is very limited. Control instructions usually appear as different types of conditional or non-conditional branch instructions. (For simplicity, conditional or non-conditional branch instructions will also be referred to, in short, as branch or control instructions.) An example of a conditional branch is: "while (j!=1000) go to loop". A conditional branch instruction includes, as a rule, a logical condition and a destination address. When the condition is met, the next instruction for execution resides at the destination address. If, on the other hand, the condition is not met, the next instruction for execution follows the branch instruction. (Some architectures, such as the VAX.TM., introduced more sophisticated branch instructions that use tables of addresses to control the instruction flow, but these instructions have the same nature from the perspective of the present invention.) When executing a sequential set of instructions that does not include branch instructions, the issue of control dependency is irrelevant, since the address of the next instruction to be executed can be unequivocally resolved.
In contrast, the next instruction for execution cannot be unequivocally predicted if branches are encountered, since at the prediction stage the processor has not yet resolved the logical condition and the target-address calculation associated with the branch instruction. Accordingly, it is unclear whether the jump occurs (in which case the next address for execution resides at the destination address of the branch) or, alternatively, the condition is not met (in which case the next instruction for execution is the one that follows the conditional branch instruction).
Bearing this in mind, branches (control instructions) can significantly reduce the amount of parallelism the processor can utilize (J. Lee and A. J. Smith, Branch Prediction Strategies and Branch Target Buffer Design, IEEE Computer, January 1984, pp. 6-22; and T.-Y. Yeh and Y. N. Patt, Adaptive Training Branch Prediction, The 24th ACM/IEEE International Symposium and Workshop on Microarchitecture). When a branch instruction is evaluated in the decode stage, the decode stage and the fetch stage cannot continue their operation until the branch condition is resolved and the address of the next instruction to be fetched and decoded is calculated.
This drawback is clearly illustrated in the specified exemplary computer program segment. Thus, encountering the "while" statement (i.e. the conditional branch) hinders the possibility of continuing to reveal independencies among the instructions in the manner specified, since it is not known at this stage (which, as recalled, precedes the actual execution) whether the next instruction is A[j+1]:=A[j]+j (i.e. the condition of the branch has been met), or k:=k+1 (i.e. the condition of the jump has not been met). Of course, the condition will be resolved at run time, but relying on this misses the whole point of predicting in advance (i.e. before actual execution) which instructions are independent with respect to one another, so as to route them, in due course, to separate execution units, thereby accomplishing parallel processing.
To cope with the problem of conditional branches, various branch predictors have been suggested; e.g., when the decoder hits a branch, it predicts the outcome and continues to decode the operands based on this assumption. The instructions that were decoded based on this prediction are marked as speculative. As soon as the branch can be resolved (i.e., its condition and the target address are calculated), all the speculative instructions belonging to that decision are either validated (in case of successful prediction) or discarded from the system (in case of failure). This mechanism was found to be very effective and is therefore used by many of the existing processors (T. A. Diep, C. Nelson and J. P. Shen, Performance Evaluation of the PowerPC 620 Microarchitecture, The 22nd Annual Symposium on Computer Architecture, 1995, pp. 163-174 [Diep95]).
Branch predictors of the kind specified, whilst increasing the level of parallelism (ILP) to a certain extent, are not always successful in their prediction, in which case the instruction-level parallelism is, of course, detracted from, since all the speculative instructions that followed the mis-predicted branch instruction must be discarded.
The ILP accomplished by virtue of the branch predictors does not accommodate the ever-increasing capacity of modern processors in terms of their capability to concurrently process instructions.
In this connection it should be noted that the trend in modern processors is to employ more and more execution units, which of course calls for enhancement of the ILP in order "to keep them all busy", thereby expediting the processing time of the executed application. An example of this trend is manifested by the increase in the number of execution units employed: the known PowerPC.RTM. 601 model employs three units, compared to the succeeding-generation PowerPC.RTM. 620, which already employs six execution units.
The necessity for increased parallelism further stems from the use of known pipelining techniques, which afford overlapping of multiple instructions even when only one execution unit is utilized, and all the more so when multiple execution units are utilized. A detailed discussion of the pipelining concept can be found, e.g., in: David A. Patterson and John L. Hennessy, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers Inc., Section 6.1 (1990).
Having described the improvement in ILP accomplished by coping with control dependency, the description below focuses on other hitherto known techniques for further enhancing ILP by coping with the so-called anti data dependency and output data dependency (referred to also as write-after-read conflict and write-after-write conflict, respectively). It should be noted that in the context of the invention, "write instruction" and/or "read instruction" should be construed as forming part of write-after-write, read-after-write or write-after-read conflicting instructions. For a better understanding of the foregoing, attention is directed to the following exemplary computer program segment written in pseudo-assembly language:
______________________________________
ADD R1, R2, R3
ADD R4, R7, R1
ADD R1, R7, R2
______________________________________
The first instruction sums the contents of registers R2 and R3 and stores the result in register R1. Likewise, the second and third instructions realize the arithmetic operations R4=R7+R1 and R1=R7+R2 respectively.
Whilst, ideally, it would be desirable to route the three specified instructions to three separate execution units, respectively, so as to attain maximal ILP, it is readily seen that the first and second instructions constitute a true data dependency (referred to also as a read-after-write conflict), since the input operand R1 of the second instruction (the read instruction) can be set only after the termination of the execution of the first instruction (the write instruction), i.e. after R1 holds the value of R2+R3. Accordingly, the first and second instructions should be executed serially. The first and the third instructions exemplify the output data dependency (or write-after-write conflict). A write-after-write conflict always involves instructions that write to the same location. In this particular example it is not possible to execute the first and third instructions simultaneously (and only afterwards the second instruction), since the second instruction requires as an input R1=R2+R3. Had the first and third instructions been executed simultaneously, and assuming that the third execution unit is slower (by only a fraction of a second), this would have resulted in R1 holding the value of R2+R3 upon the termination of the first execution unit, and immediately afterwards (i.e. when the third execution unit terminates its calculation) the latter value of R1 being overwritten by a new value (i.e. R1 holding the value of R7+R2). The execution of the second instruction would then be erroneous, since its input operand R1 holds a wrong value (i.e. R7+R2 rather than R2+R3).
The write-after-write conflict is easily resolved, e.g. by employing a so-called register renaming technique (Mike Johnson, Superscalar Microprocessor Design). Thus, for example, the register R1 of the third instruction is assigned to a virtual register R' (which may in practice be a free physical register R.sub.n). Now, the first and the third instructions may be executed simultaneously, and upon termination of the first instruction, the second instruction is executed.
By following this approach, the input operand R1 of the second instruction stores the correct value of R2+R3 whereas the result of the third instruction is temporarily stored in R' (R.sub.n). All that remains to be done is to move the contents of register R' to R1 so as to duly realize the third instruction.
A similar approach is used with respect to the so-called "anti-data" dependency; a detailed discussion of the anti-data and output-data dependencies, as well as the possible solutions to cope with these phenomena, can be found in Mike Johnson, Superscalar Microprocessor Design.
The hitherto known techniques do not address the problem of true data dependency as explained above, and albeit they improve the ILP, they do not bring about sufficient utilization of the multiple execution units employed by present and future processors.
Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen, in their publication Value Locality and Load Value Prediction (Proceedings of the PACT Conference, 1996), have introduced a new technique which purports to address, to a certain extent, the problem of true data dependency.
The proposed technique attempts to overcome the inherent delay in a program's execution that occurs when a "load" instruction is encountered. Normally, a load instruction loads an operand from a source address into a destination register. If the source address is located in a memory other than the fast cache or a physical register, a relatively large number of processor cycles (hereinafter wait cycles) is required for fetching the operand from the source address. (The same applies, of course, to the destination address.)
Consider, for example, the following two instructions in pseudo-assembly language:
______________________________________
LOAD R1, source
ADD R3, R2, R1
______________________________________
As shown, after loading the content of the "source" address into register R1 it is added to R2 and stored in register R3. Obviously, the execution of the second instruction (ADD) can commence only upon the termination of the first instruction, thereby constituting true data dependency.
Since, as specified before, the load instruction may require a few wait cycles until the input operand is fetched, this necessarily slows down the execution of the entire program, due to the constraint that the execution of the ADD instruction commence only upon the termination of the load instruction. This is particularly relevant bearing in mind that a typical computer program includes a few, and occasionally many, load instructions.
Having explained the normal operation of a load instruction in the context of true data dependency, there will now be explained the improvement proposed by Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen, which introduces ILP into the read-after-write conflict in the specific case of the load instruction. The proposed improvement is based on the fundamental premise that the value of a source operand fetched from a source address has not changed since the previous execution of the same individual load instruction. Applying this approach to the previous exemplary instructions results in the following: after the "load" instruction is executed for the first time (which, as recalled, may require a few processor wait cycles), the value stored in the source address is copied to a readily accessible data structure (i.e. an indexed but not tagged hash table). Now, when the same instruction is encountered for the second time, the readily accessible data structure is accessed, and the value retrieved therefrom may be stored immediately in the input operand of the second instruction (ADD) before the actual value is fetched from the source address. This scenario enables expediting the execution of the program, since the second instruction (ADD) is executed before the first instruction (load) has terminated.
All that remains to be done is to commit afterwards the result of the second instruction, similarly to the verification process of branch prediction, i.e., to ascertain that the predicted value (as retrieved from the readily accessible data structure) matches the value as actually retrieved from the source address. If in the affirmative, the second instruction is committed and a certain level of parallelism has been accomplished. If, on the other hand, the values do not match, the result of the second instruction has to be discarded and the instruction must be executed again. The latter case of mis-prediction involves a penalty of a single clock cycle.
The proposed "expedite load" technique resembles to a certain extent the hitherto known "forwarding" technique for enhancing execution. As is well known to those versed in the art, "forwarding" is a technique that is normally used in the case where the output of a given instruction is used as an input of a succeeding instruction (i.e. a read-after-write conflict). In the latter case, the forwarding technique affords skipping conventional "house cleaning" stages, which are normally executed upon termination of the execution of an instruction, by "injecting" the result into the waiting operand of the succeeding instruction (the read instruction) immediately upon its production at the output of the execution unit, thereby expediting the processor's performance.
The proposed "load expedite" technique purports to offer a solution for true data dependency prediction only in the context of load instructions; accordingly, the accomplished ILP is indeed limited. As reported, an improvement of only 3-9% is obtained when the "load expedite" method is implemented in the PowerPC 620.TM. and Alpha 21064.TM..
It should be noted that several other hitherto known per se technologies, such as data pre-fetching and write-cache techniques, are capable of addressing the same problem whilst accomplishing comparable results. In so far as the practical realization of the proposed "load expedite" technique is concerned, it should be noted that in order to benefit from executing multiple mutually dependent load instructions at the same time, multiple load/store units are required to access the memory system, and possibly to carry out parallel verification procedures. However, utilizing multiple load/store units poses undue cost to current and future processor technology in terms of overly large die space and increased power dissipation.
It is accordingly a general object of the present invention to further enhance ILP and, in particular, to offer an enhanced technique to cope with read-after-write instruction conflicts (or true data dependencies).