This invention draws from two different areas: parallel-prefix circuits and superscalar-processor circuits. Parallel-prefix circuits have, in the past, been most often used for parallel computation in machines such as the Connection Machine CM-5 supercomputer. (See, for example, [25, 18, 16, 7, 2].) Throughout this patent, numbers enclosed in square brackets refer to the references cited in Section 4.4 below, each of which is incorporated by reference herein.
Superscalar-processor circuits are used to implement processors that exploit instruction-level parallelism (ILP) using out-of-order execution. Instruction-level parallelism is the parallelism that can be found in a serial instruction stream because some of the operations in the serial chain are actually independent. One strategy to exploit ILP is to execute the instructions in a different order from that specified by the original serial execution, hence "out-of-order execution". A mechanism for out-of-order execution was first described in [44].
The rest of this section first describes parallel-prefix circuits and then describes today's superscalar-processor circuits.
Notation
We use the following notation:
O: The "big-Oh" notation is used to indicate how fast a function grows, ignoring constant factors. Intuitively, we write "g(x) is O(f(x))" if g(x) grows no faster than f(x), for large x, ignoring a constant multiplicative factor.
Ω: The "big-Omega" notation is used similarly. Intuitively, we write "g(x) is Ω(f(x))" if g(x) grows no slower than f(x), for large x, ignoring a constant multiplicative factor.
Θ: The "big-Theta" notation is the intersection of big-Oh and big-Omega. g(x) is Θ(f(x)) exactly when g(x) is O(f(x)) and g(x) is Ω(f(x)). Intuitively, this means that g(x) grows at the same speed as f(x) for large x, ignoring a constant multiplicative factor. See [5] for a complete and exact treatment of the O, Ω, and Θ notation.
lg I is the logarithm, base 2, of I.
log I is the logarithm of I in the natural base e.
Thus, $\lg I = \log_2 I$. Note that the choice of the base of a logarithm makes a difference of only a constant factor. E.g., $\lg I \approx 1.443 \log I$ (equivalently, $\log I \approx 0.693 \lg I$), which means that lg I is Θ(log I) and log I is Θ(lg I). In this document we generally use logarithms base two and binary trees. There may be engineering reasons to use trees of higher degree, because changing the base of the logarithm gives a constant-factor change in performance. Our designs work for any base and any degree of the tree, including trees of mixed degree.
Ceiling: We write $\lceil x \rceil$ (pronounced "the ceiling of x") to be the smallest integer greater than or equal to x.
Append lists: If a and b are two lists, then {a,b} is the concatenation of the two lists.
The set of integers x such that $a \le x \le b$ is denoted by [a, . . . , b].
Base numerals: When we write $0010_2$ the string "0010" should be interpreted as a base two number. Thus $1010_2 = 12_8 = 10_{10} = 10$.
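The notation above can be checked numerically. The following is a minimal Python sketch for illustration only; the helper names (lg, ceiling, concat, int_range) are hypothetical and do not appear elsewhere in this document.

```python
import math

def lg(value):
    """Base-2 logarithm, written lg in the text."""
    return math.log2(value)

def ceiling(x):
    """Smallest integer greater than or equal to x."""
    return math.ceil(x)

def concat(a, b):
    """{a,b}: concatenation of two lists."""
    return list(a) + list(b)

def int_range(a, b):
    """[a, ..., b]: the integers x with a <= x <= b."""
    return list(range(a, b + 1))

# Base numerals: the same value written in base two and in base eight.
binary_value = int("1010", 2)
octal_value = int("12", 8)

# Changing the base of a logarithm only changes a constant factor:
# lg(v) / log(v) equals 1 / ln(2) for any v > 0 other than 1.
ratio = lg(1000.0) / math.log(1000.0)
```

Here binary_value and octal_value both denote ten, and ratio is the constant factor relating lg to the natural logarithm.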
1.1 Parallel Prefix
This section contains a tutorial on parallel-prefix circuits. First we define the prefix problem, then show how to implement it to run fast. Parallel-prefix circuit design is a technique that can often convert linear-time circuits into logarithmic-time circuits. (See, for example, [5] for a discussion of log-depth parallel-prefix circuits. Segmented parallel prefix circuits are the subject of an exercise of [5], and were implemented in the CM-5 supercomputer [25, 18, 16, 7].)
The prefix problem is as follows. One is given an associative operator $\otimes$ with an identity value, I. Given some inputs $x_0, x_1, \ldots, x_{n-1}$ we need to compute $y_0, y_1, \ldots, y_n$ as $y_i = x_0 \otimes x_1 \otimes \cdots \otimes x_{i-1}$, where $y_0$ is defined to be the identity value for $\otimes$. (For example, if $\otimes$ is addition, whose identity is 0, then $y_i = \sum_{j=0}^{i-1} x_j$.)
Sometimes one wants special initial and final values. One can formulate the prefix problem as having an initial value z that is passed to the circuit. In this case we have $y_i = z \otimes x_0 \otimes x_1 \otimes \cdots \otimes x_{i-1}$. This can be viewed as the earlier case simply by renumbering the subscripts so that we have
$$ x'_i = \begin{cases} z & \text{if } i = 0, \text{ and} \\ x_{i-1} & \text{otherwise,} \end{cases} $$
and then performing a parallel prefix on the $x'$ values. Similarly, one would like to get a final output value w from the circuit which is defined to be $w = z \otimes x_0 \otimes x_1 \otimes \cdots \otimes x_n$.
Again, this can be implemented by the earlier case by manipulating subscripts. We simply extend the subscript range to $n+1$ and compute w as $y_{n+1}$.
The natural and easy thing to do is to compute the $y_i$'s serially. First one computes $y_{i-1}$ and then uses that to compute $y_i$ as
$$ y_i = \begin{cases} \text{the identity value} & \text{if } i = 0, \text{ and} \\ y_{i-1} \otimes x_{i-1} & \text{otherwise.} \end{cases} $$
FIG. 1 shows a circuit 10 that computes the prefix operation in linear time. Circuit 10 comprises a plurality of function generators 15, each having two inputs and one output. Each output is connected as one of the inputs to the next function generator and the other input is an x value. It is easy to see that prefix can be computed in time linear in n. It is surprising to many people that the prefix problem can be solved in time logarithmic in n by using a circuit structure known as parallel prefix. The next three sections review the parallel prefix circuit.
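The serial recurrence, together with the renumbering used for an initial value z and a final value w, can be sketched in software as follows. This is an illustrative Python model, not a description of circuit 10. The function names are hypothetical, and w is assumed here to combine z with every supplied input.

```python
from operator import add

def serial_prefix(xs, op, identity):
    """Serial prefix: returns [y_0, ..., y_n] with y_0 = identity and
    y_i = y_{i-1} op x_{i-1}. Uses n sequential applications of op."""
    ys = [identity]
    for x in xs:
        ys.append(op(ys[-1], x))
    return ys

def serial_prefix_with_initial(xs, op, identity, z):
    """Prefix with initial value z, via the renumbering x'_0 = z, x'_i = x_{i-1}.
    Returns (ys, w) where ys[i] = z op x_0 op ... op x_{i-1} and
    w combines z with all of the inputs (an assumption of this sketch)."""
    shifted = [z] + list(xs)
    full = serial_prefix(shifted, op, identity)
    # full[0] is the identity; full[i + 1] = z op x_0 op ... op x_{i-1}.
    return full[1:-1], full[-1]

if __name__ == "__main__":
    print(serial_prefix([1, 2, 3, 4], add, 0))               # [0, 1, 3, 6, 10]
    print(serial_prefix_with_initial([1, 2, 3], add, 0, 10))  # ([10, 11, 13], 16)
```

Each output depends on the previous one, so the chain of operator applications has length n, which is the linear delay that the parallel-prefix structure avoids.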
1.1.1 Log-Time Parallel Prefix
Before reviewing the construction of parallel-prefix circuits in general, we present an example. FIG. 2 shows a parallel-prefix circuit 20 that takes eight inputs, $x_0, x_1, \ldots, x_7$, and computes their prefix sums
$$ y_i = \sum_{j=0}^{i-1} x_j. $$
Circuit 20 comprises fourteen two-input adders 25 connected in a tree-like structure by signal wires 27 as shown. The inputs $x_i$ are provided at the bottom of the circuit, and the outputs $y_i$ come out the bottom, with output $y_i$ coming out just to the left of where input $x_i$ goes in. The identity (zero) goes in at the top and the sum of all the values ($y_8$) comes out at the top. The critical-path length of this circuit is logarithmic in the number of inputs. This circuit can be laid out in VLSI using an H-tree layout [23] with a resulting area of about $A = O(n^2 b^2)$, where b is the number of bits in the result $y_n$. The resulting wire delay is about $O(\sqrt{A})$. We can further optimize the parallel-prefix sum circuit of FIG. 2. If we use a redundant representation (such as the carry-save adder used in Wallace-tree multipliers), with a single final sum at the end, we can perform the entire parallel-prefix sum in only $O(\log n)$ gate delays as opposed to $O(\log^2 n)$. Furthermore, the width of the data values is often smaller at the inputs than at the outputs (for example, when the inputs $x_i$ to a sum are only one bit each but the output is $\log n$ bits, a case which will come up later in this patent). In that case we can carefully size the ALUs so that they take just the right number of bits as inputs and produce the right number of bits as outputs, which saves area and power. One important special case is when the $x_i$'s are one bit each. The problem of summing one-bit inputs is often referred to as the enumeration problem.
In general, a parallel-prefix circuit comprises a tree as shown in FIG. 3. The tree 30 comprises a plurality of treefix modules 35 at its vertices connected by signal wires 37 as shown. The $x_i$ values are input at the leaves of the tree (at the bottom of the figure). The results $y_i$ are also output at the leaves, adjacent to the corresponding $x_i$'s. The identity value I is input at the root of the tree (at the top of the figure) and the result $y_8$ of combining all the $x_i$ values is output at the root of the tree. The signal wires may be several bits wide in order to encode the necessary information. The values along each signal wire have been labeled. We use the notation $p_{i,j}$ to indicate that a particular wire carries the value $x_i \otimes x_{i+1} \otimes \cdots \otimes x_j$. Thus
$$ p_{i,j} = \bigotimes_{k=i}^{j} x_k. $$
(If $j < i$ then $p_{i,j}$ is the identity value.) The circuit computes $y_j = p_{0,j-1}$ for $0 \le j \le 7$. (See [5] for a discussion of how to adapt the circuit of FIG. 3 to compute the special cases of w and z mentioned above.)
A treefix module 35 of FIG. 3 is shown in more detail in FIG. 4. Each treefix module has two function generators 42, 43, three inputs 44, 45, 46, and three outputs 47, 48, 49, arranged in pairs of an input and an output. One pair connects to the circuit above, one to the lower-left and one to the lower-right. There are some integers j, k, and m, with $j < k < m$, such that the data coming from above will be $p_{0,j-1}$, the data coming from the lower-left will be $p_{j,k-1}$ and the data coming from the lower-right will be $p_{k,m-1}$. The treefix module then produces $p_{j,m-1}$, which is output to above, $p_{0,j-1}$, which is output to the lower-left, and $p_{0,k-1}$, which is output to the lower-right. The reader can check that these are in fact the values carried on the wires of FIG. 3. The circuit to compute these values is very easy to design since
$$ p_{j,m-1} = p_{j,k-1} \otimes p_{k,m-1}, $$
$$ p_{0,j-1} = p_{0,j-1}, $$
and
$$ p_{0,k-1} = p_{0,j-1} \otimes p_{j,k-1}. $$
Although the tree in FIG. 3 has a branching factor of two (that is, it is a binary tree), all the parallel-prefix circuits described in this patent can be implemented with an arbitrary branching factor. The choice of an appropriate branching factor depends on the parameters of a particular technology. For illustration, we will show all our circuits with a branching factor of two.
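The behavior of the binary tree of treefix modules can be modeled in software with an upward pass followed by a downward pass. The following Python sketch is illustrative only. The function names are hypothetical, and the recursion stands in for hardware that evaluates all modules of a tree level concurrently.

```python
from operator import add

def up_sweep(xs, op, lo, hi, left_products):
    """Upward pass over the subtree covering xs[lo:hi].
    Returns p_{lo,hi-1} and records p_{j,k-1} (the value arriving from the
    lower-left) for every internal node."""
    if hi - lo == 1:
        return xs[lo]
    mid = (lo + hi) // 2
    left = up_sweep(xs, op, lo, mid, left_products)    # p_{j,k-1}
    right = up_sweep(xs, op, mid, hi, left_products)   # p_{k,m-1}
    left_products[(lo, hi)] = left
    return op(left, right)                             # p_{j,m-1}, sent up

def down_sweep(op, lo, hi, from_above, left_products, ys):
    """Downward pass: from_above is p_{0,lo-1}."""
    if hi - lo == 1:
        ys[lo] = from_above                            # y_j = p_{0,j-1}
        return
    mid = (lo + hi) // 2
    # The lower-left output repeats the value from above: p_{0,j-1}.
    down_sweep(op, lo, mid, from_above, left_products, ys)
    # The lower-right output is p_{0,k-1} = p_{0,j-1} op p_{j,k-1}.
    to_right = op(from_above, left_products[(lo, hi)])
    down_sweep(op, mid, hi, to_right, left_products, ys)

def tree_prefix(xs, op, identity):
    """Returns (ys, total) with ys[i] = x_0 op ... op x_{i-1} and total = y_n."""
    n = len(xs)
    if n == 0:
        return [], identity
    left_products = {}
    total = up_sweep(xs, op, 0, n, left_products)
    ys = [identity] * n
    down_sweep(op, 0, n, identity, left_products, ys)
    return ys, total

if __name__ == "__main__":
    print(tree_prefix([1, 2, 3, 4, 5, 6, 7, 8], add, 0))
    # ([0, 1, 3, 6, 10, 15, 21, 28], 36)
```

For eight inputs this model applies the operator twice at each of the seven internal nodes, fourteen applications in total, which agrees with the fourteen adders of FIG. 2. Only the associativity of the operator is relied upon, so non-commutative operators such as string concatenation also work.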
1.1.2 Segmented Parallel Prefix
A segmented parallel-prefix circuit is also known. The segmented-prefix circuit is similar to the prefix circuit. A segmented-prefix operation comprises a collection of separate prefix operations over adjacent non-overlapping segments of the input $x_0, x_1, \ldots, x_{n-1}$. In addition to providing inputs $x_i$ to the prefix circuit, we provide additional 1-bit inputs $s_i$ called the segment bits. The segment bits indicate where a new segment is about to begin. The output is
$$ y_i = \bigotimes_{j=k_i}^{i-1} x_j, $$
where $k_i = \max\{0, \max\{k < i : s_k = 1\}\}$.
Thus if we have
$$ x = \langle x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9 \rangle = \langle 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 \rangle $$
$$ s = \langle s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9 \rangle = \langle 0, 0, 1, 0, 0, 0, 1, 0, 1, 0 \rangle $$
then
$$ k = \langle k_0, k_1, k_2, k_3, k_4, k_5, k_6, k_7, k_8, k_9 \rangle = \langle 0, 0, 0, 2, 2, 2, 2, 6, 6, 8 \rangle $$
and
$$ y = \langle y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8, y_9 \rangle = \langle 0,\ 1,\ 1+2,\ 3,\ 3+4,\ 3+4+5,\ 3+4+5+6,\ 7,\ 7+8,\ 9 \rangle = \langle 0, 1, 3, 3, 7, 12, 18, 7, 15, 9 \rangle. $$
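The worked example above can be reproduced directly from the definitions of $k_i$ and $y_i$. The following Python sketch is a straightforward, unoptimized transcription of those definitions, given for illustration; the function names are hypothetical.

```python
from operator import add

def segment_starts(s):
    """k_i = max{0, max{k < i : s_k = 1}} for every index i."""
    ks = []
    last = 0
    for i in range(len(s)):
        ks.append(last)      # only segment bits strictly before i are considered
        if s[i] == 1:
            last = i
    return ks

def segmented_prefix(xs, s, op, identity):
    """Returns (ks, ys) with ys[i] = x_{k_i} op ... op x_{i-1}."""
    ks = segment_starts(s)
    ys = []
    for i, k in enumerate(ks):
        acc = identity
        for j in range(k, i):
            acc = op(acc, xs[j])
        ys.append(acc)
    return ks, ys

if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    s = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
    ks, ys = segmented_prefix(x, s, add, 0)
    print(ks)  # [0, 0, 0, 2, 2, 2, 2, 6, 6, 8]
    print(ys)  # [0, 1, 3, 3, 7, 12, 18, 7, 15, 9]
```

This direct form takes quadratic time in the worst case and is meant only to make the definition concrete.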
A linear-time segmented parallel-prefix circuit 50 is shown in FIG. 5. Circuit 50 comprises a plurality of two-input function generators 55. One input to each function generator is the output of a two-input multiplexer (MUX) 58, which output is selected by a segment bit. One input to each MUX is the identity value. The other input is the output of the preceding function generator. The other input to each function generator is an x value. This is similar to the circuit of FIG. 1 except that MUXes have been added to handle the segment bits.
The segmented parallel-prefix tree circuit has the same structure as the ordinary parallel-prefix tree, except that we modify the treefix module to compute an additional segmentation signal $s_{j,k-1}$, which is passed up the tree. The value $s_{j,k-1}$ indicates whether any of the segment bits $s_j, \ldots, s_{k-1}$ is equal to one. FIG. 6 shows a segmented parallel-prefix circuit 60 with eight leaf nodes (n=8). The circuit comprises a plurality of treefix modules 65 at the vertices connected by signal wires 67 as shown. The tree uses the slightly modified treefix module 65 shown in FIG. 7. Circuit 65 comprises two function generators 72, 73, two multiplexers 75, 76 and an OR gate 78. The circuit also comprises two inputs 81, 82 and one output 83 for the segment bits, and the same three inputs 84, 85, 86 and three outputs 87, 88, 89 as in the treefix module of FIG. 4. OR gate 78 computes the segmentation signal that will be passed up. Multiplexer (MUX) 75 operates so that no value will be added from above to the value from the left subtree if there is a segment bit in the left subtree. MUX 76 operates so that no value will be added from the left subtree if there is a segment bit in the right subtree.
A segmented parallel-prefix circuit can be viewed as a prefix computation on values which are pairs $\langle P, S \rangle$, where P is the value being operated on by the original operator and S is a segmentation bit. Writing the left operand with subscript l and the right operand with subscript r, we have the following operator:
$$ \langle P_l, S_l \rangle \otimes_{\mathrm{seg}} \langle P_r, S_r \rangle = \begin{cases} \langle P_r, 1 \rangle & \text{if } S_r = 1, \\ \langle P_l \otimes P_r, S_l \rangle & \text{otherwise.} \end{cases} $$
We can show that this operator is associative. To do this we show that
$$ (\langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b, S_b \rangle) \otimes_{\mathrm{seg}} \langle P_c, S_c \rangle = \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} (\langle P_b, S_b \rangle \otimes_{\mathrm{seg}} \langle P_c, S_c \rangle). $$
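Proof: If $S_c = 1$ then
$$ \begin{aligned} (\langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b, S_b \rangle) \otimes_{\mathrm{seg}} \langle P_c, S_c \rangle &= (\langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b, S_b \rangle) \otimes_{\mathrm{seg}} \langle P_c, 1 \rangle \\ &= \langle P_c, 1 \rangle \\ &= \langle P_c, S_c \rangle \end{aligned} $$
and
$$ \begin{aligned} \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} (\langle P_b, S_b \rangle \otimes_{\mathrm{seg}} \langle P_c, S_c \rangle) &= \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} (\langle P_b, S_b \rangle \otimes_{\mathrm{seg}} \langle P_c, 1 \rangle) \\ &= \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_c, 1 \rangle \\ &= \langle P_c, 1 \rangle \\ &= \langle P_c, S_c \rangle. \end{aligned} $$
Otherwise $S_c = 0$, and we consider the two possible values of $S_b$. If $S_b = 1$ then
$$ \begin{aligned} (\langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b, 1 \rangle) \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle &= \langle P_b, 1 \rangle \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle \\ &= \langle P_b \otimes P_c, 1 \rangle \end{aligned} $$
and
$$ \begin{aligned} \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} (\langle P_b, 1 \rangle \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle) &= \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b \otimes P_c, 1 \rangle \\ &= \langle P_b \otimes P_c, 1 \rangle. \end{aligned} $$
If $S_b = 0$ then
$$ \begin{aligned} (\langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b, 0 \rangle) \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle &= \langle P_a \otimes P_b, S_a \rangle \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle \\ &= \langle (P_a \otimes P_b) \otimes P_c, S_a \rangle \end{aligned} $$
and
$$ \begin{aligned} \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} (\langle P_b, 0 \rangle \otimes_{\mathrm{seg}} \langle P_c, 0 \rangle) &= \langle P_a, S_a \rangle \otimes_{\mathrm{seg}} \langle P_b \otimes P_c, 0 \rangle \\ &= \langle P_a \otimes (P_b \otimes P_c), S_a \rangle, \end{aligned} $$
and the two results are equal because $\otimes$ is associative.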
Thus, a segmented parallel-prefix is associative, and our tree switch circuit can be viewed as an ordinary parallel-prefix circuit with a certain associative operator. (See [5, Exercise 30-1].)
1.1.3 Variations on Prefix Circuits
Often, the prefix is modified slightly from the formulae given above. For example, an inclusive prefix has $y_i = \bigotimes_{k=0}^{i} x_k$ instead of $y_i = \bigotimes_{k=0}^{i-1} x_k$.
An inclusive segmented prefix has $y_i = \bigotimes_{j=k_i+1}^{i} x_j$ instead of $y_i = \bigotimes_{j=k_i}^{i-1} x_j$.
Sometimes it is useful to have a backwards prefix operation. An exclusive backwards prefix operation is $y_i = \bigotimes_{k=i+1}^{N-1} x_k$.
Similarly, an inclusive version can be made with and without segmentation.
When implementing these circuits, one must be careful to get the "fencepost" conditions right. That is, the lower and upper bounds of the summation index must be thought through carefully to avoid designed-in errors in the circuits.
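The fencepost differences among the unsegmented variants above can be illustrated with a minimal software sketch. It is a sequential reference model, not a parallel circuit. The function names and the explicit `identity` argument (used for the empty combination at the boundary) are assumptions made for illustration and do not appear in the text.

```python
def exclusive_prefix(xs, op, identity):
    """y[i] = x[0] op ... op x[i-1]; y[0] is the identity (empty combination)."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)      # emit before folding in x[i]
        acc = op(acc, x)
    return out


def inclusive_prefix(xs, op, identity):
    """y[i] = x[0] op ... op x[i]."""
    out, acc = [], identity
    for x in xs:
        acc = op(acc, x)     # fold in x[i] before emitting
        out.append(acc)
    return out


def exclusive_backwards_prefix(xs, op, identity):
    """y[i] = x[i+1] op ... op x[N-1]; y[N-1] is the identity."""
    out, acc = [], identity
    for x in reversed(xs):
        out.append(acc)
        acc = op(x, acc)     # keep left-to-right operand order
    out.reverse()
    return out


def concat(a, b):
    # A non-commutative operator makes ordering mistakes visible.
    return a + b
```

Using string concatenation as the operator is a convenient way to check the bounds, since each output then spells out exactly which inputs were combined and in which order.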
Note: Sometimes prefix circuits are called "scan chains". (See, e.g., [9, 27].)
1.2 Superscalar Processors
The second background area for this invention is superscalar processors. FIGS. 8-10 show a six-stage pipeline processor 100 illustrative of how today's superscalar processors are organized. The stages are Fetch, Rename, Analyze, Schedule, Execute, and Broadcast. An example set of three instructions is shown as they propagate through each of the stages.
The Fetch stage comprises an arithmetic logic unit (ALU) 105, a program counter 110, a multiplexer 115, and a pipeline register 120. Program counter 110 keeps track of the next instruction to be executed. The program counter's value is used as a memory address to fetch several words from an instruction cache ("I-cache", not shown). In this example, four instructions are fetched from the I-cache. The new program count is computed by adding a value to the old program count. The value added depends on prediction logic (labeled "PREDICT"). If the prediction logic predicts that there are no taken branches in the next four instructions, then the new program count is the old program count plus 4. If the prediction logic predicts that one of the instructions is a taken branch, then the predictor selects the branching instruction, and the "offset" field of the instruction is added to the old program count to get the new program count. The logic for handling absolute branches and mispredictions is not shown here. The four instructions are latched in pipeline register 120 between the Fetch and Rename stages.
The Rename stage comprises renaming logic 125 including a renaming table 130 and a pipeline register 135. The rename stage takes instructions and rewrites them to make it easier to execute them in parallel. In the example shown the instructions are
0: R0:=R1+R2
1: R3:=R0/R1
2: R0:=R1/R5
Note that Instruction 2 is logically independent of Instruction 1, and they may run in parallel. Instruction 2 writes to Register R0, however, and if that write happens before Instruction 1 reads from Register R0, then the wrong value will be used by Instruction 1. This problem is referred to as a "write-after-read" hazard. To allow parallel execution of Instructions 1 and 2 without suffering from the write-after-read hazard, the renaming stage rewrites the instructions to use a different set of registers, called tags. The program is transformed into
0: T42:=T35+T40
1: T43:=T42/T35
2: T44:=T35/T41
In this case we assumed that the most recent writer of Register R1 had been renamed to write to Tag T35, the most recent R2 to T40, and the most recent R5 to T41. Three new tags are allocated: T42, T43, and T44, and the registers are renamed appropriately. To facilitate this renaming, a Renaming Table is used to provide a mapping from register names to tag names.
FIG. 11 shows the contents of the renaming table before and after renaming each of the three instructions mentioned above. Note that the table renames all the registers, not just the ones mentioned earlier. Thus, initially R0 is renamed to T30, and then after renaming Instruction 0, R0 is renamed to T42, and then after renaming Instruction 2, R0 is renamed to T44. Register R3 was initially renamed to T25 and was renamed to T43 after Instruction 1. The renaming for R2 was not affected by these three instructions, since none of the instructions mentioned R2 as a destination register.
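The renaming walk-through above can be mirrored in a short software sketch. It models only the table update, not the hardware. The `rename` function, the tuple encoding of instructions, and the in-order free-tag list are assumptions made for illustration; the register-to-tag values are the ones given in the example.

```python
def rename(instructions, table, free_tags):
    """Rename (dest, src1, op, src2) tuples in program order.

    Returns the renamed instructions and the final renaming table.
    """
    table = dict(table)      # copy so the caller's table is left untouched
    free = list(free_tags)
    renamed = []
    for dest, src1, op, src2 in instructions:
        # Sources are looked up BEFORE the destination is remapped, so an
        # instruction that reads and writes the same register sees the old tag.
        t1, t2 = table[src1], table[src2]
        new_tag = free.pop(0)            # allocate the next unused tag
        table[dest] = new_tag
        renamed.append((new_tag, t1, op, t2))
    return renamed, table


# Initial mapping and free tags taken from the example in the text.
initial = {"R0": "T30", "R1": "T35", "R2": "T40", "R3": "T25", "R5": "T41"}
program = [
    ("R0", "R1", "+", "R2"),   # 0: R0 := R1 + R2
    ("R3", "R0", "/", "R1"),   # 1: R3 := R0 / R1
    ("R0", "R1", "/", "R5"),   # 2: R0 := R1 / R5
]
renamed, final = rename(program, initial, ["T42", "T43", "T44", "T45"])
```

Because Instruction 2 receives its own tag (T44), it no longer shares a destination with Instruction 0, which is what removes the write-after-read hazard with Instruction 1.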
The circuit for identifying free tags is not shown in FIG. 11. Such a circuit would, in this case, identify up to four unused tags. In our example we assume that the four unused tags are T42 through T45. Some systems allocate and deallocate the tags so that contiguous tags are always allocated, whereas some systems can allocate an arbitrary set of four tags.
The exact implementation of the Rename stage varies from processor to processor. Although we have shown the Rename stage renaming registers to successive tags (T42, T43, and T44), in general, there is no requirement that sequentially allocated tags have sequential numbers. Some superscalar processor implementations do have such a requirement. Others do not. Those without such a requirement need circuitry to identify, on every clock cycle, up to four unused tags out of the list of unused tags. The renaming table also need not be directly addressed by the logical register number. The Digital Alpha 21264 compresses entries out of the renaming table when they are no longer needed. This means that the table requires an associative lookup, instead of a direct lookup [8].
After renaming, instructions are sent via pipeline register 135 to the Analyze Dependencies stage. This stage includes a reordering buffer 140 and a pipeline register 145. In this stage, instructions whose inputs are all available are identified as ready to run. Reordering buffer 140 keeps track of this dependency information. FIG. 9 illustrates some of the instructions stored in the reordering buffer but does not depict additional information that is ordinarily kept there. Instructions are stored in the reordering buffer in sequential order. The "Old" and "New" pointers point at the oldest and the newest instruction in the sequence, respectively. Buffer entries to the left of "Old" and to the right of "New" are not currently in use. A signal is produced for each instruction in the buffer indicating whether it is ready to run, and if so, what execution resource it needs. In FIG. 9, an arrow 143 from buffer 140 to register 145 identifies those instructions that are ready to run, and the absence of an arrow identifies those instructions not ready to run. On the arrow, the labels "Mem", "Add", or "Div" indicate whether that instruction needs to access memory, an ALU capable of adding, or an ALU capable of dividing. In our example, the instruction at buffer entry 38 is ready to add, the one at entry 41 is ready to access memory, the one at entry 42 is ready to add, and the one at entry 44 is ready to divide. This information is stored in pipeline register 145 between the Analyze Dependencies stage and the Schedule stage.
The Schedule stage comprises a scheduler 150 and a pipeline register 155. This stage assigns specific execution resources to the collection of instructions that are ready to run. In our example, the instructions at entries 38, 41, and 44 are assigned to particular functional units, and the instruction at entry 42 is not scheduled. Scheduler 150 obtains the actual operands from reordering buffer 140, and feeds them via pipeline register 155 to the appropriate functional units in the Execute stage.
The Execute stage comprises an ALU 160 for adding, an ALU 165 for dividing, an interface to memory (168), and a pipeline register 170. This stage executes each of the instructions provided by scheduler 150 so as to actually perform the arithmetic or memory operation. When the result is computed, it notifies the Broadcast stage 175.
The Broadcast stage takes computed results from the Execute stage and broadcasts them back to the Analyze Dependencies stage in FIG. 9. All instructions in the reordering buffer associatively listen to the broadcast. As a result of the broadcast, more instructions may become ready to execute because their dependencies have been satisfied.
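The wake-up effect of a broadcast can be sketched in software as follows. This is a behavioral model only. The `broadcast` function and the per-entry `waiting` set are hypothetical, and the assumption that tags T35, T40, and T41 are already available is made purely for illustration.

```python
def broadcast(entries, result_tag):
    """Deliver result_tag to every buffer entry; return dests that became ready."""
    newly_ready = []
    for entry in entries:
        # Every entry "listens" and compares the broadcast tag with its own
        # outstanding operand tags (an associative match in hardware).
        if result_tag in entry["waiting"]:
            entry["waiting"].discard(result_tag)
            if not entry["waiting"]:
                newly_ready.append(entry["dest"])
    return newly_ready


# The three renamed instructions from the Rename example, assuming (hypothetically)
# that T35, T40, and T41 have already been computed.
entries = [
    {"dest": "T42", "waiting": set()},      # T42 := T35 + T40, ready
    {"dest": "T43", "waiting": {"T42"}},    # T43 := T42 / T35, waits on T42
    {"dest": "T44", "waiting": set()},      # T44 := T35 / T41, ready
]
woken = broadcast(entries, "T42")           # the result of Instruction 0 arrives
```

In this model the loop over all entries stands in for the associative comparison that each reordering-buffer entry performs in parallel.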
Different processors reuse entries in their reordering buffer differently. Those that assign tags serially can use each assigned tag as a direct address into the reordering buffer at which to store the renamed instruction. This is the case in FIG. 9. These processors, including the Alpha 21264, write values to a canonical register file by compressing the entries in the reorder buffer so that the instruction in Buffer Entry 0 is always the oldest [8]. When scaled up, however, the circuitry used in the 21264 for compressing the window requires large area and has long critical-path lengths.
When the reorder buffer fills up, some processors exhibit performance anomalies. Some processors empty the reorder buffer, instead of wrapping around, and commit all results to the register file before they start back at the beginning. Some processors wrap around, but start scheduling newer instructions instead of older ones, which can hurt performance. (See [33] for an example of a circuit that works that way.)
In some processors there is an additional decode stage after the fetch stage, but before the rename stage. This stage may interpret a complex instruction stream, translating it to a simpler instruction stream. The Intel Pentium Pro does this.
In most processors there is bypass logic to allow data to move directly from the Broadcast stage to the Execute stage, bypassing the Analyze Dependencies stage and the Schedule stage. We do not show that bypass logic here, but that logic also has long critical-path lengths and large area in today's designs.
1.2.1 Microprocessor Performance
The standard model of microprocessor performance [14] says that the time to run a program is $T = N \cdot \mathrm{CPI} \cdot \tau$, where
N is the number of instructions needed to run the program,
CPI is the number of clock periods per instruction, and
$\tau$ is the length of a clock period in seconds, i.e. the cycle time.
The value of $\tau$ is determined by the critical-path length through any pipeline stage, that is, the longest propagation delay through any circuit, measured in seconds. Propagation delay consists of delays through both gates and wires, or alternately of delays through transistors driving RC networks. We are not changing N or directly changing CPI, but rather we aim to reduce the clock cycle by redesigning the processor to use circuits with reduced critical-path length.
One way to avoid slowing down the clock is by breaking down the processor into more pipeline stages. Increasing the number of pipeline stages offers diminishing returns, however, as pipeline registers begin to take up a greater fraction of every clock cycle and as more clock cycles are needed to resolve data and control hazards. In contrast, shortening the critical path delay of the slowest pipeline stage translates directly into improved program speed as the clock period decreases and the other two parameters remain unchanged.
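The effect of the cycle time on run time under this model can be checked with a small calculation. The instruction count, CPI, and clock periods below are hypothetical values chosen only to illustrate the formula; they are not measurements of any processor.

```python
def run_time(n_instructions, cpi, cycle_time_s):
    """Run time in seconds: instruction count times CPI times clock period."""
    return n_instructions * cpi * cycle_time_s


# Hypothetical workload: one million instructions at one clock per instruction.
N = 1_000_000
CPI = 1.0

baseline = run_time(N, CPI, 2e-9)    # 2 ns clock period (assumed)
shortened = run_time(N, CPI, 1e-9)   # critical path halved (assumed)

# With N and CPI unchanged, the speedup equals the ratio of the clock periods.
speedup = baseline / shortened
```

Halving the clock period while holding the other two parameters fixed halves the run time, which is the direct translation described above.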
The critical-path delays of many of today's processor circuits do not scale well. For example, Palacharla, Jouppi, and Smith [32] find that many of the circuits in today's superscalars have asymptotic complexity $\Omega(I^2 + W^2)$, where I is the issue width (i.e., the maximum number of instructions that are fetched in parallel from the cache) and W is the window size (i.e., the maximum number of instructions within the processor core) of the processor. While delays appear to be practically linear for today's processors, optimized for I equal to four, and W in the range of 40 to 56, the quadratic terms appear to become important for slightly larger values of I and W. (Note that for today's processors with large W the window is typically broken in half with a pipeline delay being paid elsewhere. An HP processor sets W=56 [10]. The DEC 21264 sets W=40 [8]. Those systems employ two windows, each half size, to reduce the critical-path length of the circuits. Communicating between the two halves typically requires an extra clock cycle.) Some of today's circuits have critical-path length that grows at least as fast as $\Omega(I^4)$ and have many components with area and power consumption that grow quadratically, as $\Theta(I^2)$. (See [30, 32, 1].) Increasing issue widths and increasing window sizes are threatening to explode the cycle time of the processor.
We have found that a very useful improvement to the parallel prefix circuit can be made by allowing the prefix operation to "wrap around" from the end back to the beginning. This extension is called Cyclic Segmented Parallel Prefix (CSPP).
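The wrap-around behavior can be illustrated with a quadratic-time reference model. It is not the logarithmic-depth circuit, and the precise definition is assumed here rather than taken from the text: the prefix is exclusive, a set segment bit marks the element that starts a segment, indices are taken modulo the input length, and at least one segment bit is set. The function name `cspp` is hypothetical.

```python
def cspp(xs, seg, op):
    """Reference model of a cyclic segmented exclusive prefix (assumed semantics).

    y[i] = x[k] op x[k+1] op ... op x[i-1], indices mod n, where k is the
    nearest position at or before i-1 (going backwards cyclically) whose
    segment bit is set.
    """
    n = len(xs)
    if not any(seg):
        raise ValueError("at least one segment bit must be set")
    out = []
    for i in range(n):
        # Walk backwards from i-1, wrapping past index 0, to the segment start.
        k = (i - 1) % n
        while not seg[k]:
            k = (k - 1) % n
        # Combine forward from the segment start up to, but excluding, i.
        acc = xs[k]
        j = (k + 1) % n
        while j != i:
            acc = op(acc, xs[j])
            j = (j + 1) % n
        out.append(acc)
    return out


# One segment starting at index 2; outputs at indices 0 and 1 wrap around.
ys = cspp(["a", "b", "c", "d"], [False, False, True, False], lambda p, q: p + q)
```

With string concatenation as the operator, the outputs at indices 0 and 1 include inputs from the end of the array, which is the difference from an ordinary (non-cyclic) segmented prefix.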
Further, such CSPP circuits are useful to improve the performance of many of the circuits in a superscalar processor. In some situations, it is especially advantageous to use CSPP because it avoids performance penalties in certain situations, for example, when the window fills up or wraps around. This part of the invention also contains a number of other novel circuits that can improve the performance of superscalar processors but that are not prefix circuits.
Further, it is possible to completely reorganize the processor to take advantage of CSPP to do all the scheduling and data movement. We call the resulting processor an Ultrascalar processor.
The circuits used in our invention all grow much more slowly than those of the superscalar processor, with gate delays of $O(\log I + \log W)$ and wire delays of $O(\sqrt{I} + \sqrt{W})$ for memory bandwidth comparable to today's processors. The asymptotic advantage of the Ultrascalar over today's circuits translates to perhaps an order-of-magnitude or more advantage when W is on the order of several hundreds or thousands, a design point advocated by [35].
The Ultrascalar processor, our completely reorganized architecture, breaks the scalability barrier by restructuring the microarchitecture of the processor. The Ultrascalar turns the processor's datapath into a logarithmic-depth network that efficiently passes data from producer instructions to consumer instructions within the reordering window. The network eliminates the need for separate renaming logic, wake-up logic, bypass logic, and multi-ported register files.