The present invention relates to computer architecture and, more particularly, to a method for computing prefix sums and the application of the method to allocating computational resources to concurrently executable tasks, for example, ensuring conflict-free access to multiple single-ported register files.
The definition of a prefix sum is as follows: given an input base, x, and an input array y of k elements, y1 through yk, the prefix sum of the base and the array is an output array z of k+1 elements, z0 through zk, whose elements are:
z0=x
z1=x+y1
z2=x+y1+y2
. . . 
zk=x+y1+y2+ . . . +yk
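By way of illustration only, the definition above can be sketched in software as follows. This is a minimal Python sketch of the arithmetic, not of the hardware instruction described later; the function name `prefix_sum` is hypothetical.

```python
def prefix_sum(x, y):
    """Return [z0, ..., zk], where z0 = x and zi = x + y1 + ... + yi."""
    z = [x]
    for element in y:
        # Each output element is the previous output plus the next input element.
        z.append(z[-1] + element)
    return z


result = prefix_sum(10, [1, 2, 3])  # [10, 11, 13, 16]
```

Note that the output array has one more element than the input array, because z0 holds the base itself.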
The latest generation of computer processors is capable of overlapping the performance of several instructions, and even of issuing several instructions per clock cycle. These instructions may belong to one or several “computational threads”. Methods known to the art for coordination among instructions include scoreboarding and Tomasulo's algorithm for uniprocessors, used primarily to efficiently enforce precedence among performed instructions; and “fetch and add” for multiprocessors. These methods are described in standard references, such as “Computer Architecture, a Quantitative Approach”, by John L. Hennessy and David A. Patterson, and “Highly Parallel Computing”, by George S. Almasi and Allan Gottlieb. As the instruction level parallelism of uniprocessors increases, some of the coordination efforts needed on uniprocessors will begin to resemble those needed on multiprocessors. A sufficiently low-level implementation of prefix summation would be a valuable additional tool for coordinating multiple instructions from a single thread or from multiple computational threads, as well as being a useful primitive operation for higher-level applications to use. It would be highly advantageous to have both a low-level implementation of prefix summation and methods of coordinating multiple computational threads that exploit such an implementation.
According to the present invention there is provided a method for performing a prefix sum of a base and an input array including at least one input element, comprising the step of providing a prefix sum instruction in an instruction set architecture of a microprocessor.
According to the present invention, there is provided a method for performing a base-zero prefix sum of an array including at least one element, comprising the step of providing a base-zero prefix sum instruction in an instruction set architecture of a microprocessor.
According to the present invention, there is provided a method for performing a base-zero suffix sum of an array including at least one element, comprising the step of providing a base-zero suffix sum instruction in an instruction set architecture of a microprocessor.
According to the present invention, there is provided a functional unit for performing the prefix sum of a base and an input array including at least one input element, comprising: (a) a first logical unit, for performing a base-zero prefix sum of the input array, thereby providing a first intermediate array including at least one element, one of the at least one element being a last element; and (b) a second logical unit, for performing a base-zero suffix sum of the input array, thereby providing a second intermediate array including at least one element.
According to the present invention, there is provided a method of providing conflict-free access to a plurality of register files for a plurality of computational cycles, each of the computational cycles requiring access to at least one value having an ordinal number, there being a largest of the at least one ordinal number, each of the computational cycles having a counter, the method comprising the steps of: for each computational cycle: for each value, adding the counter to the ordinal number, thereby providing a serial number; thereby providing, for each computational cycle, a largest serial number.
To date, no microprocessor has included a prefix sum instruction in its instruction set architecture. The present invention is precisely that: the inclusion of a prefix sum instruction in the instruction set architecture. Such an instruction is expressed in instruction code using syntax of the form
PS Ri Rj.
Here, “PS” is the instruction name, and Ri and Rj are registers. One PS instruction by itself adds the value in register Ri to the value in register Rj, returns the result to register Ri, and stores the original value of Ri in Rj. Any other syntax having the same meaning, and including an instruction label field and at least two operand fields, may be used instead. In and of itself, this instruction has an effect similar to that of an “add” instruction. The difference between the PS instruction and an “add” instruction is that several PS instructions may be cascaded into a multiple-PS instruction: for example, the sequence of instructions
PS R0 R1
PS R0 R2
PS R0 R3
. . . 
PS R0 Rk
performs the prefix sum of the input base, stored in register R0, and the input array, stored in registers R1 through Rk. Upon completion of this sequence of instructions, register R1 contains the value originally stored in R0; register R2 contains the sum of the values originally stored in registers R0 and R1; register R3 contains the sum of the values originally stored in registers R0 through R2; and so on until register Rk, which contains the sum of the values originally stored in registers R0 through Rk−1. Finally, R0 contains the sum of all the values originally stored in all k+1 registers R0 through Rk. The microprocessor, through static analysis and dynamic analysis, and through permuting the order of instructions, recognizes a plurality of consecutive single PS instructions, which can comprise a multiple-PS instruction without changing the semantics of the original code. The PS instructions then are decoded collectively as a prefix summation, if possible replacing groups of two or more consecutive single PS instructions by multiple-PS instructions (the allowed multiplicity in a multiple-PS instruction may vary greatly among microprocessors whose instruction set architecture includes a prefix sum).
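The register contents described above can be checked with a small software model. The following Python sketch simulates single PS instructions executed one after another on a dictionary standing in for the register file; the function name `ps` and the dictionary representation are hypothetical and say nothing about how a microprocessor would decode or fuse the instructions.

```python
def ps(regs, i, j):
    """Model of a single PS instruction: Ri <- Ri + Rj, Rj <- original Ri."""
    original_base = regs[i]
    regs[i] = original_base + regs[j]
    regs[j] = original_base


# Input base in R0, input array in R1 through R3.
regs = {"R0": 10, "R1": 1, "R2": 2, "R3": 3}
for name in ("R1", "R2", "R3"):
    ps(regs, "R0", name)

# regs is now {"R0": 16, "R1": 10, "R2": 11, "R3": 13}
```

In this model R1 through R3 end up holding the first three elements of the prefix sum output array, and R0 holds the last element, the sum of all four original values.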
Prefix summation may be implemented using the instructions of existing instruction sets, for example as a balanced binary tree. Balanced binary tree algorithms for prefix summation are well-known in the art, being presented, for example, in Joseph JáJá, “An Introduction to Parallel Algorithms”, pp. 43-49, which is incorporated by reference for all purposes as if fully set forth herein. Preferred embodiments of the present invention, however, implement prefix summation in hardware, so that short prefix summations may be completed within one clock cycle and therefore used, for example, for the allocation of computational resources to concurrent tasks, including the allocation of memories and functional units, and including load balancing among tasks coming from single or multiple computational threads. For example, suppose there are three computational threads, each needing three independent adders for its next three instructions, suppose that there are five computational units that can serve as adders, and suppose that each addition requires one clock cycle. Prefix sums may be used to assign unique serial numbers to the instructions so that the adders can be allocated to the instructions and the nine additions completed in two clock cycles. Note that in this application it does not matter in which order the elements of the prefix sum output array are computed.
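The adder allocation example above can be sketched as follows. The prefix sum of the per-thread request counts, with a base of zero, gives each thread the first serial number of its block of instructions. The mapping from a serial number to a clock cycle and an adder (integer division and remainder by the number of adders) is an assumption made for this illustration; the text states only that the serial numbers are unique. The names `NUM_ADDERS`, `requests`, `offsets` and `schedule` are hypothetical.

```python
from itertools import accumulate

NUM_ADDERS = 5
requests = [3, 3, 3]  # number of adders requested by each of the three threads

# Prefix sum with base 0: offsets[t] is the first serial number given to thread t.
offsets = list(accumulate(requests, initial=0))  # [0, 3, 6, 9]

schedule = {}
for thread, count in enumerate(requests):
    for n in range(count):
        serial = offsets[thread] + n
        # Assumed mapping: serial number -> (clock cycle, adder index).
        schedule[(thread, n)] = (serial // NUM_ADDERS, serial % NUM_ADDERS)

cycles_needed = max(cycle for cycle, _ in schedule.values()) + 1  # 2
```

Because the serial numbers 0 through 8 are distinct, no two instructions are assigned the same adder in the same cycle, and the nine additions fit in two cycles on five adders.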
In a microprocessor configured to do more than one prefix sum concurrently, the two register fields of the prefix sum instruction can serve to indicate to the microprocessor that several prefix sums are to be done in parallel. Specifically, if two or more different values appear in the base field of cascaded PS instructions, and the sets of values in the array field associated with the various base values are disjoint, then each base value is associated with a different prefix sum, and all the prefix sums, being mutually independent, are performed concurrently.
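As a rough software model of this grouping, the sketch below collects cascaded PS instructions by the value in their base field and checks that the resulting groups do not share registers. The helper names are hypothetical, and the additional check that no array register is also used as a base is an assumption of the sketch rather than a condition stated above.

```python
def ps(regs, base, elem):
    """Model of a single PS instruction: base <- base + elem, elem <- original base."""
    original = regs[base]
    regs[base] = original + regs[elem]
    regs[elem] = original


def group_by_base(program):
    """Map each base register to the array registers cascaded on it, in order."""
    groups = {}
    for base, elem in program:
        groups.setdefault(base, []).append(elem)
    return groups


def are_independent(groups):
    """True if no register appears in more than one role or group."""
    seen = set(groups)  # base registers
    for elems in groups.values():
        for elem in elems:
            if elem in seen:
                return False
            seen.add(elem)
    return True


program = [("R0", "R1"), ("R4", "R5"), ("R0", "R2"), ("R4", "R6")]
groups = group_by_base(program)  # {"R0": ["R1", "R2"], "R4": ["R5", "R6"]}

initial = {"R0": 10, "R1": 1, "R2": 2, "R4": 100, "R5": 5, "R6": 6}

# Reference result: execute the instructions one by one in program order.
sequential = dict(initial)
for base, elem in program:
    ps(sequential, base, elem)

# Grouped result: each prefix sum is carried out on its own.
grouped = dict(initial)
if are_independent(groups):
    for base, elems in groups.items():
        for elem in elems:
            ps(grouped, base, elem)
```

When the groups are independent, the grouped execution leaves the registers in the same state as the original interleaved sequence, which is what allows the prefix sums to be performed concurrently.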
The scope of the present invention also includes two special cases of prefix summation. The first special case is the case of a “base-zero” prefix sum, in which the input base x is hardwired to zero. The second special case is the case of a base-zero suffix sum. This is the same as a base-zero prefix sum, except that the elements of the input array are subtracted from zero, in reverse order. In other words, the base-zero suffix sum of an input array y of k elements, y1 through yk, is an output array z, of k+1 elements, whose elements are:
z0=0
z1=−yk
z2=−yk−y(k−1)
. . . 
z(k−1)=−yk−y(k−1)− . . . −y2
zk=−yk−y(k−1)− . . . −y2−y1
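The suffix sum definition above can be illustrated with a minimal Python sketch of the arithmetic only; the function name `base_zero_suffix_sum` is hypothetical.

```python
def base_zero_suffix_sum(y):
    """Return [z0, ..., zk]: start at zero and subtract yk, y(k-1), ..., y1 in turn."""
    z = [0]
    for element in reversed(y):
        z.append(z[-1] - element)
    return z


result = base_zero_suffix_sum([1, 2, 3])  # [0, -3, -5, -6]
```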
The preferred instruction for a base-zero prefix sum is expressed in instruction code using syntax of the form
BZPS Ri.
Here, “BZPS” is the instruction name, and Ri is a register. Note that because the input base is fixed at zero, only one register field is needed in the instruction. Any other syntax having the same meaning, and including an instruction label field and at least one operand field, may be used instead. A single base-zero prefix sum instruction has an effect similar to a no-op: adding zero to the number stored in the register. Like the general prefix sum instruction, the base-zero prefix sum instruction acquires significance when it is cascaded: for example, the sequence of instructions
BZPS R1
BZPS R2
BZPS R3
. . . 
BZPS Rk
performs the base-zero prefix sum of the input array stored in registers R1 through Rk. Upon completion of this sequence of instructions, register R1 contains the value originally stored therein; register R2 contains the sum of the values originally stored in registers R1 and R2; register R3 contains the sum of the values originally stored in registers R1 through R3; and so on until register Rk, which contains the sum of all the values originally stored in all k registers R1 through Rk.
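The effect of the cascade described above can be modeled in a few lines of Python. The sketch assumes that a cascade behaves as if a running sum, starting from the hardwired zero base, were carried from one BZPS instruction to the next; this running-sum model and the name `bzps_cascade` are assumptions made for illustration only.

```python
def bzps_cascade(regs, names):
    """Model of cascaded BZPS instructions over the registers listed in names."""
    running = 0  # the input base, hardwired to zero
    for name in names:
        running += regs[name]
        regs[name] = running


regs = {"R1": 1, "R2": 2, "R3": 3}
bzps_cascade(regs, ["R1", "R2", "R3"])
# regs is now {"R1": 1, "R2": 3, "R3": 6}
```

With a single register in the cascade the register is left unchanged, consistent with the no-op-like behavior of a single BZPS instruction.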
The preferred instruction for a base-zero suffix sum is expressed in instruction code using a syntax analogous to the syntax of the base-zero prefix sum:
BZSS Ri.
As in the case of the base-zero prefix sum, the base-zero suffix sum acquires significance only when it is cascaded.
A hybrid syntax is used to indicate that a suitably configured microprocessor is to perform several base-zero prefix sums in parallel. This syntax is of the form
BZPS Ii Rj.
Here, “BZPS” is the instruction name, Ii is an index, and Rj is a register. As before, any other syntax having the same meaning, and including an instruction label field and two operand fields, may be used instead. In cascaded two-operand BZPS instructions, the register fields indicate the registers holding the array elements to be summed, and the indices are used to distinguish between the different independent base-zero prefix sums.
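The role of the index field can be illustrated with the following Python sketch, which keeps one running sum per index so that interleaved two-operand BZPS instructions contribute to separate base-zero prefix sums. The per-index running-sum model and the name `run_indexed_bzps` are assumptions of the sketch, not features stated in the text.

```python
def run_indexed_bzps(regs, program):
    """Model of cascaded two-operand BZPS instructions given as (index, register) pairs."""
    running = {}  # one running sum per index, each starting from the zero base
    for index, name in program:
        running[index] = running.get(index, 0) + regs[name]
        regs[name] = running[index]


regs = {"R1": 1, "R2": 2, "R3": 10, "R4": 20}
program = [(1, "R1"), (2, "R3"), (1, "R2"), (2, "R4")]
run_indexed_bzps(regs, program)
# regs is now {"R1": 1, "R2": 3, "R3": 10, "R4": 30}
```

Here index 1 sums R1 and R2 while index 2 sums R3 and R4, and neither sum affects the other.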
As in the case of general prefix summation, the microprocessor, through static analysis and dynamic analysis, and through permuting the order of the instructions, recognizes a plurality of consecutive single BZPS instructions, which can comprise a multiple-BZPS instruction without changing the semantics of the original code. The BZPS instructions are decoded collectively as a base-zero prefix summation, if possible replacing groups of two or more consecutive single BZPS instructions by multiple-BZPS instructions. The preferred implementation of base-zero prefix summation is as a balanced binary tree in dedicated hardware.
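For readers unfamiliar with balanced binary tree prefix summation, the following Python sketch shows one common recursive formulation in software: adjacent pairs are summed on the way up the tree, the prefix sums of the pair sums are computed recursively, and the remaining elements are filled in on the way down. It is offered as an illustration of the general technique under the inclusive convention used for the BZPS registers above, and is not a description of the dedicated hardware; the name `tree_prefix_sum` is hypothetical.

```python
def tree_prefix_sum(y):
    """Inclusive base-zero prefix sums computed by recursive pairwise combination."""
    n = len(y)
    if n <= 1:
        return list(y)
    # Up the tree: sum adjacent pairs. These additions are mutually independent.
    pairs = [y[i] + y[i + 1] for i in range(0, n - 1, 2)]
    pair_sums = tree_prefix_sum(pairs)
    # Down the tree: odd positions take a pair prefix sum directly; even
    # positions add their own element to the preceding pair prefix sum.
    z = [0] * n
    z[0] = y[0]
    for i in range(1, n):
        if i % 2 == 1:
            z[i] = pair_sums[i // 2]
        else:
            z[i] = pair_sums[i // 2 - 1] + y[i]
    return z


result = tree_prefix_sum([1, 2, 3, 4, 5])  # [1, 3, 6, 10, 15]
```

Each level of the recursion halves the number of elements, so the depth of the tree grows logarithmically with the array length, which is the property that makes the approach attractive for short summations in hardware.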
The scope of the present invention also includes the application of prefix summation to a particular aspect of the coordination of multiple computational threads, that of guaranteeing conflict-free access to multiple single-ported register files, that is, making sure that different reads and writes to the same register files do not occur simultaneously.
B. R. Rau, C. D. Glaeser, and R. L. Picard (“Efficient code generation for horizontal architectures: compiler techniques and architectural support”, Proc. ISCA, pp. 131-139, 1982) use scratch pad register files for this purpose. In their approach, finding a viable schedule of operations depends on assigning each temporary value to a scratch pad in a way which is conflict-free. This is done by first forming “maximal compatibility classes” (MCCs), whereby two values are “compatible” if they are neither both read nor both written during the same cycle. All members of an MCC may be assigned to the same scratch pad. A “cover” is a set of MCCs whose union includes all the values that must be assigned. If the number of MCCs in a cover does not exceed the number of scratch pads, conflict-free access is guaranteed by assigning all the values in one MCC to the same scratch pad.
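The compatibility and cover conditions just described can be expressed as simple set checks. The Python sketch below assumes each value is described by the set of cycles in which it is read and the set of cycles in which it is written; this representation and all names are hypothetical, and the sketch checks only that each class is mutually compatible, not that it is maximal.

```python
from itertools import combinations


def compatible(a, b):
    """Two values are compatible if they share no read cycle and no write cycle."""
    return a["reads"].isdisjoint(b["reads"]) and a["writes"].isdisjoint(b["writes"])


def is_compatibility_class(names, values):
    """True if every pair of values in the class is compatible."""
    return all(compatible(values[p], values[q]) for p, q in combinations(names, 2))


def is_valid_cover(classes, values, num_scratch_pads):
    """True if the classes cover all values and fit in the available scratch pads."""
    covered = set()
    for names in classes:
        covered |= set(names)
    return (
        covered == set(values)
        and len(classes) <= num_scratch_pads
        and all(is_compatibility_class(names, values) for names in classes)
    )


values = {
    "a": {"writes": {0}, "reads": {2}},
    "b": {"writes": {1}, "reads": {2}},  # read in the same cycle as "a"
    "c": {"writes": {1}, "reads": {3}},  # written in the same cycle as "b"
}

good = is_valid_cover([{"a", "c"}, {"b"}], values, num_scratch_pads=2)  # True
bad = is_valid_cover([{"a", "b"}, {"c"}], values, num_scratch_pads=2)   # False
```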
A solution that requires a large number of registers was implemented by G. R. Beck, D. W. L. Yen, and T. L. Anderson (“The Cydra 5 minisupercomputer: architecture and implementation”, The Journal of Supercomputing, vol. 7 pp. 143-180, 1993) and by J. C. Dehnert and R. A. Towle (“Compiling for the Cydra 5”, The Journal of Supercomputing, vol. 7 pp. 181-228, 1993) in the Cydra 5 using a context register matrix, in which each functional unit has a dedicated row for writes, and each row can be read in parallel by each functional unit through a designated column. This structure permits conflict-free register reads and writes for every functional unit. A similar approach (S. P. Song, M. Denman, and J. Chang, “The PowerPC 604 RISC microprocessor”, IEEE Micro, vol. 14 no. 5 pp. 8-17, October 1994) is taken in the current generation of superscalar processors.
The method of the present invention for guaranteeing conflict-free access uses substantially less storage hardware than the context register matrix solution. The context register matrix solution requires a number of registers proportional to the square of the number of functional units and storage of multiple copies of variables. The method of the present invention requires only a number of registers proportional to the number of functional units. Furthermore, each scratch pad needs only one read port and only one write port. The method of the present invention works by allowing values to move among the scratch pads. In each clock cycle, all the values which are used reside in different scratch pads. However, to be useful, the method of the present invention requires a priori knowledge: in which future step will a value just read or updated be needed? In addition, the scope of the method of the present invention does not include communication hardware.