Software pipelining is a method for reordering the instructions of a loop in program code executed in a computing environment, in order to optimize overall execution. The method applies instruction scheduling techniques to overlap successive iterations of the loop efficiently, so that they execute in parallel in a multiprocessing computing environment.
A software pipelining scheme may execute a series of instructions from the loop in advance, where possible, while another series of instructions belonging to a previous phase of the pipeline executes concurrently. Pipelining thus allows look-ahead processing of certain values for a future stage of the loop while certain values for the current stage are still being processed.
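The overlap described above can be illustrated with a small sketch. It assumes a hypothetical three-stage loop body (load, compute, store) and hypothetical function names; none of these come from the text. Python runs the statements one after another, so the sketch models only the reordered schedule (prologue, steady-state kernel, epilogue), not actual parallel execution.

```python
def run_sequential(values):
    """Reference loop: each iteration performs load, compute, store in order."""
    out = []
    for v in values:
        loaded = v               # stage 1: load
        computed = loaded * 2    # stage 2: compute (hypothetical operation)
        out.append(computed)     # stage 3: store
    return out


def run_pipelined(values):
    """Same work, restructured so the load for iteration i + 1 is issued
    alongside the compute/store for iteration i."""
    out = []
    n = len(values)
    if n == 0:
        return out
    # Prologue: start iteration 0 (load only).
    loaded = values[0]
    # Kernel: look ahead to iteration i + 1 while finishing iteration i.
    for i in range(n - 1):
        next_loaded = values[i + 1]   # stage 1 of iteration i + 1
        out.append(loaded * 2)        # stages 2 and 3 of iteration i
        loaded = next_loaded
    # Epilogue: drain the last iteration.
    out.append(loaded * 2)
    return out
```

Both functions produce the same results; only the order in which the stages of neighboring iterations are issued differs.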
When a compiler software-pipelines a loop, some variables typically need to be assigned to several distinct registers to initiate and support the pipelining process. Because values of a single variable (e.g., variable X) are calculated concurrently by instructions at different stages of the loop, several registers, rather than a single register, need to be allocated to that variable. The number of registers allocated to a variable may be determined in advance by reviewing the code of the loop.
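The text does not say how the register count is derived. One common rule from the modulo scheduling literature, offered here only as an assumption, is that a value that stays live for a given number of cycles, in a loop that starts a new iteration every so many cycles (the initiation interval), needs the ceiling of their quotient in registers. The function and parameter names below are hypothetical.

```python
def registers_needed(lifetime_cycles, initiation_interval):
    """Estimate how many registers one variable needs in a pipelined loop.

    Assumption (not from the text): a new iteration starts every
    `initiation_interval` cycles, and each value of the variable stays
    live for `lifetime_cycles` cycles, so that many overlapping copies
    must be held at once, rounded up.
    """
    if lifetime_cycles <= 0 or initiation_interval <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division without floating point.
    return -(-lifetime_cycles // initiation_interval)
```

Under this assumption, a value live for four cycles in a loop that starts an iteration every cycle would need four registers, which matches the four-register example used for variable X.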
Two problems may arise in software pipelining. First, the system may run out of available registers. Second, accessing distinct registers explicitly requires inserting register copy instructions, unrolling the loop, or providing specially designated hardware, each of which can be costly in terms of overhead, as discussed in more detail below. For example, one method for managing and allocating the various registers is to use multiple scalar registers (e.g., 32-bit wide registers) to store the different values of a variable at different stages. If the value of a variable X is being calculated concurrently for various stages of the pipeline, then multiple scalar registers may be used to hold the various values.
Referring to FIG. 1(a), for example, four scalar registers SR1 through SR4 are illustrated, wherein each scalar register is allocated to hold one of the four values of variable X (i.e., X1, X2, X3, X4) at each stage of a pipeline. In this example, since the loop may be executed for more than four iterations, the four registers need to be updated in a rotating scheme, such that the oldest value is discarded from SR1 at each iteration and the values stored in the remaining registers (i.e., SR2, SR3 and SR4) are each moved over to the adjacent register.
Referring to FIG. 1(b), the value in SR2 is moved to SR1, thereby overwriting the value X1; the value in SR3 is moved to SR2; and the value in SR4 is moved to SR3, so that the last register SR4 is available for a newly calculated value of X (e.g., X5). As shown, X2, X3, X4, X5 represent the respective values of X stored in registers SR1 through SR4 after the four separate instructions MOVE, MOVE, MOVE, and COPY are executed to shift and copy the values among the registers.
Referring to FIG. 1(c), another set of four separate instructions (i.e., MOVE, MOVE, MOVE, COPY) needs to be executed to store the values of X for the next pipeline stage in registers SR1 through SR4. As shown, after these four instructions are executed, the values of X are shifted to the left by one to allow a new value X6 to be stored in SR4; the oldest value of X (i.e., X2) is discarded to make the shift to the left possible.
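The shifting scheme of FIGS. 1(a) through 1(c) can be modeled with a short sketch. The list stands in for registers SR1 through SR4, and the function name shift_in is hypothetical; the point is simply that every iteration costs one instruction per register.

```python
def shift_in(registers, new_value):
    """Apply one iteration of the shifting scheme in place.

    registers[0] models SR1 (oldest value) and registers[-1] models SR4.
    Returns the list of instructions that had to be issued.
    """
    issued = []
    # Move SR2 -> SR1 first, then SR3 -> SR2, then SR4 -> SR3, so that
    # no value is overwritten before it has been moved.
    for i in range(len(registers) - 1):
        registers[i] = registers[i + 1]
        issued.append("MOVE")
    # Place the newly calculated value in the last register.
    registers[-1] = new_value
    issued.append("COPY")
    return issued


regs = ["X1", "X2", "X3", "X4"]      # FIG. 1(a)
ops_b = shift_in(regs, "X5")         # FIG. 1(b)
state_b = list(regs)
ops_c = shift_in(regs, "X6")         # FIG. 1(c)
state_c = list(regs)
```

After the first call the list holds X2, X3, X4, X5, and after the second it holds X3, X4, X5, X6, with four instructions issued each time.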
Unfortunately, the above shifting scheme using a series of scalar registers is undesirable. Such a shifting scheme results in substantial overhead in memory management and execution resources, since it requires maintaining multiple scalar registers for each variable and executing multiple instructions at each iteration to shift or rotate the values among the registers. Rotating register files may instead be implemented in hardware. However, not all processors support rotating register files in hardware, as doing so may not be cost-effective overall.
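For contrast, the idea behind a rotating register file can be sketched as follows. The text does not describe any particular hardware design, so this is only a minimal model under an assumption: logical register numbers are mapped to physical slots through a rotating base index, so that no values need to be moved. The class and method names are hypothetical.

```python
class RotatingRegisters:
    """Toy model of a rotating register file (assumed design, not from the text)."""

    def __init__(self, initial):
        if not initial:
            raise ValueError("at least one register is required")
        self._phys = list(initial)   # physical slots; contents are never shifted
        self._base = 0               # physical index of logical register 0

    def read(self, logical):
        """Read a logical register (0 = oldest value)."""
        return self._phys[(self._base + logical) % len(self._phys)]

    def rotate_in(self, new_value):
        """Replace the oldest value and rotate with a single write."""
        self._phys[self._base] = new_value             # overwrite oldest slot
        self._base = (self._base + 1) % len(self._phys)  # rename, no MOVEs

    def snapshot(self):
        """Return the values in logical order, oldest first."""
        return [self.read(i) for i in range(len(self._phys))]
```

In this model each iteration performs one write and one base update regardless of how many registers the variable occupies, which is the saving that dedicated hardware buys at the price of added complexity.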
As such, the current schemes (e.g., loop unrolling and a hardware implementation of the rotating scheme) have drawbacks in that they result in increased code size, reduced performance, or increased hardware complexity. Methods and systems are needed that can overcome these shortcomings.