The present invention pertains generally to pipelined microprocessors, and pertains more particularly to methods and microprocessor structures for executing software pipelined loops.
A microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical operations of a computer on a single chip. A digital signal processor (DSP) is a microprocessor optimized to handle large volumes of data efficiently. Such processors are central to the operation of many of today's electronic products, such as high-speed modems, high-density disk drives, digital cellular phones, and complex automotive systems, and will enable a wide variety of other digital systems in the future. The demands placed upon DSPs in these environments continue to grow as consumers seek increased performance from their digital products.
Designers have succeeded in increasing the performance of DSPs generally by increasing clock frequencies, by removing architectural bottlenecks in DSP circuit design, by incorporating multiple execution units on a single processor circuit, and by developing optimizing compilers that schedule operations to be executed by the processor in an efficient manner. As further increases in clock frequency become more difficult to achieve, designers have embraced the multiple execution unit processor as a means of achieving enhanced DSP performance. For example, FIG. 1 shows a block diagram of a DSP execution unit and register structure having eight execution units, L1, S1, M1, D1, L2, S2, M2, and D2. These execution units operate in parallel to perform multiple operations, such as addition, multiplication, addressing, logic functions, and data storage and retrieval, simultaneously.
Theoretically, the performance of a multiple execution unit processor is proportional to the number of execution units available. However, utilization of this performance advantage depends on the efficient scheduling of operations such that most of the execution units have a task to perform each clock cycle. Efficient scheduling is particularly important for looped instructions, since in a typical runtime application the processor will spend the majority of its time in loop execution.
Traditionally, the compiler is the piece of software that performs the scheduling operations. It translates source code, such as C, BASIC, or FORTRAN, into a binary image that actually runs on a machine. Typically the compiler consists of multiple distinct phases. One phase, referred to as the front end, is responsible for checking the syntactic correctness of the source code. If the compiler is a C compiler, for example, the front end must make sure that the code is legal C code. There is also a code generation phase, and the interface between the front end and the code generator is a high-level intermediate representation. The high-level intermediate representation is a more refined series of instructions that need to be carried out. For instance, a loop might be coded at the source level as: for (I = 0; I < 10; I = I + 1), which might in fact be broken down into a series of steps, e.g., each time through the loop, first load I and check it against 10 to decide whether to execute the next iteration.
The code generator takes this high-level intermediate representation and transforms it into a low-level intermediate representation, which is closer to the actual instructions that the computer understands. An optimizer component of a compiler must preserve the program semantics (i.e., the meaning of the instructions that are translated from source code to a high-level intermediate representation, and thence to a low-level intermediate representation and ultimately an executable file), but it rewrites or transforms the code in a way that allows the computer to execute an equivalent set of instructions in less time.
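The breakdown of a counted loop into explicit steps can be illustrated with a minimal sketch. It assumes a ten-iteration loop; the function name lowered_loop and the body callback are hypothetical and serve only to show the load, compare, and branch steps that an intermediate representation makes explicit.

```python
def lowered_loop(body):
    """Run body(i) for i = 0..9 using explicit load/compare/branch steps."""
    i = 0                      # initialize the loop variable
    while True:
        current = i            # "load" the loop variable
        if not (current < 10): # compare it against the bound
            break              # branch out of the loop
        body(current)          # loop body
        i = current + 1        # increment and store the loop variable
    return i


visited = []
final_value = lowered_loop(visited.append)
print(visited, final_value)
```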
Source programs translated into machine code by compilers contain loops, e.g., DO loops, FOR loops, and WHILE loops. Optimizing the compilation of such loops can have a major effect on the run-time performance of the program generated by the compiler. In some cases, a significant amount of time is spent on bookkeeping functions such as loop iteration and branching, as opposed to the computations that are performed within the loop itself. These loops often implement scientific applications that manipulate large arrays and data instructions, and run on high-speed processors. This is particularly true on modern processors, such as RISC architecture machines. These processors are designed such that, in general, arithmetic operations execute much faster than memory fetch operations. This mismatch between processor and memory speed is a very significant factor in limiting the performance of microprocessors. Also, branch instructions, both conditional and unconditional, have an increasing effect on the performance of programs. This is because most modern architectures are super-pipelined and implement some form of branch prediction algorithm. The aggressive pipelining makes the branch misprediction penalty very high. Arithmetic instructions are inter-register instructions that can execute quickly, while branch instructions, because of mispredictions, and memory instructions such as loads and stores, because of slower memory speeds, can take longer to execute.
One effective way in which looped instructions can be arranged to take advantage of multiple execution units is with a software pipelined loop. In a conventional scalar loop, all instructions execute for a single iteration before any instructions execute for following iterations. In a software pipelined loop, the order of operations is rescheduled such that one or more iterations of the original loop begin execution before the preceding iteration has finished. Referring to FIG. 7, a simple scalar loop containing 20 iterations of the loop of instructions A, B, C, D and E is shown. FIG. 8 depicts an alternative execution schedule for the loop of FIG. 7, where a new iteration of the original loop is begun each clock cycle. For clock cycles I4-I19, the same instruction (An, Bn−1, Cn−2, Dn−3, En−4) is executed each clock cycle in this schedule. If multiple execution units are available to execute these operations in parallel, the code can be restructured to perform this repeated instruction in a loop. The repeating pattern of A, B, C, D, E (along with loop control operations) thus forms the loop kernel of a new, software pipelined loop that executes the instructions at clock cycles I4-I19 in 16 iterations. The instructions executed at clock cycles I1 through I3 of FIG. 8 must still be executed first in order to properly "fill" the software pipelined loop; these instructions are referred to as the loop prolog. Likewise, the instructions executed at clock cycles I20 and I23 of FIG. 8 must still be executed in order to properly "drain" the software pipeline; these instructions are referred to as the loop epilog (note that in many situations the loop epilog may be deleted through a technique known as speculative execution).
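The schedule described above can be reproduced with a short sketch. It assumes that each of the five instructions occupies one clock cycle, that a new iteration starts every cycle, and that cycles are numbered from zero; this numbering is the sketch's own and may not match the labels in the figures. The function names are hypothetical.

```python
STAGES = ["A", "B", "C", "D", "E"]  # one instruction per pipeline stage
ITERATIONS = 20                     # iterations of the original scalar loop


def pipelined_schedule(stages, iterations):
    """Return one list per clock cycle of (instruction, iteration) pairs."""
    depth = len(stages)
    schedule = []
    for cycle in range(iterations + depth - 1):
        issued = []
        for stage_index, name in enumerate(stages):
            iteration = cycle - stage_index  # iteration reaching this stage now
            if 0 <= iteration < iterations:
                issued.append((name, iteration + 1))  # iterations count from 1
        schedule.append(issued)
    return schedule


def split_phases(schedule, depth):
    """Split a schedule into prolog, kernel, and epilog cycles."""
    full = [i for i, cycle in enumerate(schedule) if len(cycle) == depth]
    first, last = full[0], full[-1]
    return schedule[:first], schedule[first:last + 1], schedule[last + 1:]


schedule = pipelined_schedule(STAGES, ITERATIONS)
prolog, kernel, epilog = split_phases(schedule, len(STAGES))
print(len(schedule), len(prolog), len(kernel), len(epilog))
print(kernel[0])
```

Under these assumptions the pipelined loop completes in 24 cycles: 4 prolog cycles that fill the pipeline, 16 kernel cycles in which all five instructions issue together, and 4 epilog cycles that drain it.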
The simple example of FIGS. 7 and 8 illustrates the basic principles of software pipelining, but other considerations such as dependencies and conflicts may constrain a particular scheduling solution. For an explanation of software pipelining in more detail, see Vicki H. Allan, Software Pipelining, 27 ACM Computing Surveys 367 (1995).
One disadvantage of software pipelining is the need for a specialized loop prolog for each loop. The loop prolog explicitly sequences the initiation of the first several iterations of a pipeline, until the steady-state loop kernel can be entered (this is commonly called "filling" the pipeline). Steady-state operation is achieved only after every instruction in the loop kernel will have valid operands if the kernel is executed. As a rule of thumb, the loop kernel can be executed in steady state after k = I − m clock cycles, where I represents the number of clock cycles required to complete one iteration of the pipelined loop, and m represents the number of clock cycles contained in one iteration of the loop kernel (this formula must generally be modified if the kernel is unrolled).
Given this relationship, it can be appreciated that as the cumulative pipeline delay required by a single iteration of a pipelined loop increases, corresponding increases in loop prolog length are usually observed. In some cases, the loop prolog code required to fill the pipeline may be several times the size of the loop kernel code. As code size can be a determining factor in execution speed (shorter programs can generally use on-chip program memory to a greater extent than longer programs), long loop prologs can be detrimental to program execution speed. An additional disadvantage of longer code is increased power consumption, since memory fetching generally requires far more power than CPU core operation.
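The rule of thumb k = I − m can be worked through numerically. The sketch below is illustrative only: the function name prolog_cycles is hypothetical, and the second set of values (a 12-cycle iteration with a 2-cycle kernel) is an assumed example rather than a figure from this disclosure.

```python
def prolog_cycles(iteration_cycles, kernel_cycles):
    """Rule-of-thumb prolog length k = I - m for a kernel that is not unrolled."""
    if kernel_cycles <= 0 or iteration_cycles < kernel_cycles:
        raise ValueError("expected 0 < m <= I")
    return iteration_cycles - kernel_cycles


# Five single-cycle instructions with a one-cycle kernel: I = 5, m = 1.
k_simple = prolog_cycles(5, 1)

# Assumed deeper pipeline: I = 12, m = 2.
k_deep = prolog_cycles(12, 2)

print(k_simple, k_deep, k_deep / 2)
```

In the assumed deeper case the prolog spans 10 cycles against a 2-cycle kernel, which illustrates how the prolog can grow to several times the kernel length as the per-iteration pipeline delay increases.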
One solution to the problem of long loop prologs is to "prime" the loop. That is, the prolog is removed and the loop is executed more times. To do this, certain instructions, such as stores, should not execute the first few times the loop is executed, but should instead execute the last time the loop is executed. This could be accomplished by making those instructions conditional and allocating a new counter for every group of instructions that should begin executing on each particular loop iteration. This, however, adds instructions for the decrement of each new loop counter, which could lower loop performance. It also increases code size and places extra register pressure on both general purpose registers and conditional registers. Because of these problems, priming a software pipelined loop is not always possible or desirable.
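Priming can be illustrated with a minimal sketch, assuming a hypothetical four-stage loop (load, multiply by 2, add 1, store) in which every stage issues on every pass of the kernel. Each stage group has its own start-delay counter, and the sketch counts the extra decrement operations those counters require. The bounds check at the tail of the loop is a stand-in for whatever mechanism keeps early stages from running past the end of the data; all names are hypothetical.

```python
def run_primed(data):
    """Execute a primed four-stage pipelined loop with no separate prolog."""
    n = len(data)
    stages = 4                    # load, multiply, add, store
    out = [None] * n
    regs = [None] * stages        # regs[s]: value stage s produced on its last pass
    delay = list(range(stages))   # stage s must sit out the first s passes
    decrements = 0                # extra counter-maintenance operations

    for kernel_pass in range(n + stages - 1):
        # Later stages run first so each reads the value from the previous pass.
        for s in reversed(range(stages)):
            if delay[s] > 0:      # conditional instruction: not yet enabled
                delay[s] -= 1
                decrements += 1
                continue
            i = kernel_pass - s   # iteration this stage is working on
            if i >= n:
                continue          # nothing left for this stage at the tail
            if s == 0:
                regs[0] = data[i]         # load
            elif s == 1:
                regs[1] = regs[0] * 2     # multiply
            elif s == 2:
                regs[2] = regs[1] + 1     # add
            else:
                out[i] = regs[2]          # store
    return out, decrements


result, extra_ops = run_primed([1, 2, 3, 4, 5])
print(result, extra_ops)
```

The kernel here runs n + 3 times instead of n − 3 times after a prolog, and the six counter decrements, together with the registers holding the counters, represent the overhead noted above.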
The present invention seeks to reduce code size, power consumption, and processing delays by eliminating at least a portion of the pre-compiled loop prolog and loop kernel code required by prior art software pipelining. The present invention eliminates explicit repetitive loop prolog instructions by "compiling" or "building" the loop kernel on the fly, while at the same time executing prolog instructions in a looped fashion. A microprocessor according to the present invention thus comprises hardware that can accomplish this "building" function while looping through and executing previously issued instructions.
In one aspect, the present invention provides a multiple execution unit microprocessor comprising an instruction pipeline, a loop cache memory in communication with the instruction pipeline, and a loop cache controller. The instruction loop cache memory is capable of storing and retrieving up to N cycles of execution unit instructions for each execution unit. The loop cache memory is capable of selectively issuing instructions to the execution units from the instruction pipeline and from the loop cache memory. The loop cache controller preferably enables the instruction loop cache memory in response to a software command, and controls storage and retrieval of instructions from the loop cache. The controller preferably also detects exit conditions for the cached loop. It is also preferable that the controller function uses the microprocessor program counter. The controller may also stall and restart the instruction pipeline of the microprocessor appropriately in response to loop caching commands.
In another aspect, the present invention provides a method for operating a multiple execution unit processor having a loop cache memory and an instruction pipeline. The method comprises the steps of building a software pipelined loop kernel in the loop cache memory, stalling the instruction pipeline, executing at least one iteration of the loop kernel using the kernel instructions in the loop cache memory, detecting a loop exit condition, and resuming the instruction pipeline. Preferably, instructions may be issued from the loop cache memory while the loop kernel is being built, thus allowing reductions in code size.
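The sequence of method steps can be pictured with a toy software model. It is a sketch under stated assumptions and not a description of the claimed hardware: it assumes one new kernel instruction arrives from the instruction pipeline per clock cycle, it models the loop exit condition as a simple countdown, and it omits the epilog. The function and variable names are hypothetical.

```python
def run_with_loop_cache(kernel_instrs, iterations):
    """Toy model: build a kernel in a loop cache, run it, then resume."""
    depth = len(kernel_instrs)
    if iterations <= depth:
        raise ValueError("sketch assumes at least one pass issued purely from the cache")

    loop_cache = []            # accumulates the kernel, one instruction per cycle
    issued = []                # instructions issued in each clock cycle
    fetched_from_pipeline = 0  # instructions taken from the instruction pipeline

    # Build the kernel while issuing: cached instructions plus the new one.
    for instr in kernel_instrs:
        fetched_from_pipeline += 1
        loop_cache.append(instr)
        issued.append(tuple(loop_cache))

    # Stall the instruction pipeline; the kernel now issues from the cache only.
    pipeline_stalled = True

    # The last build cycle already issued one full kernel pass.
    passes_left = (iterations - depth + 1) - 1
    while passes_left > 0:     # loop exit condition: countdown reaches zero
        issued.append(tuple(loop_cache))
        passes_left -= 1

    # Resume the instruction pipeline.
    pipeline_stalled = False
    return issued, fetched_from_pipeline, pipeline_stalled


issued, fetched, stalled = run_with_loop_cache(["A", "B", "C", "D", "E"], 20)
full_passes = sum(1 for cycle in issued if len(cycle) == 5)
print(len(issued), full_passes, fetched, stalled)
```

In this model only five instructions are fetched from the instruction pipeline, whereas an explicit prolog for the same five-stage loop would occupy ten additional instruction slots (1 + 2 + 3 + 4) ahead of the kernel.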
One advantage of the current invention is the power saved because fewer stored values are pulled from memory.
A second advantage of the current invention is the power saved because, when memory loads are made, they are made from a cache rather than from memory.
A third advantage of the current invention is the reduction in the amount of code necessary to effect the loop instructions.