1. Field of the Invention
The invention relates to methods and apparatus for improving the scheduling of instructions in a pipelined loop. In particular, the invention relates to methods and apparatus for implementing variable length delay instructions.
2. Description of Related Art
Generally, a software pipelined loop implements multiple loop iterations, which execute in parallel. This is accomplished by exploiting the fine-grained, instruction level parallelism (IPL) that is available across the branches of a loop. The goal of software pipelining is to maximize the number of loop iterations executing in parallel and, thereby, increase loop throughput. Software pipelining is an instruction scheduling technique. Pipelining is a method for executing instructions in an assembly-line fashion. It is a design technique for reducing the effective propagation delay per series of stages, each of which performs a portion of the operation. A series of data is typically clocked through the pipeline in sequential fashion, advancing one stage per clock period. A scheduling algorithm for use in software pipelining constructs a legal instruction schedule.
The rate at which new loop iterations are started is called the iteration interval (II). The minimum iteration interval (MII) is the lower bound on the iteration interval for which a valid schedule may be established. Modulo scheduling is a known technique for establishing software pipelined-loops. In modulo scheduling, the schedule for a single iteration is divided into a sequence of stages, each having the length of the iteration interval. In the steady state execution of the software pipelined-loop, each of the stages executes parallel.
The instruction schedule for a software-pipelined loop has at least three (3) components: a kernel, a prolog, and an epilog. The kernel is the instruction schedule that executes in the steady state. In the kernel, an instruction scheduled at the kth cycle executes in parallel with all the instructions scheduled at the kth cycle modulo iteration interval. This is known as the modulo constraint, and this constraint defines the basic feature of xe2x80x9cmodulo scheduling.xe2x80x9d The prolog and epilog are the instruction schedules that setup, i.e., xe2x80x9cprimexe2x80x9d the software pipelines, and conclude the execution of the loop kernel, i.e., xe2x80x9cdrainxe2x80x9d the software pipelines, respectively.
Referring to FIG. 1, an example of modulo scheduling is depicted. A, B, C, D, and E are the stages of a loop iteration. In the steady state, all five (5) stages execute in parallel. However, each executes from a different iteration, i.e., at a different stage with respect to the previous iteration. A stage in the software pipeline is completed after the cycles required for each iteration interval (II) are completed. A single iteration produces a predetermined result in the time required to complete five (5) stages. In the steady state of the software-pipelined loop, however, a result is available upon completion of each stage, i.e., after the cycles required to complete each iteration interval. This is one motivation for performing software pipelining. The minimum iteration interval required for a software pipelined-loop is determined by resource and dependency constraints.
The lifetime of any value in any such loop is the number of cycles between the placement in the schedule for the operation that defines a particular value (the xe2x80x9cdefxe2x80x9d) and the last operation that uses that value (the xe2x80x9clast usexe2x80x9d). The value must be xe2x80x9calive,xe2x80x9d i.e., available for recovery, until its last use. A schedule is invalidated if a value has a lifetime that is greater than iteration interval. If a value xe2x80x9clivesxe2x80x9d too long, i.e., its lifetime is greater than the iteration interval, an operation in a successive iteration executing in parallel with the current iteration may overwrite the value before its last use in the current iteration. Referring to FIG. 2, two (2) iterations are shown from the execution trace of a software-pipelined loop. Each Si is an operation that computes a result in one cycle. Assuming an architecture similar to the c6x architecture, available from Texas Instruments, Inc., of Dallas, Tex., a variable may be written and read on the same cycle, such that the read step logically occurs before the write step. The operations depicted in FIG. 2 are annotated to indicate whether they define or use the variable v.
A register is a small area of a high-speed memory, located within a processor or electronic device that is used for temporarily storing information or instructions. Each register is given a name, contains a few bytes of information and is referenced by programs. Values (v1, v2, v3, . . . ) generated in each loop iteration exist sequentially in a single register.
Referring again to FIG. 2, however, the lifetimes of values v1 and v2 are depicted as overlapping. This overlapping causes v2 to overwrite (or kill) v1 before the last values are used. Such excessive lifetimes may occur as an artifact of the scheduling algorithm or due to data dependencies that split and then join in the data precedence graph for the loop block. The values for v1 and v2 (actually v1, v2, v3, . . . , vN; where N equals the number of loop iterations, i.e., the loop trip count) are sharing the same register. However, only one value v# may exist on the register at a time.
This problem is experienced in VLIW architectures. Moreover, this problem is experienced in architectures that implement and support instruction level parallelism and use instruction scheduling techniques, such as software pipelining, to exploit parallelism.
One technique for solving the problem of excessive lifetimes is to insert move instructions into the loop in order to write the value to another register before the value is overlapped by a successive iteration. Nevertheless, the lifetime of a register value may be so long that it may take multiple move instructions to split the register value into multiple live ranges and thereby avoid overlapping. Moreover, insertion of these additional move instructions may reduce the performance of the loop and may further complicate the scheduling and register allocation processes. Another technique is modulo variable expansion. In this technique, the kernel is unrolled, and values with overlapping lifetimes are renamed.
Pipelined loops may be especially important for the efficient and effective programming of digital signal processors (DSPs), such as those used extensively in the telecommunications industry. In DSPs, operating time and information storage are critical constraints on programming. In particular, information storage in registers and memory is extremely limited, and allocation of these resources limits the number of operations, which may be performed simultaneously. Further, the need to store information to and to retrieve information memory imposes significant time constraints on DSP operation. The time which a DSP spends xe2x80x9cwaitingxe2x80x9d for the retrieval of information from memory limits the speed with which the DSP may complete operations and, therefore, may limit the total number of operations which the DSP may execute within a given period of time.
Thus, a need has arisen for a method for eliminating, reducing, or compensating for the problem of excessive and overlapping value lifetimes by writing information (values) to another register before the information (values) is (are) overwritten by a successive iteration. Further, a need has arisen for a method of writing information one register to another register with a variable delay operand. Further, a need has arisen for a method and apparatus for delaying the use of information without storing it to memory and retrieving it or holding it for a prolonged period within a single register.
An illustrative embodiment of the present invention seeks to provide a microprocessor, and method for operating a microprocessor, that improves digital signal processing performance. Aspects of the invention are specified in the claims. In an embodiment of the present invention, a digital signal processor is provided which has improvements over prior art devices including additional instructions that reduce code size and increase register flexibility.
In an embodiment, the invention is a method for implementing a variable length delay instruction. The method comprises the steps of defining an operation, which may include any type of machine instruction (e.g., MV, ADD, MPY, and the like); determining a delay between the defined operation and a future using operation; and implementing a queuing or delaying means, which may be part of the defined operation or a separate operation (or apparatus). Depending on the nature of the defined operation, the method may further include the steps of defining source operands, memories, or registers and defining a destination operand, memory, or register.
In another embodiment, the invention is a method for implementing a variable length delay instruction with respect to a move instruction. The method includes the steps of designating a source register for holding information and designating a destination register for retrieving the information. A first number of cycles before retrieval of the information to the destination register then is determined, and the information is transferred from the source register to delaying device, such as queuing device, for the first number of cycles. Finally, the information is written from the delaying device to the destination register.
For example, a new move instruction writes information from one register to another register with a variable delay operand. Generally, operands are the parts of an instruction that designates where a central processing unit (CPU) will fetch or store information. Specifically, the move instruction may be of the form: MVsrc_reg, dst_reg, delay. By this instruction, information or a value from the source register, src_reg, is written to the destination register, dst_reg, but the arrival of the information or value from the source register at the destination register is delayed by a predetermined number of cycles, delay.
In a further embodiment, the invention is an apparatus for implementing variable length delay instructions. The apparatus includes an input line for reading information from a defining operation or a source register; delaying device for receiving the information read from the defining operation or the source register; a multiplexer; and a select line. A trigger signal is transmitted to the multiplexer via the select line, thereby instructing the multiplexer to write the information to a using operation or a destination register.
In yet another embodiment, the invention is an apparatus for implementing variable length delay instructions. The apparatus includes an input line for reading information from a defining operation or a source register; a queue of serially-connected, holding devices for receiving the information read from the defining operation the source register; a multiplexer aligned in parallel with the serially-connected, holding devices; and a select line. A trigger signal is transmitted to the multiplexer via the select line, thereby instructing the multiplexer to write the information to a using operation or a destination register.
In still another embodiment, the invention is an apparatus for implementing variable length delay instructions. The apparatus includes an input line for reading information from a defining operation or a source register; a plurality of two-way multiplexers, each of the multiplexers placed between a pair of holding devices; and a select line. A trigger signal is transmitted to each of the plurality of multiplexers, thereby instructing at least one of the multiplexers to write the information to a using operation or a destination register.
It is an advantage of this technique that the loop scheduler may generate improved or optimal software pipelines. It is a feature of this technique that while it may not prevent excessive lifetime values, it reduces or minimizes the number of move instructions required to prevent overlapping lifetimes. It is a further feature of this technique that the delay mechanism may be used to minimize the register information (value) lifetimes.
It is a further advantage of this invention that it avoids the complexities of implementing the hardware of a rotating register file to modify the register file or modulo variable expansion for unrolling the loop kernel and renaming the register values that overlap.
Other objects, features, and advantages will be apparent to persons skilled in the art by the following detailed description and the accompanying drawings.