The present invention is directed, in general, to digital signal processors (DSPs) and, more specifically, to an instruction queue for executing and retiring instructions in a DSP.
Over the last several years, DSPs have become an important tool, particularly in the real-time modification of signal streams. They have found use in all manner of electronic devices and will continue to grow in power and popularity.
As time has passed, greater performance has been demanded of DSPs. In most cases, performance increases are realized by increases in speed. One approach to improve DSP performance is to increase the rate of the clock that drives the DSP. As the clock rate increases, however, the DSP's power consumption and temperature also increase. Increased power consumption is expensive, and intolerable in battery-powered applications. Further, high circuit temperatures may damage the DSP. The DSP clock rate may not increase beyond a threshold physical speed at which signals may traverse the DSP. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional DSPs.
An alternate approach to improve DSP performance is to increase the number of instructions executed per clock cycle by the DSP ("DSP throughput"). One technique for increasing DSP throughput is pipelining, which calls for the DSP to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the DSP as a whole to become faster.
"Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a DSP in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
Another technique for increasing overall DSP speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), DSP throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular DSP architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the DSP is theoretically tripled.
These techniques are not mutually exclusive; DSPs may be both superpipelined and superscalar. However, operation of such DSPs in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of DSP resources, creating interruptions ("bubbles" or "stalls") in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the DSP ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the DSP's architecture.
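The n-fold throughput figure is an idealized limit, and a short calculation makes the assumption behind it visible. The following Python sketch is illustrative only: the function names are hypothetical, every stage is assumed to take exactly one clock cycle, and no stalls or dependencies are modeled. It counts the clock cycles needed to complete a stream of instructions with and without pipelining.

```python
def cycles_unpipelined(num_instructions: int, stages: int) -> int:
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return num_instructions * stages


def cycles_pipelined(num_instructions: int, stages: int) -> int:
    """The first instruction needs `stages` cycles to fill the pipeline;
    every later instruction completes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


def speedup(num_instructions: int, stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (ideal, stall-free)."""
    return cycles_unpipelined(num_instructions, stages) / cycles_pipelined(
        num_instructions, stages
    )


if __name__ == "__main__":
    for count in (6, 100, 1000):
        print(count, cycles_unpipelined(count, 6), cycles_pipelined(count, 6),
              round(speedup(count, 6), 2))
```

Under these assumptions the speedup approaches, but never reaches, the stage count n as the instruction stream grows, because the pipeline must first be filled.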
The speed at which a DSP can perform a desired task is also a function of the number of instructions required to code the task. A DSP may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a DSP can perform a desired task, both the number of instructions used to code the task as well as the number of clock cycles required to execute each instruction should be minimized.
Among superscalar DSPs, some execute instructions in order (so-called "in-order issue" DSPs). In such DSPs, each instruction is written into the slots of a register within an instruction queue of an instruction logic circuit and marked with a "tag" to identify the order of the instructions. Typically, such tags are numerically arranged to specify only the order that the instructions are written into the registers, and not the order of execution of the instructions. At each clock cycle, one or more instructions within the registers are executed ("grouped") in accordance with grouping rules embedded in the DSP. After being grouped, if an instruction is no longer needed, it is simply overwritten ("retired").
Unfortunately, in even the most advanced DSPs found in the prior art, the re-ordering of instructions within the registers of an instruction logic circuit suffers from significant problems. If some instructions within the instruction register are grouped and retired in a given clock cycle in an order that differs from the order in which the instructions were originally written into the register, the remaining instructions are re-ordered within the individual slots of the register.
To illustrate this point, if four instructions are written into four consecutive slots, the instructions are conventionally identified by four consecutive tags associated with the slots. If only the first and third instructions are grouped and retired in a first clock cycle, the second and fourth instructions are re-ordered, or "shifted," within the slots. More specifically, the second and fourth instructions are shifted into the first and second slots, and are thus associated with the first and second tags. Those skilled in the art understand that, since each instruction may comprise a number of data bits, shifting remaining instructions from slot to slot within a register so that they are associated with appropriate tags requires shifting a relatively large number of bits after each clock cycle.
The shifting of such large numbers of bits after each clock cycle typically leads to routing congestion within the instruction queue. In addition, shifting a large number of bits may also result in other routing problems, due primarily to a combination of the complexity of the routing circuit employed and the number of bits shifted. Of course, such routing congestion and other problems typically lead to undesired timing delays within the DSP.
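The four-instruction example above can be modeled in software to show how much data moves. The following Python sketch is a behavioral illustration only, not a description of any particular prior-art circuit: the function name `retire_and_shift` is hypothetical, and the 32-bit instruction width is an assumed value chosen solely to make the bit count concrete.

```python
INSTRUCTION_BITS = 32  # assumed instruction width, for illustration only


def retire_and_shift(slots, retired_positions):
    """Model a conventional queue in which tags are tied to slot positions.

    `slots` is a list of instructions (None marks an empty slot).
    Surviving instructions are compacted toward slot 0 so that slot order
    again matches tag order. Returns (new_slots, bits_moved).
    """
    survivors = [
        (pos, ins)
        for pos, ins in enumerate(slots)
        if ins is not None and pos not in retired_positions
    ]
    new_slots = [None] * len(slots)
    bits_moved = 0
    for new_pos, (old_pos, ins) in enumerate(survivors):
        new_slots[new_pos] = ins
        if new_pos != old_pos:
            # The whole instruction word travels from one slot to another.
            bits_moved += INSTRUCTION_BITS
    return new_slots, bits_moved


if __name__ == "__main__":
    queue = ["I1", "I2", "I3", "I4"]
    # Retire the first and third instructions (positions 0 and 2).
    print(retire_and_shift(queue, {0, 2}))
```

In this model, retiring the first and third instructions moves both survivors, so two full instruction words are rerouted in a single cycle; the cost grows with both instruction width and queue depth.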
Accordingly, what is needed in the art is an instruction queue for a DSP or other processor that consumes less power than those found in the prior art.
To address the above-discussed deficiencies of the prior art, the present invention provides, for use in an instruction queue having a plurality of instruction slots, a mechanism for queueing and retiring instructions. In one embodiment, the mechanism includes a plurality of tag fields corresponding to the plurality of instruction slots, and control logic, coupled to the tag fields, that assigns tags to the tag fields to denote an order of instructions in the instruction slots. In addition, the mechanism includes a tag multiplexer, coupled to the control logic, that changes the order by reassigning only the tags.
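The idea of changing the order by reassigning only the tags can be sketched as a software model. The following Python class is a minimal behavioral illustration under stated assumptions, not the claimed hardware: the class name `TaggedQueue`, its method names, and the policy of writing a new instruction into the lowest-numbered free slot are all hypothetical. Instructions stay in the slot where they were written; only the small tag values change on retirement.

```python
class TaggedQueue:
    """Behavioral model: instructions stay put, tags carry the order."""

    def __init__(self, num_slots: int = 6):
        self.slots = [None] * num_slots  # instruction storage, never shifted
        self.tags = [None] * num_slots   # tag field per slot (None = empty)

    def enqueue(self, instruction) -> int:
        """Write into the first free slot and give it the next tag.
        Raises ValueError if no slot is free."""
        free = self.slots.index(None)
        next_tag = sum(tag is not None for tag in self.tags)
        self.slots[free] = instruction
        self.tags[free] = next_tag
        return free

    def retire(self, slot_indices) -> None:
        """Free the given slots, then renumber the surviving tags so they
        are consecutive again. No instruction changes slot."""
        for index in slot_indices:
            self.slots[index] = None
            self.tags[index] = None
        live = sorted(
            (tag, index) for index, tag in enumerate(self.tags) if tag is not None
        )
        for new_tag, (_, index) in enumerate(live):
            self.tags[index] = new_tag

    def in_order(self):
        """Return the queued instructions in tag order."""
        live = sorted(
            (tag, index) for index, tag in enumerate(self.tags) if tag is not None
        )
        return [self.slots[index] for _, index in live]


if __name__ == "__main__":
    queue = TaggedQueue()
    for name in ("I1", "I2", "I3", "I4"):
        queue.enqueue(name)
    queue.retire([0, 2])   # retire the first and third instructions
    queue.enqueue("I5")    # reuses freed slot 0 with the next tag
    print(queue.slots, queue.tags, queue.in_order())
```

In this model, retiring the first and third of four instructions leaves the second and fourth in their original slots and rewrites only two tag values, which for a six-slot queue need be only a few bits wide.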
In one embodiment of the present invention, the mechanism further includes loop detection logic, coupled to the control logic, that prevents ones of the instructions that are in a loop from being retired. In addition, the loop detection logic may prevent multiple instructions that are in a loop from being retired.
In one embodiment of the present invention, the mechanism further includes an input ordering multiplexer coupled to the control logic and to the plurality of instruction slots and configured to write the instructions into the plurality of instruction slots. In a related embodiment, the mechanism further includes an output ordering multiplexer coupled to the control logic and to the plurality of instruction slots and configured to retire at least one of the instructions.
In one embodiment of the present invention, the instruction slots of the mechanism number six and the tags number six. Of course, the mechanism may include any number of instruction slots and tags.
In one embodiment of the present invention, the control logic further determines an order of at least one new instruction and causes the at least one new instruction to be written into at least one of the plurality of instruction slots. In another embodiment, the control logic determines an order of multiple new instructions and causes the multiple new instructions to be written into appropriate corresponding instruction slots.
In one embodiment of the present invention, the control logic retires all of the instructions in response to receipt of a mispredict signal.
The foregoing has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.