1. Field of the Invention
This invention relates to processor design and, more particularly, to a hardware looping mechanism configured to provide zero-overhead looping when executing any number and/or type of discontinuity instruction.
2. Description of the Related Art
The following descriptions and examples are not admitted to be prior art by virtue of their inclusion within this section.
A typical processor involves various functional units that receive instructions from, for example, memory and operate on those instructions to produce results that are stored back into the memory or dispatched to an input/output device. To operate on a single instruction, a processor may fetch and decode the instruction, assemble its operands, perform the operations specified by the instruction and write the results back to memory. The execution of instructions may be controlled by a clock signal, whose period may be referred to as the “processor cycle time”.
The amount of time taken by a processor to execute a program may be determined by several factors including: (i) the number of instructions required to execute the program, (ii) the average number of processor cycles required to execute an instruction, and (iii) the processor cycle time. Processor performance may be improved by reducing one or more of the above-mentioned factors. For example, processor performance is often increased by overlapping the steps of multiple instructions, using a technique called “pipelining.” To pipeline instructions, the various steps of instruction execution are performed by independent units called “pipeline stages”. The result of each pipeline stage is communicated to the next pipeline stage via a register (or latch) arranged between two stages. In most cases, pipelining reduces the average number of cycles required to execute an instruction by permitting the processor to handle more than one instruction at a time.
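The relationship among the three factors listed above can be sketched as a simple product. The following is an illustrative calculation only; the function name and all numeric figures are hypothetical and are not taken from any particular processor.

```python
def execution_time(instruction_count, cycles_per_instruction, cycle_time_s):
    """Run time = instruction count x average cycles per instruction x cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_s


# Hypothetical figures: one million instructions on a 10 ns processor cycle.
# Pipelining is modeled here only as a lower average cycles-per-instruction.
unpipelined = execution_time(1_000_000, 5.0, 10e-9)
pipelined = execution_time(1_000_000, 1.25, 10e-9)

# Ratio of the two run times under these assumed figures.
speedup = unpipelined / pipelined
```

Under these assumed figures, reducing the average cycles per instruction from 5 to 1.25 shortens the run time by roughly a factor of four, while the instruction count and the cycle time stay unchanged.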
Many types of pipelined processors are currently available. For example, some processors may be classified as either complex-instruction-set computer (CISC) or reduced-instruction-set computer (RISC) processors. In CISC architectures, processor performance may be improved by reducing the number of instructions required to execute a program, while increasing the average number of cycles taken to decode and execute the (densely encoded) instructions. On the other hand, RISC architectures attempt to improve processor performance by reducing the number of cycles taken to execute an instruction, while allowing some increase in the total number of instructions. Though CISC and RISC architectures may improve processor performance to some degree, they are often limited to issuing only one instruction into the pipeline at a time. Such processors are referred to herein as “single-issue” or “scalar” processors.
Superscalar processors have been developed to reduce the average number of processor cycles per instruction (beyond what was possible in pipelined, scalar processors) by allowing concurrent execution of instructions in the same pipeline stage, as well as concurrent execution of instructions in different pipeline stages. Instead of issuing only one instruction per processor cycle, “superscalar” or “multi-issue” processors were given multiple pipelines, so that two or more instructions could be fed through the pipeline stages in parallel. The number of instructions that can be issued into the pipeline at any one time is often referred to as the “issue width” of the processor. In most cases, multi-issue processors may execute approximately 2 to N instructions at a time, where N is the issue width.
Other architectures attempting to improve performance by exploiting instruction parallelism include very-long-instruction-word (VLIW) processors and super-pipelined processors. VLIW processors increase processor speed by scheduling instructions in software rather than hardware. In addition, VLIW and superscalar processors can each be super-pipelined to reduce processor cycle time by dividing the major pipeline stages into sub-stages, which can then be clocked at a higher frequency than the major pipeline stages. As used herein, the term “superscalar processors” will be considered to include superscalar processors, VLIW processors and super-pipelined versions of each.
Many electronic devices are now embedded with digital signal processors (DSPs), or specialized processors that have been optimized to handle signal processing algorithms. DSPs may be implemented as either scalar or superscalar architectures, and may have several features in common with RISC-based counterparts. However, the differences between DSP and RISC architectures tend to be most pronounced in the processors' computational units, data address generators, memory architectures, interrupt capabilities, looping hardware, conditional instructions and interface features.
An efficient looping mechanism, in particular, is often critical in digital signal processing applications because of the repetitive nature of signal processing algorithms. In order to minimize the execution time required for looping, some DSP architectures may support zero-overhead loops by including dedicated internal hardware, otherwise referred to as a “hardware looping mechanism.” These hardware looping mechanisms may monitor loop conditions and decide, in parallel with all other operations, whether to increment the program counter or branch without cycle-time penalty to the top of the loop. Unlike conventional RISC processors, which may implement a “test-and-branch” at the end of every loop iteration, DSP architectures with zero-overhead looping mechanisms require no additional instructions to determine when a loop iteration has been completed.
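The program-counter decision described above can be sketched as a small software model. This is a minimal illustration under stated assumptions, not a description of any actual DSP: the three registers (loop_start, loop_end, loop_count) and the function name are hypothetical, and loop_count is assumed to be at least 1.

```python
def run_hardware_loop(loop_start, loop_end, loop_count, program_len):
    """Return the sequence of program-counter values that are fetched.

    Hypothetical model: the loop registers are compared against the
    program counter alongside every fetch, so no instruction in the
    program is spent on testing the loop condition.
    """
    trace = []
    pc = 0
    remaining = loop_count
    while pc < program_len:
        trace.append(pc)
        if pc == loop_end and remaining > 1:
            remaining -= 1
            pc = loop_start  # branch to the top of the loop
        else:
            pc += 1          # ordinary sequential fetch
    return trace


# Five-instruction program whose instructions 1 and 2 form a loop run 3 times.
trace = run_hardware_loop(loop_start=1, loop_end=2, loop_count=3, program_len=5)
```

In this model every fetched address belongs to the program itself. A test-and-branch loop would instead add a counter update and a branch instruction to each pass through the loop body.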
Zero-overhead looping mechanisms are currently provided in a variety of scalar DSP architectures. For example, some DSP architectures may provide zero-overhead looping on a single instruction (using, e.g., a REPEAT loop construct) or on multiple instructions (using, e.g., a DO loop construct). However, these looping mechanisms provide extremely limited flexibility, in that they apply only to loop instructions and not to other discontinuity instructions, such as conditional branch instructions (like the BNZ or “branch if not zero” instruction). As used herein, a “discontinuity instruction” may refer to any instruction that diverts program control away from the next instruction immediately following the discontinuity instruction in program sequence. In addition, currently available looping mechanisms do not allow branch instructions to be placed near the end of a loop, nor do they allow program control to branch back into the loop if another discontinuity instruction is encountered outside of the loop. These constraints further limit the flexibility of currently available hardware looping mechanisms.
To date, the inventors are unaware of any zero-overhead looping mechanisms currently available for use within superscalar processors. Instead, a branch-style looping construct, referred to as the Again (AGN) instruction, is often used to determine whether a loop iteration has been completed. In conventional architectures, the AGN instruction is re-issued into the pipeline for each new iteration of the loop. Unfortunately, re-issuing the AGN instruction reduces the issue width of multi-issue processors by consuming at least one instruction slot for each iteration of the loop.
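The cost of re-issuing a loop instruction on every iteration can be estimated with simple slot accounting. The sketch below is illustrative only; the function name, the four-wide issue width and the loop lengths are assumed figures, and the model assumes the loop instruction occupies exactly one issue slot per iteration.

```python
def useful_slot_fraction(issue_width, cycles_per_iteration, agn_slots=1):
    """Fraction of issue slots left for loop-body work in one iteration."""
    total_slots = issue_width * cycles_per_iteration
    return (total_slots - agn_slots) / total_slots


# Hypothetical 4-wide processor: a one-cycle loop body versus an eight-cycle one.
tight_loop = useful_slot_fraction(issue_width=4, cycles_per_iteration=1)
longer_loop = useful_slot_fraction(issue_width=4, cycles_per_iteration=8)
```

Under these assumptions the penalty is largest for short loops: a one-cycle loop body loses one of its four slots on every iteration, whereas an eight-cycle body loses one of thirty-two.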
Therefore, a need exists for an improved zero-overhead looping mechanism for both scalar and superscalar processor architectures. Such a looping mechanism would provide true zero-overhead looping by maintaining a maximum issue width at all times. In addition to loop instructions, an improved looping mechanism could be applied to other types of discontinuity instructions, such as conditional branch instructions. An improved looping mechanism would also be configured to support substantially any number of nested loops, in addition to hardware/software interrupts and other branch instructions that cause program control to be diverted outside of the loop.