1. Technical Field
The embodiments herein generally relate to the looping functionality of a VLIW processor, and more particularly to a hardware looping mechanism configured to provide a zero-overhead software-pipelined loop that executes a large block of instructions using a very small buffer depth.
2. Description of the Related Art
A typical processor includes various functional units, and processor performance is often increased by overlapping the steps of multiple instructions, using a technique called pipelining. Software pipelining is a technique used to optimize loops in a manner that parallels hardware pipelining.
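The idea of software pipelining can be illustrated with a minimal sketch. The function names, the two-step loop body (load, then compute and store), and the doubling operation are hypothetical and chosen only for illustration; they are not taken from any particular processor or compiler. The restructured loop starts loading the next element before the current one is processed, which is the kind of overlap a scheduler can exploit on parallel functional units.

```python
def plain_loop(xs):
    """Reference loop: load, compute, and store one element per iteration."""
    out = []
    for x in xs:
        loaded = x            # step 1: load
        result = loaded * 2   # step 2: compute (hypothetical operation)
        out.append(result)    # step 3: store
    return out


def software_pipelined_loop(xs):
    """Same computation, restructured into prologue, kernel, and epilogue."""
    out = []
    if not xs:
        return out
    # Prologue: issue the first load before entering the kernel.
    loaded = xs[0]
    # Kernel: the load for element i + 1 is issued alongside the
    # compute and store for element i.
    for i in range(len(xs) - 1):
        next_loaded = xs[i + 1]
        out.append(loaded * 2)
        loaded = next_loaded
    # Epilogue: drain the last element that was loaded.
    out.append(loaded * 2)
    return out
```

Both functions produce the same results; only the ordering of the steps across iterations differs.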
To pipeline instructions, the various steps of instruction execution may also be performed by independent units called “pipeline stages”. The result of each pipeline stage is communicated to the next pipeline stage via a register (or latch) arranged between the two stages. In most cases, pipelining reduces the average number of cycles required to execute a task.
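The cycle savings can be estimated with a simple idealized model. The sketch below assumes that every stage takes exactly one cycle and that there are no stalls or hazards; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """The first instruction fills the pipeline; each later instruction
    completes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)
```

Under these assumptions, 100 instructions on a 5-stage pipeline take 104 cycles instead of 500, so the average cost per instruction approaches one cycle as the instruction count grows.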
Some of the architectures attempting to improve performance by exploiting instruction parallelism include very-long-instruction-word (VLIW) processors and super-pipelined processors. VLIW processors increase processor speed by scheduling instructions in software rather than in hardware. In addition, VLIW and superscalar processors can each be super-pipelined to reduce processor cycle time by dividing the major pipeline stages into sub-stages. These sub-stages can then be clocked at a higher frequency than the major pipeline stages.
Many electronic devices are now embedded with digital signal processors (DSPs), or specialized processors that have been optimized to handle signal processing algorithms. DSPs may be implemented as either scalar or superscalar architectures, and may have several features in common with their RISC-based counterparts. An efficient looping mechanism, in particular, is often critical in digital signal processing applications because of the repetitive nature of signal processing algorithms.
In order to minimize the execution time required for looping, some DSP architectures may support zero-overhead loops by including dedicated internal hardware (also referred to as a “hardware looping mechanism”). These hardware looping mechanisms may be included to monitor loop conditions and to decide, in parallel with all other operations, whether to increment the program counter or to branch, without a cycle-time penalty, to the top of the loop. Unlike conventional RISC processors, which may implement a “test-and-branch” at the end of every loop iteration, DSP architectures with zero-overhead looping mechanisms require no additional instructions to determine when a loop iteration has been completed.
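The program-counter selection described above can be modeled with a minimal sketch. The register names (start, end, count) and the functions below are hypothetical and do not describe any specific DSP; the model only shows that the loop-back decision is made by the next-address logic, so the loop body itself contains no test or branch instruction.

```python
from dataclasses import dataclass


@dataclass
class HardwareLoop:
    start: int  # address of the first instruction in the loop body
    end: int    # address of the last instruction in the loop body
    count: int  # iterations remaining, including the current one


def next_pc(pc, loop):
    """Choose the next program counter value alongside normal execution."""
    if pc == loop.end and loop.count > 1:
        loop.count -= 1
        return loop.start   # branch to the top of the loop
    return pc + 1           # otherwise simply increment


def run(loop, entry):
    """Return the sequence of addresses fetched until the loop is exited."""
    trace = []
    pc = entry
    while pc <= loop.end:
        trace.append(pc)
        pc = next_pc(pc, loop)
    return trace
```

For a three-instruction body at addresses 10 to 12 and a count of 3, the model fetches exactly nine instructions, with no fetch spent on loop control.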
For instance, DSP architectures typically provide zero-overhead looping on a single instruction or on multiple instructions. However, these looping mechanisms provide extremely limited flexibility. Typical DSP CPU architectures provide zero-overhead looping by means of dedicated hardware, such as a loop buffer of significant size. Such a loop buffer can hold the block of instructions to be executed in the loop only up to the limit allowed by the instruction buffer size, and is strictly dependent on that size. This poses a problem when certain application kernels that require large loops exceeding this limit need to be implemented.