A digital signal computer, or digital signal processor (DSP), is a special purpose computer that is designed to optimize performance for digital signal processing applications, such as fast Fourier transforms, digital filters, image processing, signal processing in wireless systems, and speech recognition. Digital signal processors are typically characterized by real-time operation, high interrupt rates and intensive numeric computations. In addition, digital signal processor applications tend to be intensive in memory access operations and to require the input and output of large quantities of data. Digital signal processor architectures are typically optimized for performing such computations efficiently.
Digital signal processors may include components such as a core processor, a memory, a DMA controller, an external bus interface, and one or more peripheral interfaces on a single chip or substrate. The components of the digital signal processor are interconnected by a bus architecture which produces high performance under desired operating conditions. As used herein, the term “bus” refers to a multiple conductor transmission channel which may be used to carry data of any type (e.g. operands or instructions), addresses and/or control signals. Typically, multiple buses are used to permit the simultaneous transfer of large quantities of data between the components of the digital signal processor. The bus architecture may be configured to provide data to the core processor at a rate sufficient to minimize core processor stalling.
Some processors have hardware loop mechanisms. These mechanisms allow the specification of a program loop, which is a series of instructions that is executed multiple times. The instructions are often referred to by address.
In a variable instruction width processor, instruction width is encoded within several bits of the instruction. Therefore, when executing a series of sequential instructions, the address of a given instruction cannot be determined until the previous instruction has been decoded. The address of the new instruction is the previous instruction address plus the previous instruction width.
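The address dependency described above can be illustrated with a minimal sketch. The function name and the widths (in bytes) are hypothetical and are not taken from any particular instruction set; the list of widths stands in for the values decoded from each instruction's width bits.

```python
def instruction_addresses(start_address, widths):
    """Return the address of each instruction in a sequential run.

    `widths` holds the width of each instruction in bytes, standing in
    for the value decoded from that instruction's width bits
    (hypothetical values).
    """
    addresses = []
    address = start_address
    for width in widths:
        addresses.append(address)
        # The next address is known only after this width has been decoded.
        address += width
    return addresses
```

For example, a run starting at 0x100 with widths 2, 4, 2 and 8 places the instructions at 0x100, 0x102, 0x106 and 0x108.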
In modern pipelined processors, several cycles are often required to fetch an instruction after the address is known. A cycle after the request to fetch is sent, the address is incremented by a fixed amount and another request is sent to fetch the next sequential block of memory. This describes a typical pipelined fetch unit.
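As a rough illustration of such a fetch unit, the sketch below lists the requests issued on successive cycles and the cycle on which each block would return. The block size and fetch latency are assumed parameters chosen for illustration; the text does not specify them.

```python
def fetch_schedule(start_address, block_size, latency, num_requests):
    """Return (issue_cycle, address, return_cycle) for each fetch request.

    One request is issued per cycle, and each request address is the
    previous one plus a fixed block size. `latency` is the assumed
    number of cycles between issuing a request and receiving its data.
    """
    schedule = []
    for cycle in range(num_requests):
        address = start_address + cycle * block_size
        schedule.append((cycle, address, cycle + latency))
    return schedule
```

With an 8-byte block and a 3-cycle latency, the request issued in cycle 0 for address 0x1000 returns in cycle 3, while the request for 0x1008 is already in flight.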
Each cycle, the fetch unit can return a block of memory which contains chunks of instructions. Each block of memory may contain several small instructions, or only a piece of a large instruction. An instruction assembly unit sends the next few pieces of instruction data to the main processor pipeline. As the processor detects the size of the current instruction (by looking at the instruction width bits), it tells the instruction assembly unit how to skip to the next instruction (for the next cycle).
For a processor to detect a program loop, a comparison is made between the current instruction address and the “loop bottom” address. If the addresses are the same, then the instruction is the last in a loop, so the “loop top” address should be sent to fetch. It is desirable to know about the address match early in the fetch pipeline (i.e. as soon as possible after the loop bottom instruction has been sent to the fetch unit) so as to start the loop top fetch as early as possible.
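The comparison can be expressed as a small sketch. The function and parameter names are hypothetical, and the handling of the remaining iteration count is an assumption added so that the loop eventually exits; the text above describes only the address match.

```python
def next_instruction_address(current_address, current_width,
                             loop_top, loop_bottom, iterations_left):
    """Return the address of the instruction that should execute next.

    `iterations_left` is the assumed number of further passes through
    the loop that remain after the current one.
    """
    if current_address == loop_bottom and iterations_left > 0:
        # Last instruction of the loop: redirect fetch to the loop top.
        return loop_top
    # Otherwise continue with the next sequential instruction.
    return current_address + current_width
```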
In a variable instruction width processor, it is never possible to know the loop end condition in time to slot the next fetch in the pipeline. Therefore, a structure called a “loop buffer” is commonly used to cache a few instructions at the top of a loop so that they can be injected into the instruction pipeline after a loop bottom is detected but before the fetch for the loop top has completed. (When this loop buffer is engaged, the fetch address requested is not the loop top address, but rather the loop top address plus the width of the instructions in the loop buffer.)
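The adjusted fetch address mentioned in the parenthetical can be computed as in the following sketch, which assumes the widths of the buffered instructions are available in bytes. The names are illustrative only.

```python
def loop_buffer_fetch_address(loop_top, buffered_widths):
    """Return the fetch address to request when the loop buffer is engaged.

    The loop buffer already holds the first few instructions of the
    loop, so fetching resumes just past them.
    """
    return loop_top + sum(buffered_widths)
```

For a loop top at 0x200 and a loop buffer holding instructions of 2, 4 and 2 bytes, the requested address would be 0x208.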
Because the address of the current instruction is not known until it enters the main processor pipeline, no loop comparison can be done until this point. Accomplishing the loop comparison, computing the next fetch address, and notifying the loop buffer to inject instructions together take longer than a cycle. If the loop buffer is not notified in time, then the processor must stall.
A loop is specified by a “loop setup” instruction, which specifies relative address offsets for the top and bottom of the loop (TOP_OFFSET, BOT_OFFSET) and a loop count. If the address of the loop setup instruction is called “PC_LSETUP”, then the address of the loop top is (PC_LSETUP+TOP_OFFSET), and the address of the loop bottom is (PC_LSETUP+BOT_OFFSET).
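The address arithmetic above can be written out as a short sketch. The function name is hypothetical; the parameters mirror PC_LSETUP, TOP_OFFSET and BOT_OFFSET from the text.

```python
def loop_bounds(pc_lsetup, top_offset, bot_offset):
    """Return (loop_top, loop_bottom) for a loop setup instruction.

    Both offsets are relative to the address of the loop setup
    instruction itself.
    """
    loop_top = pc_lsetup + top_offset
    loop_bottom = pc_lsetup + bot_offset
    return loop_top, loop_bottom
```

For instance, a loop setup instruction at 0x400 with TOP_OFFSET of 4 and BOT_OFFSET of 0x10 gives a loop top of 0x404 and a loop bottom of 0x410.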
The instruction directly following the loop setup instruction may be a loop bottom instruction. It can be determined directly from the loop setup instruction decode if this is the case (i.e. if BOT_OFFSET equals the width of the loop setup instruction). No address comparison is required for determining the loop bottom status of the instruction following the loop setup instruction.
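The reason no address comparison is needed can be seen in a minimal sketch: the following instruction sits at PC_LSETUP plus the loop setup width, and the loop bottom sits at PC_LSETUP plus BOT_OFFSET, so the two addresses match exactly when the offsets match. The function name and parameter `lsetup_width` are illustrative.

```python
def next_is_loop_bottom(bot_offset, lsetup_width):
    """Return True if the instruction right after loop setup is the loop bottom.

    Only fields available from decoding the loop setup instruction are
    used; PC_LSETUP cancels out of the comparison.
    """
    return bot_offset == lsetup_width
```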
However, determining the loop bottom status of the second and later instructions following the loop setup instruction is not as simple. It is desirable to know the result of this computation as soon as possible.
Accordingly, there is a need for improved methods and apparatus for early loop bottom detection in digital processors.