The present invention generally relates to very long instruction word digital signal processors. More particularly, the present invention relates to a prefetch instruction buffer in a very long instruction word digital signal processor.
In a very long instruction word digital signal processor, it is important to minimize power consumption and execution time. In a digital signal processor employing a very long instruction word architecture, or another type of architecture, instructions are fetched from memory, whether main memory or an instruction cache, routed along instruction buses, and placed in instruction registers, where they are decoded for execution. Because very long instruction word instruction packets are long (typically 128 or 256 bits), the instruction buses are correspondingly wide to accommodate them.
A significant amount of power is consumed in driving these instruction buses. The amount of power consumed in moving a block of instructions from memory depends in part on the size of the block. For example, moving a 1 kilobyte (kB) block of instructions from memory consumes less power than moving a 500 kB block. In addition, the time needed to execute instructions is typically much less than the time needed to retrieve them from memory.
Instruction caches have been used in digital signal processors for many years to address problems associated with disparate processor and memory speeds. An instruction cache is a smaller amount of memory that runs at a high speed. To speed up the execution of instructions, instructions may be fetched from main memory prior to execution and placed into an instruction cache. When the instructions are needed, they can be retrieved from the instruction cache rather than from main memory, which takes significantly less time and consumes less power. For this reason, an instruction cache may be considered a latency hiding mechanism: the latency associated with fetching particular instructions from main memory at the moment that such instructions are needed is hidden by fetching such instructions from main memory prior to the time they are needed and placing them in a more quickly accessible location.
In a typical application, an instruction cache is implemented as a multi-level storage system. An instruction cache is arranged such that one level is examined or searched at a time, from the lowest level to the highest level, for a particular instruction or set of instructions. The lowest level of instruction cache is generally referred to as L1 or first level cache and is typically the fastest cache and consumes the least power. An L1 cache is typically located closest to the processor. An L1 cache generally is a small memory block and may be about 4 kB or 8 kB in size, although other sizes could be used. If the particular instructions sought are located in the first level cache, the instructions are routed to the instruction register for execution. If the particular instructions are not located in the first level cache, the next (second) level cache is then searched, and so forth, up to the highest level cache and then to main memory. The levels may be arranged in a manner such that higher levels of cache encompass lower levels of cache. As the level of instruction cache increases, so does the size of the cache and the power consumed in searching the cache.
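The level-by-level search described above can be illustrated with a minimal sketch. It assumes each cache level can be modeled as a dictionary keyed by instruction address and that higher levels encompass lower ones; the function name `find_instructions` and the sample addresses and packets are hypothetical and are not part of the described processor.

```python
# Hypothetical model: each cache level is a dict mapping address -> instruction packet.
def find_instructions(address, cache_levels, main_memory):
    """Search levels from lowest (L1) to highest, then fall back to main memory.

    Returns (source_name, instructions) for the first place `address` is found.
    """
    for level_number, level in enumerate(cache_levels, start=1):
        if address in level:
            return f"L{level_number}", level[address]
    return "main memory", main_memory[address]


# Assumed contents: L2 encompasses L1, and main memory encompasses both.
main_memory = {0x00: "packet A", 0x10: "packet B", 0x20: "packet C"}
l2 = {0x00: "packet A", 0x10: "packet B"}
l1 = {0x00: "packet A"}

hits = [find_instructions(a, [l1, l2], main_memory)[0] for a in (0x00, 0x10, 0x20)]
print(hits)  # ['L1', 'L2', 'main memory']
```

In this sketch, the farther from L1 an instruction packet is found, the more levels have been examined before it is located.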
In addition, the time needed to access a higher level cache ("access cycle time") is greater than for accessing a lower level cache. The access cycle time is the time required for a computer to locate, retrieve and route instructions to an instruction register. The access cycle time of a higher level cache is a multiple, but not necessarily an integer multiple, of the access cycle time for a lower level cache. For example, if it "costs" one cycle to access a first level cache, it might cost four cycles to access a second level cache, eight cycles to access a third level cache and so forth. The multiples are a result of many design parameters, some of which include the size of the caches, the speed of the memory, and the components used.
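As a worked illustration of the example figures above (one, four, and eight cycles), the following sketch totals the access cycles for a hypothetical workload. The fetch counts per level are assumed for illustration only, and the sketch further assumes that each fetch costs only the access cycle time of the level that satisfies it, ignoring any cycles spent examining lower levels first.

```python
# Access cycle costs per level, taken from the example figures in the text.
ACCESS_CYCLES = {"L1": 1, "L2": 4, "L3": 8}

# Hypothetical workload (assumed counts): where each of 100 fetches is satisfied.
fetch_counts = {"L1": 90, "L2": 8, "L3": 2}


def total_access_cycles(counts, cycles):
    """Sum the cycles spent, charging each fetch the cost of the level that served it."""
    return sum(counts[level] * cycles[level] for level in counts)


total_cycles = total_access_cycles(fetch_counts, ACCESS_CYCLES)
total_fetches = sum(fetch_counts.values())
print(total_cycles)                  # 138
print(total_cycles / total_fetches)  # average cycles per fetch
```

Under these assumed counts, the ten fetches served above the first level account for 48 of the 138 cycles, which shows how a small fraction of higher level accesses can make up a large share of the access time.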
Thus, depending upon where particular instructions are stored, even though an instruction cache helps to minimize the disparity between processing speeds and memory speeds, a significant amount of time and power may be consumed in locating particular instructions.
Moreover, in a digital signal processor, a large percentage of processing time may be spent in executing inner loops, repeat loops (also known as zero-overhead loops), and frequently called functions. Minimizing the power consumed and processing time expended by such loops and functions is a goal in designing digital signal processors. This goal is especially difficult to achieve when using a very long instruction word architecture because of the length of the instruction packet. Therefore, there is a need for a very long instruction word architecture digital signal processor that uses, stores, fetches, routes, and manipulates very long instructions, while minimizing the power consumed and the processing time expended in performing such functions.
The present invention overcomes problems in bussing blocks of very long instruction word instructions from storage in a memory unit to execution in a functional unit by placing such instructions in a small, fast, low-power prefetch instruction buffer. These instructions include repetitive instruction blocks, such as loops, and frequently used functions (e.g., Fast Fourier Transforms). A signal, such as a prefetch instruction flag bit, is generated to indicate that a particular block of instructions is to be retrieved from the prefetch instruction buffer rather than from the instruction cache. The prefetch instruction flag bit is sent to a prefetch instruction buffer control unit, which facilitates copying the instructions from a memory unit to the prefetch instruction buffer. The prefetch instruction buffer control unit also sets a select input flag bit on a multiplexer. The multiplexer receives at least two inputs, one from the prefetch instruction buffer and one from the instruction cache, and is designed to output one of its inputs based on the select input flag bit. In this case, the select input flag bit instructs the multiplexer to output the input it receives from the prefetch instruction buffer. The instructions are then transmitted to an instruction register, where they are decoded and sent to the appropriate functional units for execution. Once all the instructions in the particular block have been executed, the next block of instructions is processed either from the instruction cache or again from the buffer, as directed. Placing blocks of instructions in a small, fast buffer minimizes the time expended and power consumed in fetching long instructions.
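The selection path summarized above can be sketched in software as follows. This is an illustrative behavioral model only, not the claimed hardware: the class `PrefetchBufferControl`, the functions `multiplexer` and `fetch_block`, the block names, and the choice to copy a block into the buffer only on its first flagged use are all assumptions made for the sketch.

```python
class PrefetchBufferControl:
    """Hypothetical model of the prefetch instruction buffer control unit."""

    def __init__(self, memory):
        self.memory = memory        # memory unit: block id -> instruction block
        self.buffer = {}            # prefetch instruction buffer contents
        self.select_buffer = False  # models the select input flag bit

    def handle_block(self, block_id, prefetch_flag):
        """React to the prefetch instruction flag bit for one block."""
        if prefetch_flag:
            if block_id not in self.buffer:
                # Assumed policy: copy from the memory unit on first flagged use.
                self.buffer[block_id] = self.memory[block_id]
            self.select_buffer = True
        else:
            self.select_buffer = False


def multiplexer(select_buffer, buffer_input, cache_input):
    """Output the buffer input when the select bit is set, else the cache input."""
    return buffer_input if select_buffer else cache_input


def fetch_block(control, instruction_cache, block_id, prefetch_flag):
    """Return (source, instructions) as delivered to the instruction register."""
    control.handle_block(block_id, prefetch_flag)
    buffer_input = control.buffer.get(block_id)
    cache_input = instruction_cache.get(block_id)
    source = "buffer" if control.select_buffer else "cache"
    return source, multiplexer(control.select_buffer, buffer_input, cache_input)


memory = {"setup": ["load", "init"], "loop": ["mac", "mac", "branch"]}
instruction_cache = dict(memory)  # assume the cache also holds both blocks
control = PrefetchBufferControl(memory)

requests = [("setup", False), ("loop", True), ("loop", True)]
trace = [fetch_block(control, instruction_cache, b, f)[0] for b, f in requests]
print(trace)  # ['cache', 'buffer', 'buffer']
```

In this model the flagged loop block is copied once and every later iteration is served from the buffer, which corresponds to the case in which the next block is processed again from the buffer, as directed.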