1. Field of the Invention
The present invention is generally in the field of digital signal processing (“DSP”) and central processing units. In particular, the invention is in the field of very long instruction word (“VLIW”) processors.
2. Background Art
VLIW processors differ from general conventional processors. One primary difference is that VLIW processors use very long instruction words which are, simply stated, a combination of instructions which are generally handled concurrently by the processor. Examples of various types of instructions are arithmetic instructions, logical instructions, branch instructions, or memory associated instructions. Each instruction type is usually assigned to one or two specific logic units for its execution (each such logic unit is appropriately called an “execution unit”). A VLIW “packet” of instructions (also referred to as a “VLIW instruction packet” or an “instruction packet” in the present application) usually includes, in addition to the combination of instructions referred to above, other information which is needed for processing that particular combination of instructions. For example, a VLIW packet may include instructions to multiply and accumulate data from two arrays of numbers, together with an instruction indicating that the multiply and accumulate instructions are to be repeated a certain number of times.
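By way of illustration only, the multiply and accumulate example above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not a description of any actual instruction encoding: the names `VLIWPacket`, `slots`, `repeat_count`, and `run_multiply_accumulate` are hypothetical and do not appear in the present application.

```python
from dataclasses import dataclass


@dataclass
class VLIWPacket:
    """Hypothetical software model of a VLIW packet (names are assumptions)."""
    slots: tuple       # instructions intended to be handled concurrently
    repeat_count: int  # additional information: how often to repeat the packet


def run_multiply_accumulate(packet, array_a, array_b):
    """Repeat the multiply and accumulate pair packet.repeat_count times."""
    accumulator = 0
    for i in range(packet.repeat_count):
        product = array_a[i] * array_b[i]  # "multiply" slot
        accumulator += product             # "accumulate" slot
    return accumulator


packet = VLIWPacket(slots=("MUL", "ACC"), repeat_count=4)
result = run_multiply_accumulate(packet, [1, 2, 3, 4], [5, 6, 7, 8])
print(result)  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

In this sketch the repeat count travels with the packet, which mirrors the statement that a packet carries information beyond the combined instructions themselves.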
Execution of a computer program including VLIW packets residing in the computer's main memory (also referred to as the “external memory” in the present application) requires fetching each VLIW packet from the computer's main memory into the processor (also referred to as a “central processing unit” or “CPU”). The larger the program currently being used, the more often instructions must be fetched. This fetching process requires a certain number of clock phases and consumes a certain amount of power to transfer the instruction over the computer's internal data lines (also referred to as a “bus”). Therefore, the more often instructions have to be fetched from main memory, the less time the processor has available to decode and execute those instructions and the slower the speed at which the processor can finish tasks.
Furthermore, VLIW packets, which typically may be 128 bits or 256 bits long, are much longer than individual instructions, which are typically 32 bits long, used in conventional non-VLIW processors. The long VLIW packets require a greater number of interconnect lines to transfer all the individual instructions in the VLIW packet, that is, the “instruction bus” must be wider than that used in conventional non-VLIW processors. A wider bus consumes proportionately more power in direct relation to the increased width of the bus. Power consumption must be budgeted for the processor in order to avoid problems associated with excess power consumption, for example, overheating, which can lead to hardware failure. Therefore, the more often instructions have to be fetched from main memory, the more power the processor consumes fetching those instructions and the less power is available for the processor to perform other tasks.
Thus, it is desirable to set aside in a local memory, i.e. a memory requiring less time and less power to access than the main memory, a limited number of program instructions that the processor may want to fetch. An instruction cache is such a local memory. An instruction cache is a relatively small memory module where a limited number of program instructions may be stored. The processor performs constant checks to determine whether instructions stored in the main memory required by the processor are already resident in the instruction cache. If they are already resident in the instruction cache, the instruction fetch step is performed by referring to the instruction cache, since there is no need to go to the main memory to find what is already in the instruction cache.
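The residency check described above can be sketched in software as follows. This is a simplified illustration, assuming a direct-mapped organization; the source does not specify how the cache is organized, and the names `InstructionCache`, `num_lines`, `hits`, and `misses` are hypothetical.

```python
class InstructionCache:
    """Toy direct-mapped instruction cache (organization is an assumption)."""

    def __init__(self, main_memory, num_lines=4):
        self.main_memory = main_memory   # maps address -> instruction
        self.num_lines = num_lines
        self.lines = [None] * num_lines  # each entry: (address, instruction)
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        index = address % self.num_lines
        entry = self.lines[index]
        # Check whether the required instruction is already resident.
        if entry is not None and entry[0] == address:
            self.hits += 1
            return entry[1]
        # Not resident: go to main memory and keep a local copy.
        self.misses += 1
        instruction = self.main_memory[address]
        self.lines[index] = (address, instruction)
        return instruction


main_memory = {address: f"packet_{address}" for address in range(8)}
cache = InstructionCache(main_memory, num_lines=4)
trace = [0, 1, 2, 0, 1, 2]  # a three-instruction loop executed twice
fetched = [cache.fetch(address) for address in trace]
print(cache.hits, cache.misses)  # 3 3
```

Note that even on a hit, every instruction in the trace is still fetched again, only from the cache rather than from main memory.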
The instruction cache approach is inadequate for a number of specific applications, such as digital signal processing or DSP, where repetition of blocks of instructions (referred to as “instruction loops” or “repeat loops”) is frequently encountered. For digital signal processing, as an example, it is estimated that 80% of processor execution time is spent executing short repeat loops. Short repeat loops commonly occur, for example, in the “butterfly” portion of many Fast Fourier Transform (“FFT”) algorithms, which are frequently used in digital signal processing.
Execution of a repeat loop, and in particular a short repeat loop, requires refetching each instruction before it is repeated. The constant refetching of repeated instructions consumes a substantial amount of processor time and, in view of the special considerations of bus width in VLIW processors, a substantial amount of power, even with the use of local memory techniques, such as instruction cache.
Therefore, there is a need in the art for avoiding refetching of VLIW packets which occur in short repeat loops. Also, there is a need in the art for avoiding needlessly refetching any instruction which has recently been executed in a VLIW processor, whether or not the instruction occurs in a short repeat loop. Further, there is need in the art for reducing the power consumed by instruction fetching in VLIW processors. Moreover, there is need in the art for reducing the power consumed by instruction fetching in a VLIW processor while maintaining or improving the speed and performance of the processor.
The present invention is directed to a method for reducing power when fetching instructions in a processor and a related apparatus. The invention addresses the need in the art for avoiding refetching of VLIW packets which occur in short repeat loops. The invention avoids needlessly refetching any instruction which has recently been executed in a VLIW processor, whether or not the instruction occurs in a short repeat loop. Further, the invention reduces the power consumed by instruction fetching in VLIW processors while maintaining or improving the speed and performance of the processor.
According to the invention, an instruction loop having at least one instruction is identified. For example, each instruction can be a VLIW packet comprised of several individual instructions. The instructions of the instruction loop are fetched from a program memory, which can be, for example, a cache or an external memory. The instructions are then stored in a register queue. For example, the register queue can be implemented with a head pointer which is adjusted to select a register of the register queue in which to write each instruction that is fetched.
It is then determined whether the processor requires execution of the instruction loop. When the processor requires execution of the instruction loop, the instructions are output from the register queue. For example, the register queue can be implemented with an access pointer which is adjusted to select a register of the register queue from which to output each instruction that is required. The instructions are then passed to an instruction decode unit for decoding and execution.
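The store and output steps described above can be modeled in software as follows. This is a minimal sketch, not the claimed hardware: it assumes that the loop is written starting at register 0, that the loop fits entirely in the queue, and that the access pointer wraps at the end of the stored loop. The names `RegisterQueue`, `store`, `output`, and `loop_length` are hypothetical.

```python
class RegisterQueue:
    """Toy model of a register queue with a head pointer and an access pointer."""

    def __init__(self, size):
        self.registers = [None] * size
        self.head = 0         # selects the register to write next
        self.access = 0       # selects the register to output next
        self.loop_length = 0  # number of loop instructions stored so far

    def store(self, instruction):
        """Write a fetched instruction and adjust the head pointer."""
        if self.loop_length == len(self.registers):
            raise OverflowError("loop does not fit in the register queue")
        self.registers[self.head] = instruction
        self.head = (self.head + 1) % len(self.registers)
        self.loop_length += 1

    def output(self):
        """Output a required instruction and adjust the access pointer."""
        if self.loop_length == 0:
            raise LookupError("no instruction loop has been stored")
        instruction = self.registers[self.access]
        self.access = (self.access + 1) % self.loop_length
        return instruction


program_memory = ["load", "mac", "store"]  # stand-ins for VLIW packets
queue = RegisterQueue(size=4)
fetch_count = 0
for instruction in program_memory:  # each loop instruction is fetched once
    fetch_count += 1
    queue.store(instruction)

# Three iterations of the loop are served from the queue without refetching.
executed = [queue.output() for _ in range(3 * len(program_memory))]
print(fetch_count, len(executed))  # 3 9
```

In this sketch, nine instructions are passed on for decoding while program memory is accessed only three times, which illustrates why outputting repeated instructions from the register queue avoids refetching.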