Programmable processors can be general-purpose processors or application-specific instruction-set processors. They can be used for manipulating different types of information, including sound, images and video. In the case of application-specific instruction-set processors, the processor architecture and instruction set are customized, which reduces the system's cost and power dissipation significantly. Processor architectures usually consist of a fixed data path, which is controlled by a set of control words. Each control word controls parts of the data path, and these parts may comprise register addresses and operation codes for arithmetic logic units (ALUs) or other functional units. Each set of instructions generates a new set of control words, usually by means of an instruction decoder, which translates the binary format of the instruction into the corresponding control word, or by means of a micro store, i.e. a memory which contains the control words directly. Typically, a control word represents a RISC-like operation, comprising an operation code, two operand register indices and a result register index. The operand register indices and the result register index refer to registers in a register file.
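By way of illustration only, the translation of an instruction into a RISC-like control word can be sketched as follows. The 16-bit field layout, the opcode values and the names `ControlWord`, `decode` and `execute` are assumptions made for this sketch and are not taken from any particular processor.

```python
from typing import NamedTuple


class ControlWord(NamedTuple):
    opcode: int  # operation code for the functional unit
    src_a: int   # first operand register index
    src_b: int   # second operand register index
    dest: int    # result register index


def decode(instruction: int) -> ControlWord:
    """Split a 16-bit instruction into its fields.

    Assumed layout: bits [15:12] opcode, [11:8] src_a, [7:4] src_b, [3:0] dest.
    """
    return ControlWord(
        opcode=(instruction >> 12) & 0xF,
        src_a=(instruction >> 8) & 0xF,
        src_b=(instruction >> 4) & 0xF,
        dest=instruction & 0xF,
    )


# Hypothetical opcode table: 0x1 is add, 0x2 is subtract.
OPERATIONS = {
    0x1: lambda a, b: a + b,
    0x2: lambda a, b: a - b,
}


def execute(cw: ControlWord, register_file: list) -> None:
    """Apply the control word to the register file in place."""
    result = OPERATIONS[cw.opcode](register_file[cw.src_a], register_file[cw.src_b])
    register_file[cw.dest] = result
```

In this sketch all three register indices refer to one register file, as in the typical case described above.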
In the case of a Very Large Instruction Word (VLIW) processor, multiple instructions are packaged into one long instruction, a so-called VLIW instruction. A VLIW processor uses multiple, independent functional units to execute these multiple instructions in parallel. The processor allows exploiting instruction-level parallelism in programs and thus executing more than one instruction at a time. Due to this form of concurrent processing, the performance of the processor is increased. In order for a software program to run on a VLIW processor, it must be translated into a set of VLIW instructions. The compiler attempts to minimize the time needed to execute the program by optimizing parallelism. The compiler combines instructions into a VLIW instruction under the constraint that the instructions assigned to a single VLIW instruction can be executed in parallel and under data dependency constraints. In case no meaningful processing can take place in certain clock cycles for one or more functional units, a so-called no-operation (NOP) instruction is encoded in the VLIW instruction for that particular functional unit. In order to reduce the code size, and thus save costs in terms of required memory size and required memory bandwidth, a compact representation of NOP instructions in a data-stationary VLIW processor may be used, e.g. the NOP operations are encoded by single bits in a special header attached to the front of the VLIW instruction, resulting in a compressed VLIW instruction.
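The header-based NOP compression mentioned above can be sketched as follows. This is a simplified model under stated assumptions: operations are represented as strings, a NOP is represented by `None`, and header bit i is set when issue slot i holds a NOP. The names `compress` and `decompress` are hypothetical.

```python
NOP = None  # assumed marker for a no-operation in an issue slot


def compress(slots):
    """Return (header, payload) for one VLIW instruction.

    Header bit i is 1 when slot i holds a NOP; only the useful
    operations are kept in the payload, in slot order.
    """
    header = 0
    payload = []
    for i, op in enumerate(slots):
        if op is NOP:
            header |= 1 << i
        else:
            payload.append(op)
    return header, payload


def decompress(header, payload, num_slots):
    """Rebuild the full VLIW instruction from header and payload."""
    ops = iter(payload)
    return [NOP if (header >> i) & 1 else next(ops) for i in range(num_slots)]
```

For a four-slot instruction with NOPs in slots 1 and 2, only two operations and a four-bit header need to be stored instead of four full operation fields.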
To control the operations in the data pipeline of a processor, two different mechanisms are commonly used in computer architecture: data-stationary and time-stationary encoding, as disclosed in “Embedded software in real-time signal processing systems: design technologies”, G. Goossens, J. van Praet, D. Lanneer, W. Geurts, A. Kifli, C. Liem and P. Paulin, Proceedings of the IEEE, vol. 85, no. 3, March 1997. In the case of data-stationary encoding, every instruction that is part of the processor's instruction set controls a complete sequence of operations that have to be executed on a specific data item as it traverses the data pipeline. Once the instruction has been fetched from program memory and decoded, the processor controller hardware will make sure that the composing operations are executed in the correct machine cycle. In the case of time-stationary encoding, every instruction that is part of the processor's instruction set controls a complete set of operations that have to be executed in a single machine cycle. These operations may be applied to several different data items traversing the data pipeline. In this case it is the responsibility of the programmer or compiler to set up and maintain the data pipeline. The resulting pipeline schedule is fully visible in the machine code program. Time-stationary encoding is often used in application-specific processors, since it saves the overhead of hardware necessary for delaying the control information present in the instructions, at the expense of larger code size.
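The difference between the two encodings can be illustrated with a small sketch that converts a data-stationary program into the equivalent time-stationary schedule. The three-stage pipeline (`read`, `execute`, `write`), the one-operation-per-cycle issue model and the function name `to_time_stationary` are assumptions chosen for illustration.

```python
STAGES = ("read", "execute", "write")  # assumed three-stage pipeline


def to_time_stationary(program):
    """Convert a data-stationary program into a time-stationary schedule.

    program[i] is the operation issued at cycle i; in the data-stationary
    view it controls all of its own pipeline stages. The result holds one
    instruction per machine cycle, mapping each stage to the operation
    that occupies it in that cycle (None when the stage is idle).
    """
    n_cycles = len(program) + len(STAGES) - 1
    schedule = []
    for cycle in range(n_cycles):
        word = {}
        for depth, stage in enumerate(STAGES):
            issued = cycle - depth  # cycle in which the occupying operation was issued
            word[stage] = program[issued] if 0 <= issued < len(program) else None
        schedule.append(word)
    return schedule
```

In the resulting schedule, the control information for a single operation is spread over three consecutive instructions, and the pipeline fill and drain cycles appear explicitly in the program, which reflects the larger code size noted above.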
Programmable processors, such as a VLIW processor, may still unnecessarily consume power during the execution of NOP instructions. This problem related to NOP instructions is particularly present in parallel processors whenever these need to execute less parallel code, for example control-dominated code. Especially in the case of a highly parallel VLIW processor, this results in a large number of NOP instructions in the code, and hence unnecessary power consumption during execution of the NOP instructions. Traditionally, NOP instructions are encoded using a dedicated value recognised by the processing apparatus and not resulting in a change in the state of the processing apparatus. However, since the default code for NOP instructions obviously must be different from that of useful instructions, encoding NOP instructions in this way results in signal transitions and therefore unnecessary power consumption during execution of a NOP instruction following or preceding a useful instruction. In order to decrease power consumption during the execution of NOP instructions, a technique referred to as clock gating may be used, which shuts down parts of the processor's data path that are not used. The use of clock gating not only reduces the amount of power dissipated by unused sequential logic, but (pipeline) registers disabled by the clock gates will prevent signal transitions from rippling through unused combinational logic as well, and thus prevent further unnecessary power consumption. However, the latter depends strongly on the number of (pipeline) registers present and the exact location of these registers. Low-power processors ideally have shallow pipelines to prevent the need for additional power-consuming hardware required to resolve adverse pipeline effects, such as long branch latencies. The latter holds in particular for processors where computational efficiency is crucial, since these processors are often highly parallel, i.e. have many issue slots, and creating deep pipelines would add considerable hardware overhead in each issue slot. For reasons of minimising the amount of hardware, these highly parallel processors often use time-stationary instruction encoding to enable steering the vast number of hardware resources from a single highly parallel instruction without running into major instruction fetching and decoding bottlenecks.
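The signal transitions caused by a dedicated NOP code can be illustrated by counting the bits that change between consecutive control words of one issue slot. The 8-bit word width, the NOP code 0x00 and the useful operation code 0xB7 are hypothetical values, and the toggle count is only a crude proxy for switching activity, not a power model.

```python
def toggles(prev_word: int, next_word: int) -> int:
    """Number of bit positions that differ between two control words."""
    return bin(prev_word ^ next_word).count("1")


def total_toggles(stream) -> int:
    """Sum of bit transitions over a sequence of control words."""
    return sum(toggles(a, b) for a, b in zip(stream, stream[1:]))


NOP_CODE = 0x00   # assumed dedicated NOP encoding
USEFUL_OP = 0xB7  # assumed encoding of some useful operation

# A useful operation, then a NOP, then the same useful operation again.
stream = [USEFUL_OP, NOP_CODE, USEFUL_OP]
transitions = total_toggles(stream)
```

With these assumed values, every bit that is set in the useful operation toggles once when the NOP enters the issue slot and once more when the NOP leaves it.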
U.S. Pat. No. 6,535,984 describes a power reduction technique for VLIW processors, based on the use of so-called proxy NOP instructions. The number of signal transitions caused by NOP instructions is reduced by replacing a NOP instruction with the adjacent non-NOP instruction for the same issue slot of the VLIW processor, and at the same time making the guard of the substituted instruction equal to false, so that the decode circuitry does not send any execute/enable signals to the particular functional unit. These substituted instructions with false guards are named proxy NOP instructions. The described technique relies on the fact that a data-stationary instruction encoding is used, where all information related to an instruction is encoded in a single atomic portion of a single VLIW instruction issued in a single processor cycle. Furthermore, this technique assumes that each issue slot in the VLIW processor supports guarding. Moreover, the technique assumes that every operation supported by any issue slot in the VLIW processor can be guarded, i.e. is conditional. However, this technique is unsuitable for time-stationary VLIW processors. First, in time-stationary encoding, information related to a single instruction is typically spread across several VLIW instructions issued in different cycles, which means that information on NOP instructions corresponding to a single instruction is spread across multiple VLIW instructions. Second, instructions for time-stationary processors often do not encode operations as atomic entities. Instead, control information is encoded to directly steer processor resources, such as functional units, register files, bus multiplexers etc. This decoupling of “abstract” instruction information from actual resource steering allows techniques such as multicasting, where the result of a single operation can optionally be written to multiple register files in a single processor cycle.
For example, in data-stationary encoding, write back information, i.e. control information to write back result data into the register file, is normally encoded in separate instruction fields per operation result. Each field in this case contains a destination register address (register file, register index) specifying the register in which the corresponding result should be written. In cases where the same result is to be written into multiple register files, multiple destination register addresses, encoded in multiple fields per operation result, would be required. This is usually not supported in a data-stationary instruction format, because no efficient encoding exists, especially if the number of destinations receiving the same result can vary. Alternatively, separate instructions need to be added to a program to explicitly copy a result to other register files. Time-stationary encoding allows the use of separate fields to encode write back information per register file write port, rather than per operation result. Hence, rather than specifying per operation result in which register files a result should be written, one can specify per register file write port which operation result should be selected to be written into the register file. With this concept the same result can be written to an arbitrary number of register files in a single cycle, without impacting the number of instruction fields required. As a result of this decoupling, the same field in a time-stationary instruction can carry information corresponding to operations executed on different issue slots in different clock cycles. A given register file write port field in an instruction issued at cycle i+2 (i = 0, 1, 2, ...) may select a result produced by a first issue slot as the result of an instruction issued two cycles earlier in cycle i, whereas in the next instruction issued at cycle i+3 it may select a result produced by a second issue slot as the result of the instruction issued one cycle earlier at cycle i+2. Hence, one cannot identify a single group of instruction bits per instruction that encodes all control information belonging to a single complete NOP operation.
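The per-write-port encoding of write back information can be sketched as follows. This is a simplified model under stated assumptions: each register file has a single write port, each write port field holds either `None` (port idle) or a pair of source issue slot and destination register index, and the names `write_back`, `slot0`, `rf_a` and so on are hypothetical.

```python
def write_back(results, port_fields, register_files):
    """Perform the write back step of one time-stationary instruction.

    results:        {issue_slot: value} produced and available in this cycle
    port_fields:    {register_file: (source_slot, dest_index)} or None when
                    the write port of that register file is idle
    register_files: {register_file: list of register values}, updated in place
    """
    for rf_name, field in port_fields.items():
        if field is None:
            continue  # nothing is written through this port in this cycle
        source_slot, dest_index = field
        register_files[rf_name][dest_index] = results[source_slot]


# Multicast: the result of slot0 is written to two register files in one cycle.
register_files = {"rf_a": [0] * 4, "rf_b": [0] * 4, "rf_c": [0] * 4}
results = {"slot0": 42, "slot1": 7}
port_fields = {"rf_a": ("slot0", 3), "rf_b": ("slot0", 1), "rf_c": None}
write_back(results, port_fields, register_files)
```

The number of fields equals the number of write ports regardless of how many register files receive the same result. In a later instruction the `rf_a` field may select `slot1` instead, so the same field carries control information belonging to different operations in different cycles.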
It is therefore a disadvantage of the prior art method of reducing power usage by a VLIW processor that this method cannot be used for time-stationary processors.