It is known to compress VLIW (Very Long Instruction Word) instructions. In order to reduce the code size, it is common to use variable length instruction formats. However, the instruction fetch hardware required to handle such variable length instruction formats tends to become complicated. Various solutions have been applied to control the instruction fetch.
WO 98/02798 discloses a superscalar microprocessor having an instruction alignment unit. The instruction alignment unit transfers a fixed number of instructions from an instruction cache to each of a plurality of decode units. The instructions are selected according to a predecode tag including start-byte bits that indicate which bytes are the first byte of an instruction. The beginning of an instruction may thus be identified for the purpose of decompressing the compressed instructions.
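The role of the start-byte bits can be illustrated with a minimal sketch. It is not taken from WO 98/02798; the function names, the one-bit-per-byte tag layout and the example bytes are assumptions made purely for illustration.

```python
def instruction_starts(start_bits):
    """Return the byte offsets whose start-byte bit is set."""
    return [i for i, bit in enumerate(start_bits) if bit]


def split_instructions(line, start_bits):
    """Cut a fetched line into instructions using the predecode tag.

    Assumption: one tag bit per byte of the line, set when that byte is
    the first byte of an instruction. Bytes before the first set bit
    (the tail of an instruction begun in an earlier line) are ignored.
    """
    starts = instruction_starts(start_bits)
    ends = starts[1:] + [len(line)]
    return [line[s:e] for s, e in zip(starts, ends)]


# Hypothetical 6-byte line holding instructions of 2, 1 and 3 bytes.
line = bytes([0x10, 0x11, 0x20, 0x30, 0x31, 0x32])
tag = [1, 0, 1, 1, 0, 0]
parts = split_instructions(line, tag)
```

Because the boundaries are read from the tag rather than derived from the instruction bytes, each instruction can be located without first decoding the one before it.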
U.S. Pat. No. 6,134,633 and WO 9743710, which relate to the Trimedia processor, disclose a method of expanding variable length VLIWs at high clock speeds. The instruction decoding logic is pipelined, which means that the header must be available one cycle before the operation segments. Therefore, the header bits of a given variable length instruction are stored together with the variable length instruction that precedes it in program memory. However, this has special consequences for branch targets. Because a branch target can be entered from different origins, a unique preceding instruction cannot be determined. This would mean that decompression of a branch target can start only after the branch has been taken, resulting in a loss of execution cycles. To prevent such a loss, branch targets are not compressed, which results in a high penalty in code size.
To obtain dense code for VLIW processors, VLIW instructions should preferably be compressed. This results in a variable number of instruction bits required per clock cycle. Fetching and decoding these instruction bits on time requires a decoder that can deduce from the instruction bits themselves how many bits are required to decode the instruction, and consequently where the next instruction starts. This process involves a feedback loop in the decoder: the decoding of the current instruction affects the decoding of the next instruction, because it determines the position in program memory where the next instruction is located.
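The feedback loop can be made concrete with a short sketch. The encoding used here is hypothetical and is not one described in the cited documents: each instruction is assumed to start with a one-byte header in which every set bit announces one operation segment of `SEGMENT_BYTES` bytes.

```python
SEGMENT_BYTES = 2  # assumed size of one operation segment


def instruction_length(header):
    """Length in bytes of an instruction, derived from its header byte.

    Assumed format: 1 header byte, then one segment per set header bit.
    """
    return 1 + SEGMENT_BYTES * bin(header).count("1")


def instruction_offsets(program):
    """Locate every instruction by walking the program sequentially.

    The start of instruction n+1 is known only after the header of
    instruction n has been inspected: this dependency is the feedback loop.
    """
    offsets = []
    pos = 0
    while pos < len(program):
        offsets.append(pos)
        pos += instruction_length(program[pos])
    return offsets


# Three hypothetical instructions with 2, 0 and 1 operation segments.
program = bytes([0b101, 0xA0, 0xA1, 0xB0, 0xB1,
                 0b000,
                 0b001, 0xC0, 0xC1])
offsets = instruction_offsets(program)
```

In the loop, `pos` cannot advance until `instruction_length` has examined the current header, which is why the length computation lies on the critical path of instruction fetch.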
The feedback loop potentially slows down processing, especially when the instructions are fetched in batches, as memory lines that can each contain (parts of) more than one instruction. To determine when a new memory line should be (pre-)fetched, it is necessary to determine when the position in memory of the next instruction crosses a boundary between memory lines.
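The boundary test amounts to simple address arithmetic, sketched below. The line size `LINE_BYTES` and the function names are assumptions chosen for illustration; positions and lengths are counted in bytes.

```python
LINE_BYTES = 16  # assumed size of one program memory line


def line_index(addr):
    """Index of the memory line that contains byte address addr."""
    return addr // LINE_BYTES


def next_starts_in_new_line(pos, length):
    """True if the instruction after the one at pos begins in a later line."""
    return line_index(pos + length) != line_index(pos)


def lines_spanned(pos, length):
    """Indices of all memory lines touched by an instruction."""
    first = line_index(pos)
    last = line_index(pos + length - 1)
    return list(range(first, last + 1))
```

For example, with these assumed values an instruction of 4 bytes at address 14 touches lines 0 and 1, so the second line has to be available before the instruction can be decoded in full. Note that both functions need the instruction length, which is itself a product of the feedback loop.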
Speeding up the feedback loop by pipelining, i.e. by decompressing a succeeding instruction while executing the current instruction, is impossible since this would directly influence the rate at which instructions can be fetched. For example, adding one pipeline stage could mean that only one instruction per two clock cycles can be fetched, instead of one instruction every clock cycle.
Another disadvantage is that, in order to enable fetching of branch targets on time, a complete target instruction must fit on a single program memory line. (Instructions are regarded as having been fetched ‘on time’ when a new instruction can be taken into execution every cycle, that is, when each instruction is available for a first stage of processing, which includes decompression, as soon as the first stage of processing of the preceding instruction has been completed.) If the branch target is not compressed, the probability is small that the instruction fits in a memory line when it follows another instruction in that memory line. Therefore, such a target instruction is realigned to the start of another program memory line and padding bits are added in the original memory line, thereby increasing the code size even more.
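The cost of this realignment can be estimated with a small sketch. The function name, the default line size of 16 bytes and the byte granularity are illustrative assumptions, not values from the cited documents.

```python
def place_branch_target(pos, length, line_bytes=16):
    """Return (address, padding) for a branch target instruction.

    pos is the address at which the instruction would be stored if it
    simply followed its predecessor. If the instruction does not fit in
    the remainder of that memory line, it is moved to the start of the
    next line and the gap is counted as padding.
    """
    if length > line_bytes:
        raise ValueError("a branch target must fit in one memory line")
    room = line_bytes - pos % line_bytes  # bytes left in the current line
    padding = 0 if length <= room else room
    return pos + padding, padding
```

Under these assumptions, an 8-byte target that would otherwise start at address 10 is moved to address 16 at the cost of 6 padding bytes, whereas a 4-byte target at the same address needs no padding. The longer the (uncompressed) target, the more often padding is required.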