In a VLIW processor, a typical problem is the difficulty of making effective use of the full capabilities of the fixed-length VLIWs available in the hardware. In previous designs, this problem led to a very porous VLIW memory containing many No Operation (NOP) instructions within the VLIWs. Some machines have attempted to encode the NOPs to utilize the VLIW memory space more fully. One motivation for such attempts was to make better use of the costly VLIW memory included in these earlier processors. The encoded NOPs were typically assigned to each specific VLIW, with no reuse of the VLIW possible in different areas of the program.
There are other needs to be met by a VLIW parallel data processor. For example, it is desirable to pipeline operations in order to achieve a steady state flow of data for maximum throughput. Consider the case of matrix multiplication using a VLIW architecture with four short instruction words (SIWs) per VLIW. In the example of FIG. 1, a 4-element vector 2 and a 4×4 matrix 4 are multiplied. Given a processor with operands stored in a register file and VLIW execution units that operate on register file source data operands and deliver result data to the register file, it can be reasonably assumed that the vector elements are stored in data registers R20=a0, R21=a1, R22=a2, and R23=a3, and that the 4×4 matrix 4 is stored in a processor-accessible memory. FIG. 2 illustrates how the entire operation is handled in a typical prior art approach. Each row in table 10 represents a unique short instruction word (SIW) or VLIW instruction, with the program flow beginning at the top of the table and proceeding time-wise down the page. The Load operation is an indexed load that incrementally addresses memory to fetch the data element listed and load it into the specified register R0 or R1. The Add and Mpy instructions provide the function Rtarget=Rx Operation Ry, where Rtarget is the operand register closest to the function name and the source operands Rx and Ry are the second and third registers specified. Each unique VLIW memory address is identified with a number in the first column. The table 10 of FIG. 2 shows that a minimum of seven VLIWs, each stored in a unique VLIW memory address, and three unique SIWs, are required to achieve the desired results in the prior art. It is important to note that of the seven VLIWs, three VLIWs, namely numbers 1, 2, and 7, use only two SIWs per VLIW; the other four use three SIWs per VLIW. When a four-instruction-slot VLIW contains only two SIWs, the other two slots contain NOP instructions.
When the four-instruction-slot VLIW contains three SIWs, the other slot contains a single NOP. With a five-instruction-slot VLIW, as will be described in greater detail below, even poorer usage of the VLIW memory results when prior art techniques are used. In the vector matrix example, a five-slot VLIW will use 7×5=35 VLIW memory locations, of which 17 hold NOPs, assuming the fifth slot is not used for this matrix multiplication example: the seven VLIWs contain 3×2+4×3=18 SIWs, leaving 35−18=17 NOPs. The prior art approach results in a very porous VLIW memory with numerous NOP instructions.
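The slot-usage arithmetic for the example of FIG. 2 can be checked with a short sketch. It is illustrative only: the per-VLIW SIW counts are taken from the description of table 10 above (VLIWs 1, 2, and 7 hold two SIWs, the other four hold three), while the names `siws_per_vliw` and `slot_usage` are hypothetical and do not come from the figures.

```python
# SIW count for each of the seven VLIWs, keyed by VLIW number,
# as described for table 10 of FIG. 2.
siws_per_vliw = {1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 3, 7: 2}


def slot_usage(siws_per_vliw, slots_per_vliw):
    """Return (total_slots, used_slots, nop_slots) for a fixed-width VLIW memory.

    Every VLIW occupies `slots_per_vliw` locations; any slot that does not
    hold an SIW is assumed to hold a NOP.
    """
    total_slots = len(siws_per_vliw) * slots_per_vliw
    used_slots = sum(siws_per_vliw.values())
    return total_slots, used_slots, total_slots - used_slots


if __name__ == "__main__":
    for slots in (4, 5):
        total, used, nops = slot_usage(siws_per_vliw, slots)
        print(f"{slots}-slot VLIW: {total} locations, {used} SIWs, {nops} NOPs")
```

Under these assumptions the five-slot case gives 35 locations holding 18 SIWs and 17 NOPs, matching the figures stated above, and the four-slot case gives 28 locations with 10 NOPs.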
It is desirable to reduce the number of unique VLIW memory addresses needed to accomplish the same task, since this makes more efficient use of the available hardware. It is also desirable to reduce duplicate instructions in the VLIW memory storage. This is an important consideration that allows a smaller VLIW memory to be designed into a processor, thereby minimizing its cost. Further, if the same VLIW memory address could be shared by multiple sections of code and even multiple programs, then the latency cost of loading the VLIW memories could be minimized, as compared to prior art approaches, and amortized over the multiple programs, thereby improving overall performance. In addition, it is desirable to extend this concept to multiple Processing Elements (PEs) and to a controller Sequence Processor (SP) of a Single Instruction Multiple Data stream (SIMD) machine.