In a pipelined VLIW processor, instructions are carried out in parallel and each instruction is executed in sub-steps. As a result, several consecutive instructions of a program, each at a different stage, can be executed simultaneously. A VLIW system may utilize a compiler that checks for dependencies among the instructions of a program and, accordingly, determines the order of execution of the instructions, including which instructions can be executed in parallel. However, existing compilers are not configured to generate optimal programs for such a structure. As a result, programmers themselves write VLIW programs whose instructions are tailored to run in parallel across the multiple processing units of a VLIW processor. Typical programming methodologies entail determining the order of execution of instructions in advance and accurately predicting the availability of the desired input data at the processing units. It may also be necessary to predict the availability and processing load of each processing unit, as different processing units may handle instructions of different sizes (i.e., different numbers of sub-steps). In these cases, “No Operation” (NOP) instructions are inserted in a program to synchronize the load across the processing units, but the use of NOP instructions decreases program density and results in sub-optimal program code, which, in turn, may warrant the use of code-compression techniques. In general, programming a VLIW processor with multiple processing units is complicated because the correct data must be at the correct place (i.e., processing unit) at the correct time.
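The effect of NOP padding on program density can be illustrated with a minimal sketch. It assumes a simplified model in which each processing unit has its own list of operations and shorter lists are padded with NOPs so that every instruction word has one slot per unit. The names `pack_bundles`, `density`, and the sample operation mnemonics are hypothetical and are used here only for illustration; they do not describe any particular VLIW instruction set.

```python
NOP = "NOP"  # placeholder mnemonic (assumption, not tied to any real ISA)


def pack_bundles(unit_streams):
    """Pad each unit's operation list with NOPs to a common length,
    then group slot i of every unit into one wide instruction word."""
    depth = max((len(stream) for stream in unit_streams), default=0)
    padded = [list(stream) + [NOP] * (depth - len(stream))
              for stream in unit_streams]
    return [tuple(word) for word in zip(*padded)]


def density(bundles):
    """Fraction of slots that hold a useful (non-NOP) operation."""
    slots = sum(len(word) for word in bundles)
    useful = sum(1 for word in bundles for op in word if op != NOP)
    return useful / slots if slots else 0.0


# Three hypothetical processing units with unequal workloads.
streams = [
    ["add", "mul", "sub", "st"],  # unit 0: four operations
    ["ld", "ld"],                 # unit 1: two operations
    ["shl"],                      # unit 2: one operation
]

bundles = pack_bundles(streams)
for word in bundles:
    print(word)
print(f"density = {density(bundles):.2f}")
```

In this example, 7 of the 12 slots carry useful operations, so five slots are spent on NOPs. This is the loss of program density that the paragraph above associates with load synchronization.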
In addition, conventional VLIW architectures include one or more instruction queues that are shared by a plurality of processing units to fetch instructions, and a plurality of data queues, each of which is assigned to only one processing unit to read and write data. Unfortunately, such architectures result in slower program execution, as they provide no flexibility to reduce the time needed to fetch instructions, nor do they dynamically utilize the plurality of data queues to read and write data.