Application programs can be considered to comprise sequential code segments with low levels of instruction-level parallelism (ILP), such as control code, and code segments with high levels of ILP, which are referred to herein as parallel code. Both of these code types are intermixed in an application program, and both need to be processed efficiently to achieve high performance for the whole program. Pure sequential code with no available ILP and pure parallel code with no control constructs such as branches do not typically exist in a whole application program, except possibly in very small code segments. In order to develop a high performance processor that does well on both sequential and parallel code, it is important to consider how to support small levels of instruction parallelism in sequential code, how to minimize the latency incurred in setting up parallel execution, how to support parallel execution more flexibly, and how to improve code density.
An indirect VLIW processor, such as the BOPS, Inc. Manta or the Mobile Media Processor (MMP), both subsets of the ManArray architecture, uses an execute VLIW (XV) indirect instruction mechanism to access multiple-instruction VLIWs for execution and thereby achieve high levels of selectable parallelism. The expense of using the indirect VLIWs in the Manta and MMP is primarily the load latency associated with loading VLIWs into local VLIW memories (VIMs). The Manta and MMP use a Load VLIW (LV) instruction to load VLIWs into the local VIMs, where the load latency is equal to the number of instructions to be loaded plus one for the LV. For a five-issue VLIW, the LV latency is 6 cycles. If the utilization of a specific VLIW is low, as would be the case in sequential code where ILP is low, then VLIWs would typically not be used because of the overhead imposed by this load latency. For example, storing a two-issue VLIW in VIM would cost two instructions plus the LV, and executing the two-issue VLIW would cost one additional cycle for the XV, for a total of 4 cycles. If the code segment were executed without use of the VLIW, it would cost only 2 cycles. For a three-issue VLIW, the cost would be 3 (instructions) + 1 (LV) + 1 (XV) = 5 cycles, as compared to only 3 cycles for executing the code directly. The indirect VLIWs were designed to support high-usage VLIWs, such as those found in digital signal processing (DSP) type loops, where the load latency is essentially insignificant and the overall performance gain is very high. Utilizing a Manta, an MMP, or a similar indirect VLIW architecture to mine the available ILP in sequential code, however, is not cost effective.
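The cycle counts given above can be reproduced with a simple cost model. The following Python sketch is illustrative only and is not taken from the Manta or MMP documentation. The function names are hypothetical, and the model assumes that a VLIW is loaded once with a single LV, that each later reuse costs one XV cycle, and that executing the same instructions directly costs one cycle per instruction.

```python
# Illustrative cycle-cost model for indirect VLIW use (assumed, simplified).
# All function names here are hypothetical and chosen for this sketch.

def lv_latency(n_issue):
    """Cycles to load an n-issue VLIW into VIM: n instructions plus one LV."""
    return n_issue + 1


def indirect_cycles(n_issue, executions=1):
    """Total cycles using the VLIW: one load, then one XV per execution.

    Assumes the VLIW stays resident in VIM after the single load.
    """
    return lv_latency(n_issue) + executions


def direct_cycles(n_issue, executions=1):
    """Total cycles without the VLIW: one cycle per instruction per pass."""
    return n_issue * executions


def break_even_executions(n_issue, limit=1000):
    """Smallest execution count at which the VLIW is strictly cheaper.

    Returns None if no such count exists up to `limit` (for example, a
    one-issue "VLIW" never pays back its load latency in this model).
    """
    for k in range(1, limit + 1):
        if indirect_cycles(n_issue, k) < direct_cycles(n_issue, k):
            return k
    return None


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, lv_latency(n), indirect_cycles(n), direct_cycles(n),
              break_even_executions(n))
```

Under these assumptions, a single execution of a two-issue VLIW costs 4 cycles against 2 cycles for direct execution, and a three-issue VLIW costs 5 cycles against 3, matching the figures in the text. The same model suggests why loops favor the indirect approach: a five-issue VLIW becomes cheaper than direct execution from its second use onward, whereas a two-issue VLIW must be reused four times before it pays back its load latency.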