In multiple processor systems, various mechanisms have been provided for scheduling instructions across the parallel processors. The goal of any such scheduling mechanism is to ensure that the many processors are kept busy through successive operating cycles, making full use of the available hardware.
One approach to scheduling parallel operations is based on the use of very long instruction words (VLIW), each of which is able to identify multiple operations to be performed in parallel. An advantage of such systems is that the parallelism can be scheduled once by the compiler rather than during run time with each execution of the program. Where data dependencies allow, the compiler schedules plural operations in the VLIW for simultaneous execution. However, compile-time scheduling is limited by unpredictable memory latencies and by some dependencies, such as data-dependent array references, which cannot be statically determined. Furthermore, branch boundaries tend to limit the number of operations that can be scheduled simultaneously. Consequently, applications exhibit an uneven amount of instruction level parallelism during their execution. In some parts of a program, all of the function units will be used, while in others serial computations with little instruction level parallelism dominate. Further, the amount of available parallelism depends on both the computation at hand and the accessibility of data. Long memory latencies can stifle the opportunities to exploit instruction level parallelism.
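The compile-time packing described above can be illustrated with a minimal sketch. The representation of operations, the dependence map, and the greedy packing loop are all assumptions for illustration, not part of the original disclosure: an operation may be placed in a long instruction word only after every operation it depends on has issued in an earlier word, and each word holds at most as many operations as there are function units.

```python
def pack_vliw(ops, deps, width):
    """Greedily pack operations into VLIW bundles.

    ops:   list of op names, assumed acyclic in their dependencies
    deps:  {op: set of ops whose results it reads}
    width: number of function units (slots per long instruction word)
    """
    done, bundles = set(), []
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            # Data dependencies limit parallelism: only ops whose inputs
            # were computed in earlier bundles may share this bundle.
            if deps.get(op, set()) <= done and len(bundle) < width:
                bundle.append(op)
                remaining.remove(op)
        done.update(bundle)
        bundles.append(bundle)
    return bundles

# The two independent adds share one word; the dependent add must wait.
bundles = pack_vliw(
    ["t1=a+b", "t2=c+d", "t3=t1+t2"],
    {"t3=t1+t2": {"t1=a+b", "t2=c+d"}},
    width=2,
)
```

The sketch also shows the limitation noted above: when few operations are mutually independent, most slots in each word go unused regardless of how many function units the hardware provides.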
Another scheduling technique which can be performed by a compiler is that of separating the program into threads of computation which can be directed to separate processors. Such systems suffer from periods during which too few threads can be identified to fill the plural processing units.
A multithreading approach combines the above two approaches with some success. In that approach, VLIW processing is supplemented with the ability to switch between threads during periods of long idle latency in the thread being processed. Thus, the system retains the fine instruction level parallelism but fills the periods of latency with VLIW processing of other threads. However, the system still suffers from the problem of non-uniform instruction level parallelism within each thread.
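The latency-filling behavior described above can be sketched as a simple cycle-level simulation. The thread representation, the ready-time bookkeeping, and the switch-on-long-latency policy are all assumptions chosen for illustration, not a description of any particular machine: when the running thread begins a long-latency operation (such as a memory load that misses the cache), the processor marks that thread busy and issues operations from another ready thread instead of idling.

```python
def run(threads):
    """Simulate switch-on-latency multithreading.

    threads: list of instruction streams, each a list of (op, latency)
             pairs. Returns a trace of (cycle, thread_id, op) showing
             which thread issued work on each cycle.
    """
    pcs = [0] * len(threads)       # per-thread program counters
    ready_at = [0] * len(threads)  # cycle at which each thread is ready
    trace, cycle, cur = [], 0, 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        # Threads that still have work and are not waiting on a latency.
        candidates = [i for i in range(len(threads))
                      if pcs[i] < len(threads[i]) and ready_at[i] <= cycle]
        if not candidates:
            cycle += 1             # every thread is stalled: idle cycle
            continue
        cur = cur if cur in candidates else candidates[0]
        op, latency = threads[cur][pcs[cur]]
        trace.append((cycle, cur, op))
        pcs[cur] += 1
        if latency > 1:
            # Long-latency operation: record when this thread becomes
            # ready again and hand the machine to the next thread.
            ready_at[cur] = cycle + latency
            cur = (cur + 1) % len(threads)
        cycle += 1
    return trace

# Thread 0 stalls for 3 cycles on its load; thread 1's ops fill the gap.
trace = run([[("load A", 3), ("use A", 1)],
             [("add", 1), ("mul", 1)]])
```

The same sketch exposes the residual problem noted above: thread switching hides latency, but it does nothing to increase the instruction level parallelism available within any single thread's stream.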