The performance of conventional microarchitectures, measured in Instructions Per Cycle (IPC), has improved by approximately 50-60% per year. This growth has typically been achieved by increasing the number of transistors on a chip and/or increasing the clock speed. However, this scaling will not continue in future process technologies (90 nanometers and below), because fundamental pipelining limits and wire delays increasingly bind such architectures to the latency of on-chip communication.
Instruction-Level Parallelism (ILP), the execution of multiple independent instructions in parallel, constitutes another path to greater computational performance and another way to put additional transistors to use. One approach to increasing the exploitation of ILP is via conventional superscalar processor cores that detect parallelism at run time. The amount of ILP that can be detected is limited by the issue window, whose complexity grows as the square of the number of entries. Conventional superscalar architectures also rely on frequently accessed global structures, which slow the system clock or deepen the pipeline.
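To make the notion of ILP concrete, the following toy model (an illustrative sketch, not drawn from the source) represents each instruction as a destination register and its source registers; instructions whose sources are all available may issue in the same cycle, up to a hypothetical machine width, while a dependent chain necessarily serializes:

```python
def schedule(instrs, width=2):
    """Greedy dataflow schedule for a toy machine that can issue
    up to `width` ready instructions per cycle. Each instruction
    is a (dest, srcs) tuple; returns a list of cycles, each a
    list of instruction indices issued in that cycle."""
    dests = {d for d, _ in instrs}
    # source registers with no producer in the program are live-in
    ready = {s for _, srcs in instrs for s in srcs if s not in dests}
    remaining = list(range(len(instrs)))
    cycles = []
    while remaining:
        # issue any instructions whose sources have all been produced
        issued = [i for i in remaining
                  if all(s in ready for s in instrs[i][1])][:width]
        if not issued:
            raise ValueError("circular dependence")
        for i in issued:
            ready.add(instrs[i][0])
            remaining.remove(i)
        cycles.append(issued)
    return cycles

# r2 = r0 + r1 and r5 = r3 + r4 are independent, so both issue in
# cycle 0; r6 = r2 + r5 depends on both results and must wait.
prog = [("r2", ("r0", "r1")),
        ("r5", ("r3", "r4")),
        ("r6", ("r2", "r5"))]
print(schedule(prog))  # [[0, 1], [2]]
```

A superscalar core discovers such independence dynamically within its issue window; a VLIW compiler performs essentially this analysis statically, which motivates the comparison that follows.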
Another approach to exploiting parallelism is taken by VLIW machines, where ILP analysis is performed at compile time: instruction scheduling is done by the compiler, which orchestrates the flow of execution statically. However, this approach works well only for statically predictable code, and suffers when dynamic events occur: a run-time stall in one functional unit, or a cache miss, forces the entire machine to stall, since all functional units are synchronized. Thus, there is a need for new computational architectures that capitalize on the transistor miniaturization trend while overcoming these communication bottlenecks.