It is widely accepted that Moore's Law growth in available transistors will continue for the next decade. Recent research, however, has demonstrated that simply scaling up current architectures will not convert these new transistors to commensurate increases in performance. This gap between the performance improvements that are needed and those that can be realized by simply constructing larger versions of existing architectures will fundamentally alter processor designs.
Three problems contribute to this gap, creating a processor scaling wall. The problems include the ever-increasing disparity between computation and communication performance—fast transistors, but slow wires; the increasing cost of circuit complexity, leading to longer design times, schedule slips, and more processor bugs; and the decreasing reliability of circuit technology, caused by shrinking feature sizes and continued scaling of the underlying material characteristics. In particular, modern superscalar processor designs will not scale, because they are built atop a vast infrastructure of slow broadcast networks, associative searches, complex control logic, and inherently centralized structures that must all be designed correctly for reliable execution. Like the memory wall, the processor scaling wall has motivated a number of research efforts. These efforts all augment the existing program counter-driven von Neumann model of computation by providing redundant checking mechanisms (see for example, the work by T. M. Austin, “DIVA: A reliable substrate for deep submicron microarchitecture design,” International Symposium on Microarchitecture, 1999); exploiting compiler technology for limited dataflow-like execution, as disclosed by R. Nagarajan et al., “A design space evaluation of grid processor architectures,” International Symposium on Microarchitecture, 2001; or efficiently exploiting coarse grained parallelism, as proposed by K. Mai et al., “Smart memories: A modular reconfigurable architecture,” International Symposium on Computer Architecture, 2002, or as disclosed by E. Waingold et al., “Baring it all to software: Raw machines,” IEEE Computer, vol. 30, no. 9, 1997.
A Case for Exploring Superscalar Alternatives
The von Neumann model of execution and its most sophisticated implementations, out-of-order superscalars, have been a phenomenal success. However, superscalars suffer from several drawbacks that are beginning to emerge. First, their inherent complexity makes efficient implementation a daunting challenge. Second, they ignore an important source of locality in instruction streams; and third, their execution model centers around instruction fetch, an intrinsic serialization point.
As features and cycle times shrink, the hardware structures that form the core of superscalar processors (register files, issue windows, and scheduling logic) become extremely expensive to access. Consequently, clock speed decreases and/or pipeline depth increases. Indeed, industry recognizes that building ever-larger superscalars as transistor budgets expand can be impractical, because of the processor scaling wall. Many manufacturers are tuning to larger caches and chip multiprocessors to convert additional transistors into increased performance without impacting cycle time.
To squeeze maximum performance from each core, architects constantly add new algorithms and structures to designs. Each new mechanism, optimization, or predictor adds additional complexity and makes verification time an ever increasing cost in processor design. Verification already consumes about 40% of project resources on complex designs, and verification costs are increasing.
Untapped Locality
Superscalars devote a large share of their hardware and complexity to exploiting locality and predictability in program behavior. However, they fail to utilize a significant source of locality intrinsic to applications, i.e., dataflow locality. Dataflow locality is the predictability of instruction dependencies through the dynamic trace of an application. A processor could take advantage of this predictability to reduce the complexity of its communication system (i.e., register files and bypass networks) and reduce communication costs.
Dataflow locality exists, because data communication patterns among static instructions are predictable. There are two independent, but complimentary, types of dataflow locality—static and dynamic. Static dataflow locality exists, because, in the absence of control, the producers and consumers of register values are precisely known. Within a basic block and between basic blocks that are not control dependent (e.g., the basic blocks before and after an If-Then-Else) the data communication patterns are completely static and, therefore, completely predictable. Dynamic dataflow locality arises from branch predictability. If a branch is highly predictable and almost always taken, for instance, then the static instructions before the branch frequently communicate with instructions on the taken path and rarely communicate with instructions on the not-taken path.
The vast majority of operand communication is highly predictable. Such high rates of predictability suggest that current processor communication systems are over-general, because they provide instructions with fast access to many more register values than needed. If the processor could exploit dataflow locality to ensure that necessary inputs were usually close at hand (at the expense of other potential inputs being farther away), they could reduce the average cost of communication.
Instead of simply ignoring dataflow locality, however, superscalars destroy it in their search for parallelism. Register renaming removes false dependencies, enables dynamic loop unrolling, and exposes a large amount of dynamic instruction level parallelism (ILP) for the superscalar core to exploit. However, it destroys dataflow locality. By changing the physical registers and instruction uses, renaming forces the architecture to provide each instruction with fast access to the entire physical register file, which results in a huge, slow register file and complicated forwarding networks.
Destroying dataflow locality leads to a shocking inefficiency in modern processor designs: The processor fetches a stream of instructions with a highly predictable communication pattern, destroys that predictability by renaming, and then compensates by using broadcast communication in the register file and the bypass network, combined with complex scheduling in the instruction queue. The consequence is that modern processor designs devote few resources to actual execution (less than 10%, as measured on a Pentium III die photo) and the vast majority to communication infrastructure. This infrastructure is necessary precisely because superscalars do not exploit dataflow locality.
Several industrial designs, such as partitioned superscalars like the Alpha 21264, some very long instruction word (VLIW) machines and several research designs have addressed this problem with clustering or other techniques and exploit dataflow locality to a limited degree. But none of these approaches make full use of it, because they still include large forwarding networks and register files. Accordingly, it would be desirable to provide an execution model and architecture built expressly to exploit the temporal, spatial, and dataflow locality that exists in instruction and data streams.
The von Neumann Model: Serial Computing
The von Neumann model of computation is very simple. It has three key components: a program stored in memory, a global memory for data storage, and a program counter that guides execution through the stored program. At each step, the processor loads the instruction at the program counter, executes it (possibly updating main memory), and updates the program counter to point to the next instruction (possibly subject to branch instructions).
Two serialization points constrain the von Neumann model and, therefore, superscalar processors. The first arises as the processor, guided by the program counter and control instructions, assembles a linear sequence of operations for execution. The second serialization point is at the memory interface where memory operations must complete (or appear to complete) in order to guarantee load-store ordering. The elegance and simplicity of the model are striking, but the price is steep. Instruction fetch introduces a control dependence between each instruction and the next and serves little purpose besides providing the ordering to which the memory interface must adhere. As a result, von Neumann processors are fundamentally sequential; there is no parallelism in the von Neumann processor model.
In practice, of course, von Neumann processors do achieve limited parallelism (i.e., instructions per cycle (IPCs) greater than one), by using several methods. The explicitly parallel instructions sets for VLIW and vector machines enable the compiler to express instruction and data independence statically. Superscalars dynamically examine many instructions in the execution stream simultaneously, violating the sequential ordering when they determine it is safe to do so. In addition, recent work introduces limited amounts of parallelism into the fetch stage by providing multiple fetch and decode units.
It has been demonstrated that ample ILP exists within applications, but that the control dependencies that sequential fetch introduces constrain this ILP. Despite tremendous effort over decades of computer architecture research, no processor comes close to exploiting the maximum ILP present in applications, as measured in limit studies. Several factors account for this result, including the memory wall and necessarily finite execution resources, but control dependence and, by extension, the inherently sequential nature of von Neumann execution, remain dominant factors. Accordingly, a new approach is needed to overcome the limitations of the von Neumann model.