It is widely accepted that Moore's Law growth in available transistors will continue for the next decade. Recent research, however, has demonstrated that simply scaling up current architectures will not convert these new transistors to commensurate increases in performance. This gap between the performance improvements that are needed and those that can be realized by simply constructing larger versions of existing architectures will fundamentally alter processor designs.
Three problems contribute to this gap, creating a processor scaling wall. The problems include the ever-increasing disparity between computation and communication performance—fast transistors, but slow wires; the increasing cost of circuit complexity, leading to longer design times, schedule slips, and more processor bugs; and the decreasing reliability of circuit technology, caused by shrinking feature sizes and continued scaling of the underlying material characteristics. In particular, modern superscalar processor designs will not scale, because they are built atop a vast infrastructure of slow broadcast networks, associative searches, complex control logic, and inherently centralized structures that must all be designed correctly for reliable execution. Like the memory wall, the processor scaling wall has motivated a number of research efforts. These efforts all augment the existing program counter-driven von Neumann model of computation by providing redundant checking mechanisms (see for example, the work by T. M. Austin, “DIVA: A reliable substrate for deep submicron microarchitecture design,” International Symposium on Microarchitecture, 1999); exploiting compiler technology for limited dataflow-like execution, as disclosed by R. Nagarajan et al., “A design space evaluation of grid processor architectures,” International Symposium on Microarchitecture, 2001; or efficiently exploiting coarse grained parallelism, as proposed by K. Mai et al., “Smart memories: A modular reconfigurable architecture,” International Symposium on Computer Architecture, 2002, or as disclosed by E. Waingold et al., “Baring it all to software: Raw machines,” IEEE Computer, vol. 30, no. 9, 1997.
A Case for Exploring Superscalar Alternatives
The von Neumann model of execution and its most sophisticated implementations, out-of-order superscalars, have been a phenomenal success. However, superscalars suffer from several drawbacks that are beginning to emerge. First, their inherent complexity makes efficient implementation a daunting challenge. Second, they ignore an important source of locality in instruction streams; and third, their execution model centers around instruction fetch, an intrinsic serialization point.
As features and cycle times shrink, the hardware structures that form the core of superscalar processors (register files, issue windows, and scheduling logic) become extremely expensive to access. Consequently, clock speed decreases and/or pipeline depth increases. Indeed, industry recognizes that building ever-larger superscalars as transistor budgets expand can be impractical, because of the processor scaling wall. Many manufacturers are turning to larger caches and chip multiprocessors to convert additional transistors into increased performance without impacting cycle time.
To squeeze maximum performance from each core, architects constantly add new algorithms and structures to designs. Each new mechanism, optimization, or predictor adds additional complexity and makes verification time an ever increasing cost in processor design. Verification already consumes about 40% of project resources on complex designs, and verification costs are increasing.
Untapped Locality
Superscalars devote a large share of their hardware and complexity to exploiting locality and predictability in program behavior. However, they fail to utilize a significant source of locality intrinsic to applications, i.e., dataflow locality. Dataflow locality is the predictability of instruction dependencies through the dynamic trace of an application. A processor could take advantage of this predictability to reduce the complexity of its communication system (i.e., register files and bypass networks) and reduce communication costs.
Dataflow locality exists, because data communication patterns among static instructions are predictable. There are two independent, but complimentary, types of dataflow locality—static and dynamic. Static dataflow locality exists, because, in the absence of control, the producers and consumers of register values are precisely known. Within a basic block and between basic blocks that are not control dependent (e.g., the basic blocks before and after an If-Then-Else) the data communication patterns are completely static and, therefore, completely predictable. Dynamic dataflow locality arises from branch predictability. If a branch is highly predictable and almost always taken, for instance, then the static instructions before the branch frequently communicate with instructions on the taken path and rarely communicate with instructions on the not-taken path.
The vast majority of operand communication is highly predictable. Such high rates of predictability suggest that current processor communication systems are over-general, because they provide instructions with fast access to many more register values than needed. If the processor could exploit dataflow locality to ensure that necessary inputs were usually close at hand (at the expense of other potential inputs being farther away), they could reduce the average cost of communication.
Instead of simply ignoring dataflow locality, however, superscalars destroy it in their search for parallelism. Register renaming removes false dependencies, enables dynamic loop unrolling, and exposes a large amount of dynamic instruction level parallelism (ILP) for the superscalar core to exploit. However, it destroys dataflow locality. By changing the physical registers and instruction uses, renaming forces the architecture to provide each instruction with fast access to the entire physical register file, which results in a huge, slow register file and complicated forwarding networks.
Destroying dataflow locality leads to inefficiencies in modern processor designs: The processor fetches a stream of instructions with a highly predictable communication pattern, destroys that predictability by renaming, and then compensates by using broadcast communication in the register file and the bypass network, combined with complex scheduling in the instruction queue. The consequence is that modem processor designs devote few resources to actual execution (less than 10%, as measured on an Intel Corporation Pentium III™ die photo) and the vast majority to communication infrastructure.
Several industrial designs, such as partitioned superscalars like the Alpha 21264, some very long instruction word (VLIW) machines, and several research designs have addressed this problem with clustering or other techniques, and exploit dataflow locality to a limited degree. But none of these approaches make full use of it, because they still include large forwarding networks and register files. Accordingly, it would be desirable to provide an execution model and architecture built expressly to exploit the temporal, spatial, and dataflow locality that exists in instruction and data streams.
The Von Neumann Model: Serial Computing
The von Neumann model of computation is very simple. It has three key components: a program stored in memory, a global memory for data storage, and a program counter that guides execution through the stored program. At each step, the processor loads the instruction at the program counter, executes it (possibly updating main memory), and updates the program counter to point to the next instruction (possibly subject to branch instructions).
Two serialization points constrain the von Neumann model and, therefore, superscalar processors. The first arises as the processor, guided by the program counter and control instructions, assembles a linear sequence of operations for execution. The second serialization point is at the memory interface where memory operations must complete (or appear to complete) in order to guarantee load-store ordering. The elegance and simplicity of the model are striking, but the price is steep. Instruction fetch introduces a control dependence between each instruction and the next and serves little purpose besides providing the ordering to which the memory interface must adhere. As a result, von Neumann processors are fundamentally sequential.
In practice, of course, von Neumann processors do achieve limited parallelism (i.e., instructions per cycle (IPC) greater than one), by using several methods. The explicitly parallel instructions sets for VLIW and vector machines enable the compiler to express instruction and data independence statically. Superscalars dynamically examine many instructions in the execution stream simultaneously, violating the sequential ordering when they determine it is safe to do so. In addition, recent work introduces limited amounts of parallelism into the fetch stage by providing multiple fetch and decode units.
It has been demonstrated that ample ILP exists within applications, but that the control dependencies that sequential fetch introduces constrain this ILP. Despite tremendous effort over decades of computer architecture research, no processor comes close to exploiting the maximum ILP present in applications, as measured in limit studies. Several factors account for this result, including the memory wall and necessarily finite execution resources, but control dependence and, by extension, the inherently sequential nature of von Neumann execution, remain dominant factors. Accordingly, a new approach is needed to overcome the limitations of the von Neumann model.
WaveScalar—A New Approach
An alternative to superscalar architecture that has been developed is referred to herein by the term “WaveScalar.” WaveScalar is a datafrow architecture. Unlike past dataflow work, which focused on maximizing processor utilization, WaveScalar seeks to minimize communication costs by avoiding long wires and broadcast networks. To this end, it includes a completely decentralized implementation of the “token-store” of traditional dataflow architectures and a distributed execution model. Commonly assigned U.S. patent application, Ser. No. 11/041,396, which is entitled “WAVESCALAR ARCHITECTURE HAVING A WAVE ORDER MEMORY,” describes details of this dataflow architecture, and the drawings and specification of this application are hereby specifically incorporated herein by reference.
The key difference between WaveScalar and prior art dataflow architectures is that WaveScalar efficiently supports traditional von Neumann-style memory semantics in a dataflow model. Previously, dataflow architectures provided their own style of memory semantics and their own dataflow languages that disallowed side effects, mutable data structures, and many other useful programming constructs. Indeed, a memory ordering scheme that enables a dataflow machine to efficiently execute code written in general purpose, imperative languages (such as C, C++, Fortran, or Java) has eluded researchers for several decades. In contrast, the WaveScalar architecture provides a memory ordering scheme that efficiently executes programs written in any language.
Solving the memory ordering problem without resorting to a von Neumann-like execution model enables a completely decentralized dataflow processor to be built that eliminates all the large hardware structures that make superscalars nonscalable. Other recent attempts to build scalable processors, such as TRIPS (R. Nagarajan et al., “A design space evaluation of grid processor architectures,” International Symposium on Microarchitecture, 2001 and K. Sankaralingam et al., “Exploiting ILP, TLP, and DLP with the polymorphous trips architecture,” in International Symposium on Computer Architecture, 2003), Smart memories (K. Mai et al., “Smart memories: A modular reconfigurable architecture”, in International Symposium on Computer Architecture, 2002) and Raw (W. Lee et al., “Space-time scheduling of instruction-level parallelism on a Raw machine,” International Conference on Architectural Support for Programming Languages and Operating Systems, 1998), have extended the von Neumann paradigm in novel ways, but they still rely on a program counter to sequence program execution and memory access, limiting the amount of parallelism they can reveal. WaveScalar completely abandons the program counter and linear von Neumann execution.
WaveScalar is currently implemented on a substrate comprising a plurality of processing nodes that effectively replaces the central processor and instruction cache of a conventional system. Conceptually, WaveScalar instructions execute in-place in the memory system and explicitly send their results to their dependents. In practice, WaveScalar instructions are cached in the processing elements—hence the name “WaveCache.”
The WaveCache loads instructions from memory and assigns them to processing elements for execution. They remain in the cache over many, potentially millions, of invocations. Remaining in the cache for long periods of time enables dynamic optimization of an instruction's physical placement in relation to its dependents. Optimizing instruction placement also enables a WaveCache to take advantage of predictability in the dynamic data dependencies of a program, which is referred to herein as “dataflow locality.” Just like conventional forms of locality (temporal and spatial), dataflow locality can be exploited by cache-like hardware structures.
Multithreading
Multithreading is an effective way to improve the performance of a computing system, and designers have long sought to introduce architectural support for threaded applications. Prior work includes hardware support for multiple thread contexts, mechanisms for efficient thread synchronization, and consistency models that provide threads with a unified view of memory at lower cost. Because of this large body of work and the large amount of silicon resources available, threaded architectures are now mainstream in commodity systems.
Interestingly, no single definition of a thread has proven suitable for all applications. For example, web servers and other task-based systems are suited to coarse-grain, pthread-style threads. Conversely, many media, graphics, matrix, and string algorithms contain significant fine-grain data parallelism. In addition, sophisticated compilers are capable of detecting parallelism on several levels—from instructions to loop bodies to function invocations.
Individual architectures, however, tend not to support this heterogeneity. Threaded architectures usually target a specific thread granularity or, in some cases, a small number of granularities, making it difficult or inefficient to execute and synchronize threads of a different grain. For example, extremely fine-grain applications cannot execute efficiently on a shared memory multiprocessor due to the high cost of synchronization. In contrast, dataflow machines provide excellent support for extremely fine-grain threads, but must be programmed in specialized languages to correctly execute traditional coarse-grain applications. This requirement stems from dataflow's inability to guarantee that memory operations will execute in a particular order.
In principle, if it could solve the ordering issue, a dataflow architecture like WaveScalar could support a wide range of thread granularities by decomposing coarse-grain threads into fine-grain threads. It would thus be particularly useful to employ such an approach in the WaveScalar architecture to achieve even greater efficiencies in processing than can be achieved using only ordered memory for processing coarse grain threads.
Adding thread support to an architecture requires that designers solve several problems. First, they must determine what defines a thread in their architecture. Then, they must simultaneously isolate threads from one another and provide mechanisms, such as access to shared state and synchronization primitives, that allow them to communicate. Popular multithreaded systems such as SMPs, CMPs (see K. Olukotun et al., “The case for a single-chip multiprocessor,” in Architectural Support for Programming Languages and Operating Systems, 1996), and SMTs (see D. M. Tullsen, et al., “Simultaneous Multithreading: Maximizing on-chip parallelism,” in International Symposium on Computer Architecture, 1995) define a thread in terms of its state, including a register set, a program counter, and an address space. In multiprocessors, thread separation is easy, because each thread has its own dedicated hardware and threads can only interact through memory. SMTs and other processors that support multiple thread contexts within a single pipeline (e.g., Tera (see R. Alverson et al., “The Tera computer system,” in International Conference on Supercomputing, pp. 1-6, 1990)) must exercise more care to ensure that threads do not interfere with one another. In these architectures, threads can communicate through memory, but other mechanisms are also possible.
The *T machine (see B. S. Ang et al., “StarT the next generation: integrating global caches and dataflow architecture,” Tech. Rep. CSGmemo-354, MIT, 1994), the J-Machine (see M. Noakes et al., “The j-machine multicomputer: An architecture evaluation,” 1993), and the M-machine (see M. Fillo et al., “The M-machine multicomputer,” in International Symposium on Computer Architecture, 1995) define threads in similar terms but support two thread granularities. They use fine-grain threads to enable frequent communication (J-machine, *T) or hide latency (M-machine). Coarse grain threads handle long-running, complex computations (J-machine, *T) or group fine-grain threads for scheduling (M-machine). Threads communicate via shared memory (J-machine, M-machine), message passing (J-machine), and direct accesses of another thread's registers (M-machine).
The Raw machine offers flexibility in thread definition, communication, and granularity by exposing the communication costs between tiles in a CMP-style grid architecture. A thread's state is at least the architectural state of a single tile, but could include several tiles and their network switches. Threads communicate through shared memory or by writing to the register files of adjacent tiles. For tightly synchronized threads, the compiler can statically schedule communication to achieve higher performance.
The TRIPS processor supports multiple threads by reallocating resources that it would otherwise dedicate to speculatively executing a single thread. In essence, it uses multiple threads to hide memory and branch latencies instead of speculating. The parameters that define a thread remain similar to other architectures.
The EM-4 hybrid dataflow machine (see M. Sato et al., “Thread-based programming for the EM-4 hybrid dataflow machine,” in International Symposium on Computer Architecture, 1992; and S. Sakai et al., “An architecture of a dataflow single chip processor,” in International Symposium on Computer Architecture, 1989) defines a thread using a set of registers and a memory frame. Synchronization is performed in a dataflow style, and the programmer is provided with library routines that make synchronization explicit.
The similarity in thread representation among these architectures reflects their underlying architectures—all are essentially small, register-based, PC-driven fetch-decode-execute-style processors. In contrast, WaveScalar is a dataflow architecture, though not the first to grapple with the role of threads. Most notably, the Monsoon architecture (see G. M. Papadopoulos et al., “Monsoon: An explicit token-store architecture,” in International Symposium on Computer Architecture, 1990; and G. M. Papadopoulos et al., “Multithreading: A revisionist view of dataflow architectures”, International Symposium on Computer Architecture, 1991), the P-RISC architecture (see R. S. Nikhil et al., “Can dataflow subsume von Neumann computing?,” in International Symposium on Computer Architecture, 1989, Computer Architecture News, 17(3), June 1989) and the Threaded Abstract Machine (TAM) architecture (see D. E. Culler et al., “Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine,” in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, 1991) have developed, to different extents, a model of dataflow machines as systems of interacting, fine-grained imperative threads.
P-RISC adapts ideas from dataflow to von Neumann multiprocessors. To this end, it extends a RISC-like instruction set with fork and join instructions and the notion of two-phase memory operations. Programs consist of numerous small imperative threads with small execution contexts. Whenever a thread blocks on a long-latency operation, such as a remote load, another thread is -removed from a ready queue (called the token queue) and executes. Synchronization between threads is handled with explicit memory instructions.
Programs for the Monsoon Explicit Token Store (ETS) architecture can be organized as collections of short, von Neumann-style threads that interact with each other and with memory using dataflow-style communication. The technique improves code scheduling by taking advantage of data locality. It also leads to an extension to the architecture in which the short, imperative threads employ a small set of high-speed temporary registers that are not part of the threads' stored context. Synchronization between threads is implicit, through the dataflow firing rule and presence bits in memory.
The TAM architecture adapts the Monsoon and P-RISC ideas to take advantage of hierarchical memory and scheduling systems. It does this by allowing the compiler more authority in scheduling code and data, adding a new level of scheduling hierarchy (called a quantum), and restricting communication between different groups of threads to well-defined communication interfaces. Synchronization between threads is explicit, as in P-RISC.
However, none of the prior art discussed above provides a workable solution for adapting an ordered memory dataflow architecture, such as WaveScalar, as necessary for enabling efficient processing using both fine- and coarse-grained threads. Accordingly, there is a need to provide a solution that includes the benefits of the dataflow architectures with such a multigrained thread processing capability.