Many are interested in the goal of general purpose computing that achieves very high speeds by exploiting parallelism in a scalable, cost-effective way. There seems to be widespread consensus that the architecture of such machines will be composed of a number of nodes interconnected with a high speed, regular network, where each node is built with an off-the-shelf microprocessor. Because such machines are built out of commodity parts, and because the topology is scalable, it is felt that such a machine with hundreds or thousands of nodes will be cheaper and faster than classical supercomputers, which are built with exotic technology and are thus very expensive.
To date, the prevailing opinion seems to be that microprocessors have their own evolutionary momentum (from CISC to RISC and, now, to multiple instruction issue), and that a massively parallel machine will simply track this wave, using whatever microprocessors are currently available. However, a massively parallel machine is in fact a hostile environment for today's micros, largely because certain properties of the memory system in a massively parallel machine are fundamentally different from those assumed during the evolution of these micros. In particular, most micros today assume that all memory is equally distant, and that memory access time can be made effectively small by caching. Both these assumptions are questionable in a massively parallel machine.
On the other hand, dataflow processors have been designed from the start by keeping in mind the properties of the memory system in a parallel machine. However, past dataflow processor designs have neglected single-thread performance, and hence must be classified as exotic, not the kind of processor to be found in commodity workstations.
To be cost-effective, the micros used in massively parallel machines should be commodity parts, i.e., they should be the same micros as those used in workstations and personal computers. Market forces are such that a lot more design effort can be expended on a stock microprocessor than on a processor that is sold only in small quantities. In addition, there is a question of software cost. Parallel programs are often evolved from sequential programs, and will continue to use components that were developed for single-thread uniprocessors (such as transcendental function libraries, Unix, etc.). This does not mean that we are restricted to using good, conventional microprocessors in any parallel machine that we build. All it means is that any new processor that we design for multiprocessors must also stand on its own as a cheap and viable uniprocessor.
Parallel programs contain synchronization events. It is well known that processor utilization suffers if a processor busy-waits at such events; to avoid this, some form of multiplexing amongst threads (tasks or processes) is necessary. This is true even in uniprocessors.
In order to build parallel machines that are scalable both physically and economically, we must face the fact that inter-node latency in the machine will grow with machine size, at least by a factor of log(N), where N is the number of nodes in the machine. Thus, access to a non-local datum in a parallel machine may take tens to hundreds of cycles, or more. If we are to maintain effective utilization of the machine, a processor must perform some other useful work instead of idling during such a remote access. This requires that the processor be multiplexed amongst many threads, and that remote accesses must be performed as split transactions, i.e., a request and its response should be treated as two separate communication events across the machine. If we follow this argument a step further, we see that a communication entering a node will arrive at some relatively unpredictable time, and that we need some means of identifying the thread that is waiting for this communication. This is, in fact, a synchronization event.
Thus, the following picture emerges. In a parallel machine, the way to deal with long inter-node latencies is exactly the way to deal with synchronization. A program must be compiled with sufficient parallel slackness ("excess parallelism") so that every processor has a pool of threads instead of a single thread, and some threads are always likely to be ready to run. Each processor must be able to multiplex itself efficiently amongst these threads. All communications should be split transactions, in which (a) an issuing processor does not block to await a response, and (b) a receiving processor can efficiently identify and enable the thread that awaits an incoming communication. For a more thorough explication of this argument, please refer to Arvind and R. A. Iannucci, "Two Fundamental Issues in Multiprocessing," Proceedings of DFVLR--Conference 1987 on Parallel Processing in Science and Engineering, Bonn-Bad Godesberg, W. Germany, Springer-Verlag LNCS 295, Jun. 25-29, 1987.
Modern von Neumann microprocessors are excellent single-thread processors, but they are not designed to exploit parallel slackness efficiently. First, the cost of multiplexing amongst threads is high because of the enormous processor state that is associated with the currently executing thread. This state manifests itself in the register set and instruction and data caches, all of which may have to be reloaded with the new thread's context. Second, for a parallel environment, there is no efficient mechanism for naming, communicating and invoking continuations for split transactions to access remote locations. Third, many first-generation parallel machines had very poor interfaces to the interconnection network. There was a large software cost in handling incoming messages. This was further aggravated by the fact that messages trying to cross a node had to go through the node. However, many of the successors of these machines have solved this problem somewhat by devoting separate resources to message handling.
The net result is a high communication and synchronization cost with von Neumann machines. Programs can be written to use these machines effectively provided they minimize the occurrence of communication and synchronization events, and there are many success stories that do so. However, there is a high software cost associated with trying to structure programs to fit this model, and it is still a far cry from our goal of truly general purpose computing.
Dataflow architectures have evolved substantially over the years. We will focus our comments on Monsoon (G. M. Papadopoulos, "Implementation of a General-Purpose Dataflow Multiprocessor," PhD thesis, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. 02139, August 1988; G. M. Papadopoulos and D. E. Culler, "Monsoon: An Explicit Token Store Architecture," Proc. 17th Intl. Symp. on Computer Architecture, Seattle, Wash., May 1990 and U.S. patent application Ser. No. 07/396,480) as the most recent representative of that evolution.
Dataflow architectures are excellent at exploiting parallel slackness. Indeed, this has always been a major underlying rationale for dataflow architectures. Parallel slackness is achieved by partitioning a program into extremely fine grain threads; in the pure dataflow model, each instruction is a separate thread. A thread descriptor is implemented as a token, which includes three parts (FP,IP,V), where:
FP is a frame pointer, which points at a frame relative to which the instruction will be executed;
IP is an instruction pointer, which points to code; and
V is a data value.
The pool of threads in a processor is manifest as a token queue. On each cycle, a token is extracted from the token queue, and the instruction to which it refers is executed by the processor relative to the frame to which it points. Every instruction explicitly names its successor instruction(s). As a result of this execution, zero, one, or two successor tokens are produced, which are placed back in the token queue. Thus, a dataflow processor like Monsoon can multiplex between threads on every cycle.
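The token-queue cycle described above can be illustrated with a minimal sketch. It is not Monsoon's implementation: the instruction encoding, the names Token, PROGRAM, and run, and the first-in-first-out queue discipline are all assumptions made for illustration. Two threads in different frames run the same three-instruction code, and the trace shows the processor switching between them on every step.

```python
from collections import deque, namedtuple

# A thread descriptor, as in the text: frame pointer, instruction pointer, value.
Token = namedtuple("Token", ["fp", "ip", "v"])

# Hypothetical program: ip -> (operation, constant, successor ips).
# Every instruction explicitly names its successor instruction(s).
PROGRAM = {
    0: ("add", 1, [1]),
    1: ("mul", 2, [2]),
    2: ("out", None, []),  # produces zero successor tokens
}

def run(program, initial_tokens):
    """Execute tokens until the queue is empty; return (trace, outputs)."""
    queue = deque(initial_tokens)  # assumption: FIFO token queue
    trace = []    # (fp, ip) of each executed token, in execution order
    outputs = []  # (fp, value) emitted by "out" instructions
    while queue:
        tok = queue.popleft()  # one token extracted per "cycle"
        op, const, successors = program[tok.ip]
        trace.append((tok.fp, tok.ip))
        if op == "add":
            result = tok.v + const
        elif op == "mul":
            result = tok.v * const
        elif op == "out":
            outputs.append((tok.fp, tok.v))
            result = tok.v
        else:
            raise ValueError(f"unknown operation: {op}")
        # Successor tokens go back into the token queue.
        for succ in successors:
            queue.append(Token(tok.fp, succ, result))
    return trace, outputs

# Two threads in frames 0 and 1, both starting at instruction 0.
trace, outputs = run(PROGRAM, [Token(0, 0, 3), Token(1, 0, 10)])
```

In this sketch the trace alternates between frame 0 and frame 1, which is the per-cycle multiplexing the text describes; no state other than the token itself is carried from one instruction to the next.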
Split transactions are performed thus: when a processor wishes to read a remote location A, it executes a fetch instruction. This causes a "read" token to be constructed and injected into the network. Suppose the fetch instruction names label L as its successor instruction. The corresponding read request token contains the following information:
(READ, A, FP, L)
Once the read request token is sent out, the processor continues to execute other tokens in its token queue. When the read request token reaches the remote memory, the following token is sent back:
(FP, L, V)
This token is placed in the token queue to be executed just like any other token.
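The request and response formats above can be sketched as follows. This is an illustration under stated assumptions, not the actual hardware protocol: the network is modeled as a simple queue, and issue_fetch, remote_memory_service, the address 0x40, and the frame number 7 are hypothetical names and values.

```python
from collections import deque

def issue_fetch(network, addr, fp, successor_label):
    """Inject a read request token (READ, A, FP, L); the issuer does not block."""
    network.append(("READ", addr, fp, successor_label))

def remote_memory_service(memory, request):
    """At the remote memory, turn a read request into a response token (FP, L, V)."""
    kind, addr, fp, label = request
    if kind != "READ":
        raise ValueError(f"unexpected request kind: {kind}")
    return (fp, label, memory[addr])

# Hypothetical setup: one remote location A = 0x40 holding the value 99.
remote_memory = {0x40: 99}
network = deque()       # stands in for the interconnection network
token_queue = deque()   # the issuing processor's token queue

# The fetch names label "L" as its successor; the request carries FP and L
# so that the response can be routed back to the waiting thread.
issue_fetch(network, addr=0x40, fp=7, successor_label="L")

# ... the issuing processor would keep executing other tokens here ...

# Later, the remote memory handles the request and the response arrives.
response = remote_memory_service(remote_memory, network.popleft())
token_queue.append(response)  # executed just like any other token
```

The point of the sketch is that the continuation (FP, L) travels with the request, so the response is itself an ordinary token and no separate lookup is needed to find the waiting thread.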
In addition, Monsoon also has an efficient mechanism to synchronize two threads. Two threads that must join will arrive at a common instruction that names a frame location which contains "presence bits", which can be regarded as a synchronization counter. On arrival, each thread causes the counter to decrement. When the first thread arrives, the counter does not reach its terminal value; the instruction is aborted and the processor moves on to execute another token from the token queue. When the second thread arrives, the counter reaches its terminal value and the instruction is executed.
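A minimal sketch of this join follows. The text states only that each arrival decrements a counter held in a frame location and that the instruction executes when the terminal value is reached. The initial count of 2, the terminal value of 0, the parking of the first operand in the same frame location, and the name arrive_at_join are assumptions made for illustration.

```python
def arrive_at_join(frame, slot, value, operation):
    """Handle one thread arriving at a two-way join.

    Returns None if the join fails (the instruction is aborted and the
    processor would move on to another token), or the result of the
    instruction if this arrival completes the join.
    """
    cell = frame[slot]
    cell["count"] -= 1                # each arrival decrements the counter
    if cell["count"] > 0:             # terminal value (assumed to be 0) not reached
        cell["value"] = value         # assumption: first operand waits in the frame
        return None
    return operation(cell["value"], value)

# Hypothetical frame with one join location, initialized for two arrivals.
frame = {"j": {"count": 2, "value": None}}

first = arrive_at_join(frame, "j", 3, lambda a, b: a + b)   # join fails
second = arrive_at_join(frame, "j", 4, lambda a, b: a + b)  # join succeeds
```

Here the first arrival returns None and the second returns the sum of both operands, mirroring the abort-then-execute behavior described above.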
Thus, dataflow architectures (and Monsoon in particular) provide good support for exploiting parallel slackness--fine grain threads, efficient multiplexing, cheap synchronization, and support for split transactions to mask inter-node latency.
However, present dataflow architectures do not have good single-thread performance. The fundamental problem is that present dataflow architectures do not provide adequate control over the scheduling of threads. In the pure dataflow model, successive tokens executed by the processor may refer to arbitrarily different frames and instructions. The consequence is that an instruction can transmit values to its successors only through slow memory--it cannot exploit any special high speed storage such as registers and caches. In conventional uniprocessors, caches allow fast transmission of values because the successor instruction is executed immediately, while a previously stored value is still in the cache. This locality through successor-scheduling is absent in pure dataflow models. Pure dataflow models allow exactly one value to be transmitted without going to memory--the value on the token.
Monsoon improves on this situation. In Monsoon, an instruction can annotate one of its successors so that it is executed directly, i.e., instead of placing the token back into the token queue, it is recirculated directly into the processor pipeline. Thus, in a chain of such direct successors, instructions can communicate values down the thread via high speed registers--no other thread can intervene to disturb the registers. However, Monsoon still has some engineering limitations that limit single-thread performance, namely, (a) very few registers (only three) and (b) the processor pipeline is eight cycles long, so that each instruction in a chain takes eight cycles.
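The effect of direct successors on scheduling can be sketched by comparing two runs of the same code, one with and one without the annotation. This is an illustrative model only: make_program, run_chain, and the instruction encoding are hypothetical, a local variable stands in for a register, and pipeline depth is not modeled.

```python
from collections import deque

def make_program(direct):
    """Hypothetical code: ip -> (operation, constant, successor ip, direct flag)."""
    return {
        0: (lambda v, c: v + c, 1, 1, direct),
        1: (lambda v, c: v * c, 2, 2, direct),
        2: (lambda v, c: v, None, None, False),  # end of thread, no successor
    }

def run_chain(program, initial_tokens):
    """Return the (fp, ip) execution order for the given tokens."""
    queue = deque(initial_tokens)
    trace = []
    while queue:
        fp, ip, v = queue.popleft()
        while True:
            operation, const, successor, direct = program[ip]
            trace.append((fp, ip))
            v = operation(v, const)  # v plays the role of a register
            if successor is None:
                break
            if not direct:
                # Ordinary successor: the token goes back into the token queue.
                queue.append((fp, successor, v))
                break
            # Direct successor: recirculated, so no other thread intervenes.
            ip = successor
    return trace

tokens = [(0, 0, 3), (1, 0, 10)]
interleaved = run_chain(make_program(direct=False), tokens)
chained = run_chain(make_program(direct=True), tokens)
```

With the annotation, each thread in this sketch runs its whole chain before the other thread starts, which is what allows values to stay in registers between instructions.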
In Monsoon, control over scheduling stops at this point. A chain of direct successors is broken when it reaches an instruction that is a split transaction instruction (like a load), or when it reaches an instruction that executes a join that fails. At this point, there is no further control on the next thread to be executed. If we had such control, we might, for example, choose another thread from the same frame, to maintain locality with respect to the current frame.