1. Field of the Invention
The present invention relates generally to data driven processing machines and methods, and in particular relates to a processor node architecture--its structure and method of programming.
2. Description of the Prior Art
Computer architectures are being forced away from the traditional von Neumann architecture to attain the performances required to run present day large scientific codes. These new architectures require much work to map a code to effectively use the machine. Usually the problem must be explicitly partitioned among parallel processors--a difficult and time consuming task. A data driven processor, however, requires only that the programmer specify what operations must be carried out to solve the problem. The order of execution and the number of parallel processing elements participating in the execution are determined by the hardware without further direction.
In all uses of parallel processing, however, the scientific code must be structured to compute with arrays. Such codes are marked by the necessity to perform similar computations on many data items. The maximum parallelism is obtained in any machine by completely unfolding the program loops that process the arrays. This has the disadvantage of requiring a separate copy of the code for each iteration in the loop, which produces a significant penalty when dealing with loops that may run for thousands of passes. It is more advantageous to have only a few copies of a loop code and process different data through them.
In evaluating any array model, or system architecture, the commonly understood approach is to view the arrays as streams. The loop processing is then structured as a pipeline with one element of the stream in each stage of the pipeline. Each stage of the pipeline sends a "next" signal to the previous stage to indicate that it is ready for the next data items.
For example, an array processor can solve the following multiple, dependent step problem:

C_i = A_i + B_i    (1)

E_i = C_i * D_i    (2)
One series of processors solves equation (1), and each processor outputs the answer C_i to a corresponding processor in a second series of processors, usually in response to a request (i.e., a "next" signal) for that C_i. Simultaneously, the corresponding processor requests the next D and uses the D_i corresponding to C_i to solve equation (2). The E_i values then appear at the output in order. The input values A_i, B_i, and D_i are also fed into the pipeline in order. One of the few ways to obtain more parallelism is to have several copies of the pipeline, which requires more memory in exchange for more parallelism to exploit. This approach, however, cannot be used in all situations. For example, this approach fails when the output data values are each a function of all of the input data values, which is the case when solving a system of equations.
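The stream model described above can be illustrated with a short sketch. It is only an illustration and does not reproduce any cited machine: the names add_stage, mul_stage and run_pipeline are hypothetical, and Python's generator protocol is assumed to stand in for the "next" signal, since each stage produces one element only when the following stage requests it.

```python
from typing import Iterable, Iterator, List


def add_stage(a: Iterable[float], b: Iterable[float]) -> Iterator[float]:
    """First series of processors: yields C_i = A_i + B_i on each request."""
    for a_i, b_i in zip(a, b):
        yield a_i + b_i


def mul_stage(c: Iterable[float], d: Iterable[float]) -> Iterator[float]:
    """Second series of processors: requests the next C_i and D_i, yields E_i."""
    for c_i, d_i in zip(c, d):
        yield c_i * d_i


def run_pipeline(a: Iterable[float], b: Iterable[float], d: Iterable[float]) -> List[float]:
    """Chain the two stages; the E_i values come out in stream order."""
    return list(mul_stage(add_stage(a, b), d))
```

With A = (1, 2, 3), B = (4, 5, 6) and D = (2, 2, 2), run_pipeline returns the E values 10, 14 and 18 in that order. The elements are evaluated strictly one after another, which is the sequential behavior of the stream model.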
The implementation of arrays as streams forces a sequential evaluation of the array elements (in the order they appear in the stream). This prevents a machine using this model of arrays from exploiting the inherent spatial concurrency available in many array operations. By contrast, vector machines are specifically optimized to take advantage of this spatial concurrency and realize most of their performance gains from this feature alone.
It has been proposed in Bagchi, "Arrays in Static Data Flow Computers", 1986 Proceedings of IEEE Region 5 Conference that data flow computers include traditional vector or array processors to exploit spatial concurrency. Such an inclusion would allow such concurrency to be exploited, but at the expense of the "purity" of the data flow model of computation. Corrupting the computing model in this manner would severely complicate the task of programming the machine due to the mixed models, and would also degrade the ability of a data flow machine to exploit fine grain parallelism everywhere in a problem. Exactly when to use the array processor and when to use the data flow processors is unclear, and efficient compilers for such a hybrid machine would be difficult if not impossible. A number of functional units is set apart for array computations, complicating the hardware design.
A similar array model (traditional control flow model) without the vector processors was discussed in Levin, "Suitability of a Data Flow Architecture for Problems Involving Simple Operations on Large Arrays;" Proceedings 1984 Int'l Conference on Parallel Processing, pp. 518-520, (August, 1985). Once again the machine was a hybrid of data flow and control flow models, resulting in many of the complications discussed above. In addition, the complication of the computing model including control flow array models made it difficult to provide enough array storage for the problem being studied. It also slowed the transfer of data between the models of computation.
Data flow architectures to date have addressed arrays in one or more of three ways: not at all, as streams, or as control flow "patches" to the data flow model. All three of these approaches have obvious shortcomings as described above. An array model is needed which is consistent with a data flow computing model and is able to exploit spatial concurrency.
However, a full appreciation of the problem must consider the basic architecture of data flow machines (DFM) and control flow machines (CFM). Data driven processing differs from control flow processing in many important ways. The design of a data driven processor is simpler than a control flow processor. The data driven processor is able to more effectively utilize pipelined execution. It is easier to specify the solution technique ("program") for a data driven processor--especially for parallel processing. The data storage is viewed differently in a data driven machine. Perhaps most importantly, a data driven machine is able to exploit more of the parallelism in a problem than can a traditional control flow machine. A more complete description of data driven computing theory may be found in the following: J. B. Dennis, "Data Flow Computation." In Control Flow and Data Flow: Concepts of Distributed Programming, Springer-Verlag, 1985; and K. P. Gostelow and R. E. Thomas, "Performance of a Simulated Dataflow Computer;" IEEE Transactions on Computers, C-29(10):905-919, October 1980, both incorporated herein by reference.
In traditional control flow processing the order of instruction execution is determined by a program counter. Each instruction is fetched from memory and decoded; data memory references are resolved and the operation performed; and the result is stored in memory. Differences in memory access times and inter-processor communication times can lead to varying minimum instruction times, complicating the processor design and limiting its sustained performance.
The architecture of a Data Flow Machine, on the other hand, uses the availability of data (rather than a program counter) to schedule operations. Once all the required parameters have been routed to an operation, all are automatically fed into the execution pipeline. The memory performs what was the control flow processor's fetching of instructions. Instructions are not "fetched" until all the data is ready, and thus there is never any wait time for memory access or interprocessor communication times. The "fetching" of an instruction sends the instruction together with all of its parameters to the execution pipeline. The machine's execution pipelines therefore stay full and operate at their maximum clock frequency as long as there are ready instructions anywhere in the program. The processor design is simplified since there is no memory system control or communication protocol within the processor.
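The firing rule described in this paragraph can be sketched in software. The following is a minimal simulation under stated assumptions, not the hardware of any machine discussed here: the class Instruction, the functions run and build_example, and the token format (instruction name, slot index, value) are all hypothetical.

```python
import operator
from collections import deque


class Instruction:
    """An operation plus the operand slots that must fill before it fires."""

    def __init__(self, op, arity, targets=()):
        self.op = op                   # function applied when the instruction fires
        self.slots = [None] * arity    # operand values held with the instruction
        self.filled = 0                # number of operands that have arrived
        self.targets = list(targets)   # (instruction name, slot index) fed by the result


def run(program, initial_tokens):
    """Fire instructions purely on data availability; there is no program counter."""
    ready = deque()                    # instructions whose operands are all present
    results = {}

    def deliver(name, slot, value):
        instr = program[name]
        instr.slots[slot] = value
        instr.filled += 1
        if instr.filled == len(instr.slots):
            ready.append(name)         # the last operand makes the instruction ready

    for name, slot, value in initial_tokens:
        deliver(name, slot, value)

    while ready:
        name = ready.popleft()
        instr = program[name]
        value = instr.op(*instr.slots)  # instruction and operands execute together
        results[name] = value
        for target, slot in instr.targets:
            deliver(target, slot, value)
    return results


def build_example():
    """Hypothetical two-instruction program computing (a + b) * d."""
    return {
        "add": Instruction(operator.add, 2, targets=[("mul", 0)]),
        "mul": Instruction(operator.mul, 2),
    }
```

Calling run(build_example(), [("mul", 1, 4), ("add", 0, 2), ("add", 1, 3)]) fires "add" as soon as its second operand arrives and then fires "mul" with the result. An instruction that never receives all of its operands never enters the ready queue.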
The order of instruction execution on a control flow machine must be specified exactly. The code implementing an algorithm must ensure that data required by any instruction is current (i.e., all necessary previous computations have been done). This introduces extra work into the translation of the problem to its solution on the machine because now the machine must be told not only how to calculate the results but also when to compute them. Since data driven processing uses the availability of data to determine the order of instruction execution, the code to solve a problem does not need to specify the order of computation. The specification of how to solve the problem will give all the information required since a data driven processor can never execute an instruction before its data is ready.
Initial data and intermediate values computed in a control flow machine are stored in memory locations, and the instructions operate on the data in those storage locations. Initial data and intermediate values in a data driven machine have meaning only as they are associated with an operation. Indeed, there is no concept of a data store, only chains of operations that pass data along them.
Parallel control flow processing requires an additional specification of the location where each operation is to be done. The programmer must now dictate what operations are required to solve the problem, in what order to perform them, and which processor is used to perform them. The transmission of intermediate results between control flow processors must also be explicitly done by the programmer or the compiler. In a data driven machine the hardware and the availability of data determine where and when to perform the required operations. Communication between processors is just the direct transmission of data between operations and needs no more direction from the programmer than in a uni-processor code. Codes therefore generate the same results when run on a thousand processors as on a single processor, and the exact same code may be run. The extension to parallel processing is solely a function of the machine and has none of the complications encountered in parallel control flow programming. See, for example, Gannon et al, "On the Impact of Communication Complexity on the Design of Parallel Numerical Algorithms;" IEEE Transactions on Computers, C-33(12), pp. 1180-1194, (December 1984); and Kuck et al, "The Effects of Program Restructuring, Algorithm Change, and Architecture Choice On Program Performance;" IEEE Transactions on Computers, pp. 129-138, (January 1984) both incorporated herein by reference.
Maintaining a balanced computational load among traditional control flow processors is very difficult. In a data driven parallel processor, however, it is possible to have the load perfectly balanced among the processors, since ready instructions may be executed by any available processor.
Data driven processing also exploits more of the parallelism present in a problem than control flow can. At any time in a computation there may be many instructions whose operands are ready and may therefore be executed. A control flow processor would have to execute them in its predefined sequence, while a data driven processor may execute them in any order and in fact may execute them in parallel if additional processors are available. Since a computation is built around operations and not stored data, the operations may be stored in any of the multiple processors and the computation is still guaranteed to give the same result, only faster because more processors were available to work on it.
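The order independence described above can be demonstrated with a small simulation. This is a sketch under stated assumptions only: the table PROGRAM and the function execute are hypothetical, a Python dictionary of values stands in for operands passed directly between operations, and a seeded random choice among ready instructions stands in for whichever processor happens to be available.

```python
import random

# Hypothetical program: result name -> (operation, names of the operands it consumes)
PROGRAM = {
    "c": (lambda a, b: a + b, ("a", "b")),
    "e": (lambda c, d: c * d, ("c", "d")),
    "f": (lambda a, d: a - d, ("a", "d")),
    "g": (lambda e, f: e + f, ("e", "f")),
}


def execute(program, inputs, seed):
    """Run the program, picking any ready instruction at random on each step."""
    rng = random.Random(seed)
    values = dict(inputs)      # operands produced so far
    pending = set(program)     # instructions that have not fired yet
    order = []                 # the firing order actually taken
    while pending:
        # An instruction is ready only when every one of its operands exists.
        ready = sorted(n for n in pending
                       if all(x in values for x in program[n][1]))
        if not ready:
            raise ValueError("no instruction is ready; an input is missing")
        chosen = rng.choice(ready)
        fn, operands = program[chosen]
        values[chosen] = fn(*(values[x] for x in operands))
        pending.remove(chosen)
        order.append(chosen)
    return values, order
```

With inputs a = 1, b = 2 and d = 3, every seed yields g = 7, although "c" and "f" may fire in either order. No firing order can place an instruction ahead of the operations that produce its operands.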
Examples of prior art implementations of both control flow architectures and data flow architectures will place in perspective some of the present problems and some of the attempted solutions.
The NEDIPS data flow computer architecture is targeted for applications in image processing. This architecture is described in ITO et al, "NEDIPS: A Non-von Neumann High-Speed Computer", 78 NEC Research and Development, pp. 83-90 (July 1985). It, like many of the data flow architectures being developed, uses a special control flow processor to track the arrival of data items and match them with other data items to schedule instructions. Data that has arrived but is insufficient to fire an instruction must be held in a special queueing memory until the rest of the data arrives.
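The general idea of holding an arrived data item until its partner arrives can be sketched as follows. This is a generic illustration and does not reproduce the NEDIPS design: the names Token and MatchingStore are hypothetical, every instruction is assumed to take exactly two operands, and a Python dictionary stands in for the queueing memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Token:
    dest: str      # identifier of the instruction the data item is bound for
    port: int      # which operand of that instruction (0 or 1)
    value: float


class MatchingStore:
    """Holds data items that cannot yet fire their two-operand instruction."""

    def __init__(self) -> None:
        self.waiting: Dict[str, Token] = {}

    def arrive(self, token: Token) -> Optional[Tuple[float, float]]:
        """Return the operand pair if the instruction can fire, else hold the token."""
        partner = self.waiting.pop(token.dest, None)
        if partner is None:
            self.waiting[token.dest] = token   # insufficient data: queue it
            return None
        # Both operands are present; order them by port before firing.
        first, second = sorted((partner, token), key=lambda t: t.port)
        return (first.value, second.value)
```

The first token for a destination is held and arrive returns None. The second token for the same destination releases both operands in port order. Every arrival costs a lookup in the store, which is the per-item overhead of a separate matching step.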
The Eazyflow Engine architecture is a demand driven (rather than data driven) machine. This architecture is described in the article, Jagannathan et al, "Eazyflow Engine Architecture", 1985 Conference on Computers and Communications (Phoenix) pp 161-165 (IEEE reference CH2154-3/85). Instructions are not evaluated unless their result is required by another instruction. "Wasted" computations are therefore avoided, but at potentially great hardware expense. A separate matching memory is used (similar to the NEDIPS machine) to track data. The article suggests that this could be implemented as a content addressable memory, an approach very costly in hardware complexity and speed. The other recommended implementation is to search the matching memory, an approach too costly to be feasible on a high performance machine.
A processing element proposed for a multiprocessing data flow machine at MIT is representative of many proposed data flow architectures. It is described in Arvind et al, "A Processing Element for a Large Multiple Processor Dataflow Machine", 1980 IEEE Int'l Conference on Circuits and Computers, pp 601-605 (IEEE reference CH1511-5/80). Once again a separate matching memory is used to track arriving data. The costs associated with such an approach are outlined above. When all the data required by an operation has arrived, the address of the operation is placed on a queue of ready instructions. The actual instruction must then be fetched and executed--much the same as in a control flow machine. Similarly, U.S. Pat. No. 4,943,916 entitled, "Information Processing Apparatus for a Data Flow Computer," to Asano et al. uses tag fields to associate the required data with an operation. When an arriving datum enters the processor, Asano '916 searches the tag fields of all other stored data, and only if the tag fields are identical is the stored data retrieved and an operation performed. Asano '916, however, does not disclose how this association is accomplished. Because of the tag-based association, the tokens of Asano '916 require more information than the tokens of the invention, namely the tag field. The problem Asano '916 specifically addresses is how to generate new tag values. The invention herein does not utilize tag fields to associate data; rather, the data is directly matched in memory.
Another data flow processor from MIT (described in Dennis et al, "The Data Flow Engineering Model", 1983 IFIP Information Processing), uses microcoded processors to match data and instructions. The matching functions were then completely programmable, but many system clocks were required to determine if an instruction could fire. This processor is also believed to be the subject of U.S. Pat. Nos. 3,962,706, 4,145,733 and 4,153,932.
The SPS-1000 is a data flow architecture that utilizes data flow to schedule tasks. This is described in Fisher, "The SPS-1000: A Data Flow Architecture"; 1982 Peripheral Array Processors Proceedings, pp. 77-82. The parallelism it can exploit is therefore limited to task level parallelism, as opposed to operation level parallelism. The processing elements are essentially control flow processors, with polling of main memory used to allow run-time scheduling of the various image processing tasks.
The Manchester Data Flow Computer, described in Burd, "The Manchester Dataflow Machine"; Proceedings of Int'l Symposium on Fifth Generation and Supercomputers (December 1984), was the result of one of the first attempts to build a data flow computer. It shares many characteristics with its successors including a separate matching section and instruction queue. It, like the second MIT machine described above, relies heavily on microcoded processors for its major functions. This severely degrades the machine's overall throughput as described above. Being one of the first ventures into this field, this architecture was not aimed at parallel processing implementations.
A data flow computer architecture from France, the LAU computer described in Plas et al, "LAU System Architecture: A Parallel Data-Driven Processor", Proceedings of 1976 Int'l Conference on Parallel Processing, pp. 293-302 (1976), was able to exploit operation level parallelism. It also used tags to mark the status of instructions. The tags were explicitly manipulated by the microcoded processors and not automatically manipulated with memory accesses. It shared the disadvantages of its reliance on microcoded processors with the other similar machines described above.
The Data Flow Accelerator Machine (DFAM), described in Davidson et al, "A Data Flow Accelerator for General Purpose Processors," Sandia National Lab. Tech. Report SAND-0710 (March 1986), and developed at Sandia National Labs, is an intelligent memory that can be added to conventional multiprocessor implementations having a shared memory architecture. Tagged memory is used, but the tags are used to track parameters for task level scheduling.
The foregoing review of some of the prior art computer architectures demonstrates that there is a need for new computer architectures suitable for massively parallel computing. The motivation is the present ever-increasing demand for processing throughput to run the scientific codes required by research and other engineering and applied science activities. The computer architectures need not be general purpose so long as the machines are able to solve a particular problem of interest and deliver significantly higher performance than currently available computers.
The operations required for the computation in data driven parallel processing do not change. The parameters required for a given computation also must remain the same. The only difference is that the operations are performed by physically separate processors. The actual order of instruction execution may change, but this is not a problem since a data driven processor by definition cannot execute instructions which are not ready for execution. Problems, once formulated to run on a single data driven processor, can therefore be migrated unchanged to parallel processing implementations. This is in stark contrast to the large amount of work required to migrate a control flow code that was built to run on a uni-processor to a parallel processor.
Thus there is still needed a data flow machine (DFM) that incorporates all of the advantages of the prior art data flow machines, yet can execute instructions as soon as they arrive at the processor without waiting for further data or instruction fetches.