1. Field of the Invention
The present invention relates to methods and apparatus for performing high-speed digital computations of programmed large-scale numerical and logical problems, in particular to such methods and apparatuses making use of data-flow principles that allow for highly parallel execution of computer instructions and calculations.
2. Description of the Technology
In order to meet the computing requirements of future applications, it is necessary to develop architectures that are capable of performing billions of operations per second. Multiprocessor architectures are widely accepted as a class of architectures that will enable this goal to be met for applications that have sufficient inherent parallelism.
Unfortunately, the use of parallel processors increases the complexity of programming computers by requiring that the program be partitioned into concurrently executable processes, that these processes be distributed among the multiple processors, and that asynchronous control be provided for parallel process execution and inter-process communication. An applications programmer must partition and distribute his program to multiple processors, and explicitly coordinate communication between the processors or through shared memory.
Applications programming is extremely expensive even using current single-processor systems, and is often the dominant cost of a system. Software development and maintenance costs are already very high without programmers having to perform the additional tasks described above. High-performance multiprocessor systems for which software development and maintenance costs are low must perform these extra tasks on the programmer's behalf and be programmable in a high-level language.
There are different classes of parallel processing architectures that may be used to obtain high performance. Systolic arrays, tightly coupled networks of von Neumann processors, and data flow architectures are three such classes.
Systolic arrays are regular structures of identical processing elements (PEs) with interconnection between PEs. High performance is achieved through the use of parallel PEs and highly pipelined algorithms. Systolic arrays are limited in the applications for which they may be used. They are most useful for algorithms which may be highly pipelined to use many PEs whose intercommunications may be restricted to adjacent PEs (for example, array operations). In addition, systolic arrays have limited programmability. They are "hardwired" designs in that they are extremely fast, but inflexible. Another drawback is that they are limited to using local data for processing. Algorithms that would require access to external memories between computations would not be suitable for systolic array implementation.
Tightly coupled networks of von Neumann processors typically have the PEs interconnected using a communication network, with each PE being a microprocessor having local memory. In addition, some architectures provide global memory between PEs for interprocessor communication. These systems are best suited for applications in which each parallel task consists of code that can be executed efficiently on a von Neumann processor (i.e., sequential code). They are not well suited for taking full advantage of low-level (micro) parallelism that may exist within tasks. When used for problems with low-level parallelism they typically give rise to large ALU (arithmetic and logical unit) idle times.
Data flow multiprocessor architectures based on the data flow graph execution model implicitly provide for asynchronous control of parallel process execution and inter-process communication, and when coupled with a functional high-level language can be programmed as a single PE, without the user having to explicitly identify parallel processes. They are better suited to taking advantage of low-level parallelism than von Neumann multiprocessor architectures.
The data flow approach, as opposed to the traditional control flow computational model (with a program counter), lets the data dependencies of a group of computational operations determine the sequence in which the operations are carried out. A data flow graph represents this information using nodes (actors) for the operations and directed arcs for the data dependencies between actors. The output result from an actor is passed to other actors by means of data items called tokens which travel along the arcs. The actor execution, or firing, occurs when all the actor's input tokens are present on its input arcs. When the actor fires, or executes, it uses up the tokens on its input arcs, performs its intended operation, and puts result tokens on its output arcs. When actors are implemented in an architecture they are called templates. Each template consists of slots for an opcode, operands, and destination pointers, which indicate the actors to which the results of the operation are to be sent.
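By way of illustration only, the template structure and firing rule described above can be sketched in software as follows. This is a minimal sketch and not the disclosed hardware: the names Template, run, and example, the two-operand opcodes, and the representation of a token as a (template name, slot index, value) tuple are all assumptions made for the example.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical opcode table; the source does not enumerate opcodes.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

@dataclass
class Template:
    opcode: str
    operands: list                                     # operand slots; None marks an empty slot
    destinations: list = field(default_factory=list)   # (template name, slot index) pairs

    def ready(self):
        # The firing rule: every operand slot holds a token.
        return all(v is not None for v in self.operands)

def run(templates, initial_tokens):
    """Deliver tokens one at a time and fire any template whose slots are all full."""
    pending = list(initial_tokens)      # each token: (template name, slot index, value)
    results = {}
    while pending:
        name, slot, value = pending.pop()
        t = templates[name]
        t.operands[slot] = value
        if t.ready():
            a, b = t.operands
            out = OPS[t.opcode](a, b)
            t.operands = [None] * len(t.operands)   # firing consumes the input tokens
            results[name] = out
            for dest_name, dest_slot in t.destinations:
                pending.append((dest_name, dest_slot, out))  # result token travels along an arc
    return results

def example():
    # Data flow graph for (a + b) * (a - b) with a = 5, b = 3.
    templates = {
        "sum": Template("add", [None, None], [("prod", 0)]),
        "diff": Template("sub", [None, None], [("prod", 1)]),
        "prod": Template("mul", [None, None]),
    }
    tokens = [("sum", 0, 5), ("sum", 1, 3), ("diff", 0, 5), ("diff", 1, 3)]
    return run(templates, tokens)
```

In this sketch no program counter orders the operations: "sum" and "diff" become ready independently, and "prod" fires only after both have delivered their result tokens.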
The data flow graph representation of an algorithm is the data dependency graph of the algorithm. The nodes in the graph represent the operators (actors) and the directed arcs connecting the nodes represent the data paths by which operands (tokens) travel between operators (actors). When all the input tokens to an actor are available, the actor may "fire" by consuming its input tokens, performing its operation on them, and producing some output tokens. In most definitions of data flow a restriction is placed on the arcs and actors so that an arc may have at most one input token on it at a time. This implies that an actor may not fire unless all of its output arcs are empty. A more general definition allows for each arc to be an infinite queue into which tokens may be placed.
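The difference between the one-token-per-arc restriction and the more general queued definition can be illustrated with the following minimal sketch. The names Arc, can_fire, fire, and demo are hypothetical, and for simplicity the input arcs in the demonstration are unbounded so that they can be preloaded with two tokens each; only the capacity of the output arc is varied.

```python
from collections import deque

class Arc:
    def __init__(self, capacity=1):
        self.capacity = capacity      # 1 = restricted definition, None = unbounded queue
        self.tokens = deque()

    def has_token(self):
        return bool(self.tokens)

    def has_room(self):
        return self.capacity is None or len(self.tokens) < self.capacity

def can_fire(inputs, outputs):
    # An actor may fire only if every input arc holds a token
    # and every output arc can accept one more token.
    return all(a.has_token() for a in inputs) and all(a.has_room() for a in outputs)

def fire(op, inputs, outputs):
    if not can_fire(inputs, outputs):
        return False
    args = [a.tokens.popleft() for a in inputs]   # consume one token per input arc
    result = op(*args)
    for a in outputs:
        a.tokens.append(result)                   # produce one token per output arc
    return True

def demo(output_capacity):
    x, y = Arc(capacity=None), Arc(capacity=None)
    out = Arc(capacity=output_capacity)
    x.tokens.extend([1, 2])
    y.tokens.extend([10, 20])
    first = fire(lambda a, b: a + b, [x, y], [out])
    second = fire(lambda a, b: a + b, [x, y], [out])
    return first, second, list(out.tokens)
```

With an output capacity of one, the second firing attempt is refused because the output arc still holds the first result; with an unbounded output arc, both firings succeed and the results queue up in order.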
All data flow architectures consist of multiple processing elements that execute the actors in the data flow graph. Data flow architectures take advantage of the inherent parallelism in the data flow graph by executing in separate PEs those actors that may fire in parallel. Data flow control is particularly attractive because it can express the full parallelism of a problem and reduce explicit programmer concern with interprocessor communication and synchronization.
In U.S. Pat. No. 3,962,706--Dennis et al., a data processing apparatus for the highly parallel execution of stored programs is disclosed. Unlike the present invention, the apparatus disclosed makes use of a central controller and global memory and therefore suffers from the limitations imposed by such an architecture.
U.S. Pat. No. 4,145,733--Misunas et al. discloses a more advanced version of the data processing apparatus described in U.S. Pat. No. 3,962,706. However, the apparatus disclosed still contains the central control and global memory that distinguish it from the present invention.
U.S. Pat. No. 4,145,733--Misunas et al. discloses another version of the apparatus disclosed in the previous two patents, distinguished by the addition of a new network apparently intended to facilitate expandability, but not related to the present invention.
In U.S. Pat. No. 4,418,383--Doyle et al. a large-scale integration (LSI) data flow component for processor and microprocessor systems is described. It bears no substantive relation to the processing element of the present invention, nor does it teach anything related to the data flow architecture of the present invention.
None of the inventions disclosed in the patents referred to above provides a processor designed to perform image and signal processing algorithms and related tasks that is also programmable in a high-level language which allows exploiting a maximum of low-level parallelism from the algorithms for high throughput.
The present invention is designed for efficient realization with advanced VLSI circuitry using a smaller number of distinct chips than other data flow machines. It is readily expandable and uses short communication paths that can be quickly traversed for high performance. Previous machines lack the full capability of the present invention for large-throughput real-time applications in data and signal processing in combination with easy programmability in a high-level language.
The present invention aims specifically at making it possible to perform signal processing and the related data processing functions, including tracking, control, and display processing, on the same processor. An instruction-level data flow (micro data flow) approach and compile-time (static) assignment of tasks to processing elements are used to get efficient runtime performance.
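The general idea of compile-time (static) assignment can be illustrated with the following minimal sketch. It is not the allocation method of the present invention, which is not described in this section: the function name assign_static, the per-actor cost estimates, and the greedy least-loaded heuristic are all assumptions made for the example, and the sketch ignores communication costs between PEs.

```python
def assign_static(actor_costs, num_pes):
    """Map each actor to a PE before execution begins.

    Hypothetical greedy heuristic: take actors from most to least costly
    and place each on the PE with the smallest accumulated load.
    """
    loads = [0] * num_pes
    assignment = {}
    # Sort by descending cost, breaking ties by name so the result is deterministic.
    for actor, cost in sorted(actor_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        pe = min(range(num_pes), key=lambda i: (loads[i], i))
        assignment[actor] = pe
        loads[pe] += cost
    return assignment, loads

# Illustrative actor names and cost estimates (assumed, not from the source).
assignment, loads = assign_static({"fft": 4, "filter": 3, "scale": 2, "sum": 1}, 2)
```

Because the mapping is fixed before the program runs, no scheduling decision has to be made at runtime; each result token can be routed to a PE that was determined in advance.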