The use of parallel processors to increase the rate of computation is known in the art. Associated with such parallel structures is a high overhead requirement for communication and synchronization, as well as overhead expenses associated with inefficient use of processing units, such as by underutilization of the processing capability thereof.
More specifically, to increase throughput and computing speed, a plurality of processors are used in a parallel arrangement, known as a data flow architecture, in which a number of substantially identical modules are assigned to a solution of a particular problem.
Such a parallel processing architecture may be used when the problem being solved may be decomposed into a number of subproblems, each of which may be processed by one of the modules.
Illustratively, a data dependency graph, or data flow graph, is shown in FIG. 1. The graph illustrates a decomposition of a program for solving a complex problem by performing a number of simpler individual subprograms or tasks, shown at individual nodes of the graph. Arcs are shown to illustrate the flow of data among the nodes, and thus to identify the dependency of the subprogram performed at one node on the results of computations performed at other nodes.
In the flow graph of FIG. 1, input data is provided as X and Y, for example. Nodes 1-6 represent various computations which rely on the input data and on intermediate results determined by other nodes. All parameters, intermediate results, or input data necessary to perform the computation at a node are illustrated by incident arcs flowing into that node. The result of the computation at the node is sent by an arc originating at the node and incident on other nodes requiring that result.
A computation at a node begins as soon as all data required for the computation is present on the arcs incident on the node. In such a decomposition of a larger program into a sequence of tasks of reduced "grain" size, the progression of computations is seen to depend only on the flow of data. This dependence leads to the name "data flow computation" for the technique.
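The firing rule described above can be illustrated with a minimal sketch. The three-node graph, the arc names, and the function `run_dataflow` are hypothetical and are not taken from FIG. 1. The sketch shows only that the order of execution follows from the availability of data on the incident arcs.

```python
# Hypothetical graph: node name -> (operation, incident input arcs, output arc).
GRAPH = {
    "n1": (lambda x, y: x + y, ["X", "Y"], "a"),
    "n2": (lambda x, y: x * y, ["X", "Y"], "b"),
    "n3": (lambda a, b: a - b, ["a", "b"], "out"),
}

def run_dataflow(graph, inputs):
    """Fire each node once every incident arc carries data."""
    arcs = dict(inputs)       # arc name -> value currently present on the arc
    pending = set(graph)      # nodes that have not yet fired
    order = []                # firing order, recorded for inspection
    progress = True
    while pending and progress:
        progress = False
        for name in sorted(pending):   # sorted() copies, so removal below is safe
            operation, in_arcs, out_arc = graph[name]
            if all(arc in arcs for arc in in_arcs):
                arcs[out_arc] = operation(*(arcs[arc] for arc in in_arcs))
                pending.remove(name)
                order.append(name)
                progress = True
    return arcs, order

arcs, order = run_dataflow(GRAPH, {"X": 3, "Y": 4})
print(arcs["out"], order)
```

No program counter appears in the sketch. Node `n3` fires only after `n1` and `n2` have placed their results on arcs `a` and `b`.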
Prior implementations of data flow computational architectures are known, in which the topology of a data flow graph, representing a sequential program to be performed, is reflected by the architectural arrangement of the modules. For example, each node of the flow graph, representing an operation, is implemented by a template defined by a sequence of macro instructions. Such templates are shown in FIG. 2, arranged to perform a computation illustrated by the data flow graph of FIG. 1. As seen from FIG. 2, each template contains an operation identifier 8, slots 9 for one or more operands to be filled by data tokens as they arrive, and links 11 to other templates which are awaiting the result of the computation of the present template.
FIG. 3 shows a block diagram of a known data flow computer 10 for executing templates. Therein, a dispatcher 12 passes ready templates 14, having all required data, to an available processing module 16. When the computation is complete the template, along with its result, is passed to a matcher 18, which sends the results of the computation to all templates needing the result, as indicated by the link fields in the template. Those templates which are partially ready, i.e., those which have not yet completed the assigned computation, are kept in a storage section 20 for incomplete templates.
The matcher 18 matches newly available results, or operands, with the incomplete templates. If a particular operand is the last required by an incomplete template, the template is passed on to the dispatcher 12. The illustration of FIG. 3 describes the flow of the templates with their operands to the individual processors. From FIG. 3 the importance of the matcher is apparent. Such a unit is required to be quite fast in operation to keep from delaying computations, since any delay is multiplied by subsequent delays repeatedly introduced thereby. Moreover, it is also apparent that the storage of the matcher is required to be sufficiently large to hold all the partial, or incomplete, templates. These requirements tend to increase the expense associated with data flow computers.
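The interplay of templates, matcher, and dispatcher may be sketched in software as follows. This is an illustrative model only, not a description of any actual machine. The class `Template`, the function `run`, and the two-template example are assumptions introduced for the sketch. Each template holds an operation, operand slots, and links, as in FIG. 2.

```python
import operator
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Template:
    op: Callable                                  # operation identifier
    slots: list                                   # operand slots, None until filled
    links: list = field(default_factory=list)     # (target template, slot index)

    def ready(self):
        return all(slot is not None for slot in self.slots)

def run(templates, tokens):
    """tokens: iterable of (template name, slot index, value)."""
    incomplete = dict(templates)   # storage section for incomplete templates
    ready = deque()                # templates handed to the dispatcher
    results = {}

    def match(name, slot, value):
        # Matcher: fill one operand slot; pass the template on if it was the last.
        template = incomplete[name]
        template.slots[slot] = value
        if template.ready():
            ready.append((name, incomplete.pop(name)))

    for token in tokens:
        match(*token)
    while ready:
        name, template = ready.popleft()          # dispatcher picks a ready template
        value = template.op(*template.slots)      # processing module computes
        results[name] = value
        for target, slot in template.links:       # result goes to waiting templates
            match(target, slot, value)
    return results

templates = {
    "add": Template(operator.add, [None, None], [("mul", 0)]),
    "mul": Template(operator.mul, [None, None]),
}
print(run(templates, [("add", 0, 2), ("add", 1, 3), ("mul", 1, 4)]))
```

In the sketch every result passes through `match`, which suggests why the speed and storage capacity of the matcher dominate the cost of such a design.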
A typical data flow computer as shown in FIG. 3 requires an assignment of the specific templates to particular ones of the processors. In one approach, the dispatcher is made explicitly aware of which of the processors are not busy.
In an alternative approach illustrated by the EMSP computer of AT&T, described below, a transfer circuit is provided for passing the templates along a distribution bus which is accessed by each of the processors. In this approach, an idle processor plucks the template from the bus and executes the indicated operation or subprogram. Any template passing through all the processors without being picked up for computation must be passed again along the bus.
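The circulating-bus approach may be modeled, under simplifying assumptions, as follows. The function `circulate`, the fixed `work_time`, and the rule that one template passes all processors per time step are hypothetical choices made for illustration. They do not describe the timing of any actual bus.

```python
from collections import deque

def circulate(templates, n_processors, work_time):
    """Return (assignments, recirculations) for templates passed along a bus."""
    if n_processors < 1:
        raise ValueError("at least one processor is required")
    bus = deque(templates)
    busy = [0] * n_processors     # remaining busy time of each processor
    assignments = []              # (template, processor index) in pickup order
    recirculations = 0
    while bus:
        template = bus.popleft()
        for p in range(n_processors):
            if busy[p] == 0:      # an idle processor plucks the template
                busy[p] = work_time
                assignments.append((template, p))
                break
        else:
            bus.append(template)  # passed every processor: send it around again
            recirculations += 1
        busy = [max(0, b - 1) for b in busy]   # one time step elapses
    return assignments, recirculations

print(circulate(["t0", "t1", "t2"], n_processors=2, work_time=3))
```

Each recirculation in the sketch corresponds to a template occupying the bus without being executed, so that the bus must have capacity for every ready template.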
From the foregoing, it is seen that the data flow computers of the prior art require that sufficient capacity be provided in the distribution circuit and the distribution bus for simultaneously holding all the ready templates. Alternatively, sufficient storage must be provided for the templates and the results.
In one known implementation of a data flow computer, an iterative array is formed of signal processing chips of the type identified as uPD7281, provided by NEC. Each chip includes input and output controllers, connected by a bus and by a circular pipeline including: a link table for storing instruction parameters; a function table for instruction parameters; a data memory for constants or temporary data; FIFO data and generator queues; a processing unit for executing logical, arithmetic, and bit operations; and an address generator and flow controller for generating addresses for the data memory and for controlling the flow of tokens. A refresh controller generates refresh tokens for internal DRAMs and is connected to the input controller, and an output queue receives the output tokens from the data and generator queue and provides the same to the output controller. The array is arranged to form a large pipeline, wherein the output of one chip is provided as the input to the next.
In another architecture, known as the AT&T Technologies Enhanced Modular Signal Processor (EMSP) computer, a command program processor, an input/output processor and a number of arithmetic processors are connected to receive data from one data transfer network and to provide the results to a second data transfer network. The command program processor is in communication with displays and a separate tactical computer. The input/output processor receives sensor data and provides outputs to post processing displays, and the like. A scheduler and a number of global memory units receive data from the second data transfer network and provide operands to the first transfer network and further provide information to the command program processor.
In operation, data arrives from external sources and is stored in global memory queues. When a queue is full, the data transfer network, which is a cross bar switch, connects the queue to a free processing element. A primitive operation identifier and the address of the queue are given to that processor. In this system, the microcode for the primitives is resident in each of the processors in the system. The primitives are large scale computations found in signal processing, such as an n-point fast Fourier transform (FFT). The architecture of the EMSP is characterized by the three features of data accumulating queues, use of the cross bar switch rather than a circulating bus, and the storage of the microprograms for all of the primitives in each of the processors.
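The queue-driven dispatch described above may be sketched as follows. The names `GlobalQueue`, `feed`, and `PRIMITIVES` are hypothetical, and the trivial primitives `sum` and `max` stand in for large computations such as an FFT. The sketch further assumes that a processing element returns to the free pool immediately, which the EMSP description does not state.

```python
# Stand-ins for the primitive microcode resident in every processor.
PRIMITIVES = {"sum": sum, "peak": max}

class GlobalQueue:
    """A data-accumulating queue held in global memory."""
    def __init__(self, address, capacity, primitive_id):
        self.address = address
        self.capacity = capacity
        self.primitive_id = primitive_id
        self.data = []

    def push(self, sample):
        self.data.append(sample)
        return len(self.data) >= self.capacity   # True once the queue is full

def feed(queues, free_processors, samples):
    """samples: iterable of (queue address, value). Returns a dispatch log."""
    log = []
    for address, value in samples:
        queue = queues[address]
        if queue.push(value) and free_processors:
            # Connect the full queue to a free processing element and hand it
            # the primitive identifier and the queue address.
            processor = free_processors.pop(0)
            result = PRIMITIVES[queue.primitive_id](queue.data)
            queue.data = []
            log.append((processor, queue.primitive_id, address, result))
            free_processors.append(processor)    # assumed: freed at once
    return log

queues = {0: GlobalQueue(0, 3, "sum"), 1: GlobalQueue(1, 2, "peak")}
print(feed(queues, ["pe0", "pe1"], [(0, 1), (1, 5), (0, 2), (1, 9), (0, 3)]))
```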
Still another data flow computer is known as the Manchester data flow computer, which is similar to the description of the general device of the prior art as previously described with reference to FIG. 3. This computer is a heavily microcoded computer, with extensive matching circuitry. Performance thereof is reportedly comparable to a VAX 11/750 computer.
The structure includes an I/O switch in communication with the host and providing token packets to a token queue. From the queue, packets are provided to a matching unit which is paralleled by an overflow unit. An instruction store receives token-pair packets from the matching unit and provides executable packets to a processing unit which, in turn, provides token packets to the I/O switch.
Still another data flow computer, known as the MIT static Data Flow computer, is similar to the structure of FIG. 3 in many respects. However, an explicit control network is provided to provide control packets to the memory having the various instruction cells therein. An arbitration network receives the appropriate instructions from the cells. The arbitration network provides decision units to the control network and operation units to a distribution network for distribution to the memory cells.
A French LAU computer uses tag bits C0, C1, C2 and Cd to record when an instruction is ready for execution. The instructions are typically single operators. The LAU computer has microcoded processors built from AMD-2900 bit slice components. However, the tags are deeply connected to the micro-instructions. The tag bits, which are directly set and cleared by the microcode processors rather than being the side effect of a memory access, thus require considerable communication overhead. The LAU multiprocessor includes a group of elementary execution processors which are updated, read from and written into by external control units for instructions and data. A subsystem memory reads and writes operands from and to the processors, and provides instructions thereto. The control units and subsystem memory communicate the readiness of the instructions, and further communicate with an interface which is in communication with the host minicomputer.
As will be appreciated from the foregoing descriptions, the known data flow computers are typically extensions of processor technology, and use message passing techniques to communicate the templates and operands among the various sections thereof. It is thus a drawback of the prior art that significant storage and data handling capacity is required.
In view of the previously described prior art devices, it should be appreciated that, to increase utilization of the computing capacity of the system, and thus to reduce the computing overhead associated therewith, the size and complexity of the subprograms to be executed by the various computational modules should be made smaller. If the subproblems or tasks to be solved by the individual processors are of large size, or granularity, a requirement that each of the plural processors must be capable of performing the largest task size results in significant underutilization of the processing capabilities of some of the modules.
The degree of underutilization is increased by the large grain size of the subproblems, since a potential variation of 10% or 20% in complexity of the large subproblems results in a requirement for each of the modules to have processing capability which is 10% to 20% larger than a base average capability, which is itself large. On the other hand, in a reduced grain size arrangement the average processing capability is reduced, uniformity in processing requirement is increased, and the variation from the average requirement is also reduced.
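The arithmetic behind this observation can be illustrated with a short sketch. The task costs below are invented for illustration, and the function `sizing` is hypothetical. Both workloads have the same 20% relative variation about their average and differ only in grain size.

```python
def sizing(task_costs):
    """Size every module for the largest task and report the consequences."""
    capacity = max(task_costs)                 # each module must handle the worst case
    mean = sum(task_costs) / len(task_costs)   # average demand actually placed on it
    return {
        "capacity": capacity,
        "utilization": mean / capacity,
        "idle_per_module": capacity - mean,    # unused capability, in cost units
    }

large_grain = sizing([80, 100, 120])   # assumed costs, arbitrary units
small_grain = sizing([8, 10, 12])      # same relative spread, one tenth the size
print(large_grain)
print(small_grain)
```

Under these assumed figures the fractional utilization is the same in both cases, but the capability left idle in each module is ten times greater in the large-grain case, because the excess is measured against a base capability which is itself large.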
Thus, by reducing the grain size of the programs or tasks to be performed by each module, a more efficient use can be had of the parallel processing facilities, with reduced underutilization. Accordingly, with a higher degree of utilization of computing facilities fewer processors may be used, so that a less expensive system results. However, historically such attempts at smaller grain computations have led to larger costs in communication and synchronization.
There is thus a need in the prior art to provide more efficient use of parallel processors, by reducing the grain size of the programs or tasks, while simultaneously reducing the communication and synchronization costs associated therewith.