The invention relates generally to the field of parallel computer architectures and more particularly to a sequencing and fan-out mechanism for use in dataflow machines. Dataflow machines are those machines whose actions are determined by the availability of the data needed for those actions. The invention enables greater availability of data when it is needed, and greater flexibility and efficiency for the actions that can be accomplished.
Recently, parallel computing has achieved two advances: first, that it is possible to efficiently execute programs as dataflow graphs thereby dynamically exploiting the maximum amount of parallelism; and second, that, if feasible, static direction of parallel execution can be very efficient. The ability to dynamically exploit parallelism is required to execute a general class of problems in parallel, but the cost of synchronization and context switching determines the granularity of the parallelism that can be efficiently exploited. Therefore, it is important to minimize these costs. As the costs are minimized, more parallelism is exposed for masking latency, filling pipelines, and keeping multiple processors busy. Ideally, the system should statically schedule finer grains to pay the synchronization cost for only the amount of parallelism required.
A high performance computer architecture may exploit parallelism in several ways. Traditionally, uniprocessors use parallelism to allow pipelined execution. Multiprocessors use parallelism to keep multiple processors busy. In general, however, a processor must use parallelism to mask long unpredictable memory latencies, or suffer a performance degradation. If a processor is unable to mask unpredictable latencies with useful work, it simply idles.
Parallelism may be exploited under explicit direction from the programmer, or preferably, it may be exploited as necessary by the architecture. For any program, parallelism is most abundant at the operation level. Partitioning a problem into larger grains will mask inherent parallelism. This technique may be necessary if the target architecture cannot efficiently switch between parallel activities, and may be desired if there is still sufficient parallelism at this larger grain size. But if the synchronization between parallel activities is not efficient, the use of fine grain parallelism will not be practical.
The architecture of U.S. patent application, Ser. No. 223,133, abandoned, now U.S. continuation application Ser. No. 07/559,523, filed Jul. 24, 1990, entitled "Data Flow Machine for Data Driven Computing," by Davidson et al., has demonstrated that very low cost synchronization and context switches at the instruction level are feasible. However, that architecture does not take advantage of static scheduling to further reduce the parallel overhead and, therefore, the execution time. Other architectures take advantage of static scheduling for more efficient execution, but these prior art architectures have poor support for run-time, dynamic parallelism.
New parallel computing architectures do not encompass the traditional von Neumann architecture because of the performance demands for processing large scientific codes. The new architectures require mapping, that is, that the problem be explicitly partitioned among parallel processors. Mapping can be a difficult and time consuming task.
The problem of data fanout within a processor is unique to dataflow machines. A conventional control flow machine accesses data by reading from a particular memory location, as in U.S. Pat. No. 4,858,105, entitled "Pipelined Data Processor Capable of Decoding and Executing Plural Instructions in Parallel," to Kuriyama et al. The processor of Kuriyama et al., however, as with all control flow machines, cannot schedule instructions on the availability of data and, consequently, cannot exploit fine grain parallelism. A control flow machine can access any data item any number of times for any purpose by reading the data item from memory or from some internal register. A dataflow processor can emulate this method of operation by reading the data item for each use, but this method adds unnecessary overhead to the computation and slows the processing considerably. So, to reduce the overhead, a dataflow processor writes data items to the instructions that require them. Usually, in a dataflow processor, a data item and an address are transmitted together as a token to a processor. The system works well as long as there are just a few uses of each data item, but slows down when a data item is needed by many instructions. U.S. Pat. No. 4,943,916, entitled "Information Processing Apparatus for a Data Flow Computer," to Asano et al. teaches tagging data for use with other data of the same tag to be operated upon by the same instruction; however, Asano et al. do not suggest that sequential scheduling of instructions is possible.
There have been several previous approaches to distributing the data to the processors executing the instructions, or data fanout. One approach duplicates the overhead of a control flow processor, surrendering many of the benefits of dataflow processing. A second approach allows a large number of destinations for each computation, a system which uses more complicated hardware to speed the process when fanout is required, but which slows the process when fanout is not required. A third approach uses trees of instructions to generate copies of the data items, allowing the highest speed processors to be built, but requiring the data to be duplicated for fanout at a considerable cost of execution time.
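The cost of the third approach can be illustrated with a rough sketch. The function below counts the copy instructions and the levels of copying needed to deliver one data item to a given number of consuming instructions. It assumes, purely for illustration, that every instruction (the producer included) can name a fixed, small number of destinations, two by default; that limit, the function name `copy_tree_cost`, and its parameters are hypothetical and are not taken from any particular prior art machine.

```python
def copy_tree_cost(consumers: int, destinations_per_instruction: int = 2):
    """Return (copy_instructions, copy_levels) for a fanout tree.

    Assumption (hypothetical): each instruction, including the producer,
    can send its result to at most `destinations_per_instruction` targets.
    """
    if consumers < 1 or destinations_per_instruction < 2:
        raise ValueError("need at least 1 consumer and 2 destinations")

    d = destinations_per_instruction
    available = d          # outputs the producer can reach directly
    copies = 0             # copy instructions inserted so far
    levels = 0             # levels of copy instructions between producer and users

    while available < consumers:
        # Each output turned into a copy instruction yields d outputs
        # in place of one, a net gain of d - 1.
        shortfall = consumers - available
        needed = -(-shortfall // (d - 1))      # ceiling division
        converted = min(needed, available)
        copies += converted
        available += converted * (d - 1)
        levels += 1

    return copies, levels


# Eight consumers with two destinations per instruction.
eight_way = copy_tree_cost(8)
```

Under these assumptions, every copy instruction in the tree occupies the processor's execution resources without performing useful computation, which is the execution-time cost referred to above.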
Hence, it is a purpose of this invention to provide for an improved data fanout mechanism in a dataflow machine. This object is achieved by preprogrammed repeated copying of the data so that the fanout occurs in a known order and with known timing so that the data is available when needed. Moreover, the data fanout mechanism of the invention reduces the latencies caused by previous methods of data fanout because it does not require use of the processor's execution resources.
An additional advantage of this data fanout mechanism is reduced network bandwidth requirements because the responsibility for making copies of a data item is placed on the processing node which is the user of the copies, rather than the data producer. In addition, producers of data can be connected to multiple users at different times which allows greater usability of code.
It is another object of this invention to provide information in the incoming data token to cause the processor to repeat the data as necessary. This object is achieved because the data is repeated to the instructions according to offsets and repeat counts in the token format.
It is another object of the invention to attain efficient sequential execution of instructions in a dataflow machine. Prior to this invention, in a pure dataflow machine, the processor had no means of determining which data and which instructions immediately preceded or immediately followed the data being processed by the current instruction. With the present invention, because the processor knows which data values are accessible and the order in which instructions are to be executed, a data item or an instruction can be transferred to high speed registers and then back into the ALU for use in the next instruction, as opposed to generating a resultant data token, placing that token into a queue, and sending it back through the pipeline for execution by another instruction.
It is yet another object of the invention to allow dataflow scheduling of fully ordered sequences of instructions scheduled as a unit by, for instance, including a register file to store temporary results computed within an ordered instruction sequence. This object is achieved by ordering a series of instructions requiring repeated use of a token, and by the ability of the invention to store and access temporary results, e.g., in a high speed register.
It is yet another object of the invention to increase the ability of dataflow machines to exploit the teachings of von Neumann and control flow computing in terms of hardware, configuration, compiling techniques, and the availability of algorithms. This object is achieved by scheduling ordered sequences of instructions and sharing data within these sequences.
It is another object of the invention to have high speed dataflow processing with excellent uniprocessor performance. This object is achieved through the use of data fanout, the repeated use of the data, and sequential scheduling of instructions.
It is also an object of this invention to provide a processor input responsive to the token for selecting a sequence of operations to be performed by the processor.
Additional objects, advantages, and novel features of the invention will become apparent to those skilled in the art upon examination of the following description or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
To achieve the foregoing and other objects, and in accordance with the purpose of the present invention, as embodied and broadly described herein, the present invention may comprise, in a dataflow architecture wherein processing instructions are stored in memory, a processing means responsive to a token, wherein the token identifies and causes a set of processing instructions to be executed in a predetermined order. The token comprises an address in a memory location associated with an operation to be performed on data within the token; an offset representing a displacement of memory locations from the memory location associated with the previous operation; and a repeat count. The processing cycle on the token comprises executing a first operation on the token's data, decrementing the repeat count, identifying the subsequent operation to use the data by moving through the memory locations by the amount of the offset, performing the operation associated with that memory location, decrementing the repeat count, and repeating the above steps until the repeat count reaches zero.
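The processing cycle just described can be sketched in software as follows. This is a minimal illustration only, not the hardware embodiment: the names `Token` and `process_token`, the use of Python callables to stand in for stored operations, and the convention that the repeat count equals the number of operations that receive the data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    address: int       # memory location of the first operation
    offset: int        # displacement from one operation to the next
    repeat_count: int  # assumed: number of operations that use the data
    data: float        # the data item carried by the token


def process_token(token: Token,
                  instruction_memory: Dict[int, Callable[[float], float]]
                  ) -> List[float]:
    """Apply each addressed operation to the token's data, in order."""
    results = []
    address = token.address
    remaining = token.repeat_count
    while remaining > 0:
        operation = instruction_memory[address]
        results.append(operation(token.data))  # execute on the token's data
        remaining -= 1                         # decrement the repeat count
        address += token.offset                # move by the offset
    return results


# Hypothetical instruction memory with operations at locations 2, 4 and 6.
memory = {
    2: lambda x: x + 1,
    4: lambda x: x * 2,
    6: lambda x: -x,
}
fanout_results = process_token(
    Token(address=2, offset=2, repeat_count=3, data=5.0), memory)
```

In this sketch a single token delivers the value 5.0 to three operations in a known order, so no separate copy instructions are executed to achieve the fanout.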
An alternative sequencing and fanout mechanism is provided by the dataflow computer processor and method of dataflow computer processing wherein the means for identifying the location of the immediate successor operation of the sequence of operations is in the current instruction and further comprises the memory location of the successor operation.
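The alternative mechanism can be sketched in the same spirit. Here each stored instruction carries the memory location of its immediate successor, so the sequence need not be evenly spaced in memory. The names `Instruction` and `run_sequence`, and the use of `None` to mark the end of a sequence, are hypothetical choices made for this example and are not taken from the described embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instruction:
    operation: Callable[[float], float]
    successor: Optional[int]  # location of the next operation; None ends
                              # the sequence (an assumed convention)


def run_sequence(start: int, data: float,
                 instruction_memory: Dict[int, Instruction]) -> List[float]:
    """Follow successor locations, applying each operation to the data."""
    results = []
    address: Optional[int] = start
    while address is not None:
        instruction = instruction_memory[address]
        results.append(instruction.operation(data))
        address = instruction.successor  # successor is read from the
                                         # current instruction
    return results


# Hypothetical instruction memory: the sequence visits 10, then 3, then 7.
linked_memory = {
    10: Instruction(lambda x: x + 1, successor=3),
    3: Instruction(lambda x: x * x, successor=7),
    7: Instruction(lambda x: x - 2, successor=None),
}
linked_results = run_sequence(10, 4.0, linked_memory)
```

Compared with the offset and repeat count carried in the token, this form places the sequencing information in the stored instructions themselves, which permits arbitrary placement of the operations in memory.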