Current dataflow frameworks express computation as a linear flow in which the output of one process feeds into another process. Most dataflow computations are acyclic, meaning that the operators are sequenced in a linear order so that the inputs of each operator come from the outputs of “previous” operators. Running the operators in that linear order ensures that every operator has its input available. In previous implementations designed for Directed Acyclic Graphs (DAGs), iteration of a sub-graph is conventionally achieved by repeating that sub-graph once per iteration (‘loop unrolling’). In loop unrolling implementations, the graph is static and constructed in advance of compute time. Hence the number of iterations is also static, and DAGs cannot support iteration loops that terminate when a data-dependent termination criterion is met. Moreover, graph size grows with the number of iterations. For nested iteration (i.e., loops within loops), the growth in graph size is proportional to the product of the loop lengths (i.e., the number of graph vertices in each loop scope).
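By way of illustration only, loop unrolling in a static DAG can be sketched as follows. The function names `unroll` and `unrolled_size`, the operator names, and the assumption that every loop has a fixed trip count known when the graph is built are hypothetical and are not taken from any particular framework.

```python
def unroll(body_ops, iterations):
    """Build a static chain DAG that repeats `body_ops` once per iteration.

    Hypothetical helper: returns (vertices, edges), where each vertex is
    named "<op>#<iteration>" and each edge links consecutive operators.
    """
    vertices = []
    edges = []
    prev = None
    for i in range(iterations):
        for op in body_ops:
            v = f"{op}#{i}"
            vertices.append(v)
            if prev is not None:
                edges.append((prev, v))  # output of prev feeds v
            prev = v
    return vertices, edges


def unrolled_size(body_size, trip_counts):
    """Vertex count after unrolling a body nested inside several loops.

    Assumes fixed trip counts; the result is the body size multiplied by
    every trip count, so nesting multiplies rather than adds.
    """
    size = body_size
    for n in trip_counts:
        size *= n
    return size


vertices, edges = unroll(["scale", "add"], iterations=3)
print(len(vertices), len(edges))
print(unrolled_size(2, [3, 4]))
```

Because the iteration count is an argument to graph construction, the sketch has no way to stop early based on values computed while the graph runs, which is the limitation described above.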
Cyclic, or iterative, graphs, on the other hand, typically require problem-dependent knowledge to schedule, because the vertices may not have a well-defined order and an operator's inputs may not be fully formed before the operator runs.
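The ordering difficulty can be illustrated with a minimal sketch that uses the topological sorter from the Python standard library (`graphlib`, available from Python 3.9). The helper `linear_order` and the operator names are hypothetical; the sketch only shows that a generic scheduler finds a linear order for an acyclic graph and finds none once a cycle is present.

```python
from graphlib import TopologicalSorter, CycleError


def linear_order(deps):
    """Return a linear execution order, or None if the graph has a cycle.

    `deps` maps each operator to the set of operators whose outputs it
    consumes (hypothetical representation chosen for this sketch).
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Acyclic pipeline: every operator's inputs come from earlier operators.
acyclic = {"load": set(), "scale": {"load"}, "store": {"scale"}}

# Iterative graph: "update" and "check" each consume the other's output.
cyclic = {"load": set(), "update": {"load", "check"}, "check": {"update"}}

print(linear_order(acyclic))
print(linear_order(cyclic))
```

In the cyclic case, deciding which of the mutually dependent operators runs first, and with what initial input, requires knowledge of the problem that the graph structure alone does not supply.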
In heterogeneous systems, that is, systems having a variety of processors and hardware logic, a programmer currently has to write extra code in order for a process on an accelerator device such as a graphics processing unit (GPU) or a field programmable gate array (FPGA) to accept data output from a second device such as a central processing unit (CPU) or a different GPU or FPGA. Often, writing code for memory synchronization and communication monopolizes the time of the programmer without contributing to the kind of computation that the programmer is seeking to express. Moreover, if this synchronization and communication code is poorly written, or is not kept up to date when the dataflow graph is modified, it can cause bugs or performance degradation.
Programming systems for GPUs typically rely on vendor-provided tools that require programmers to write code that explicitly controls the movement of data to and from the GPU, which is time-consuming and error-prone. This is in addition to the code that the programmers write to run on the GPU itself and accomplish the task for which the GPU was chosen.
Because existing dataflow systems rely on explicit control, programmers using these systems sacrifice performance, modularity, and reusability. Thus, coupling dataflow code with algorithm code presents a barrier to providing higher-level programming environments for programming GPUs and other accelerators. Existing dataflow execution engines such as MAPREDUCE, DryadLINQ, and PTask sacrifice flexibility for simplicity, making iteration and recursion difficult to express and support.
Conventional dataflow systems, whose objective is to design an instruction set that will execute on a homogeneous processor architecture specifically designed for dataflow, address iteration by defining special control-flow/iteration operators. These operators are inserted into the dataflow graph as first-class vertices around a sub-graph of vertices representing the computation to be executed within the loop. However, when this approach is applied to a heterogeneous system in which existing vertices represent relatively long-running activities, inserting vertices that do not correspond to computations of the workload is costly in terms of performance and consumption of memory and other system resources, including threads.