Heterogeneous computing uses different kinds of computing elements to accelerate applications. In domains such as computer vision and machine learning, it is common practice to pipeline computations across multiple stages. From a data-flow perspective, such pipelines rely on input/output (I/O) pipelining, in which the computational results of one stage are passed to the next, e.g., the output of operation A is used as an input of operation B. At each stage, an operation can be expressed by multiple kernel functions, each of which represents a series of computations performed on a specific computing element. As an example, a memory location may be modified by a central processing unit (CPU) function at the first stage and used as an input for a graphics processing unit (GPU) function at the second stage. Existing heterogeneous computing runtimes have no knowledge of the data flow between stages. As a result, data is copied back and forth between memory devices unnecessarily for use by the various computing elements. In addition, the existing data synchronization mechanism between a host and a computing element is rigid in the sense that host memory is always involved.
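
The cost of always routing data through host memory can be illustrated with a small counting model. The following is a minimal sketch and does not model any particular runtime: the names `Stage`, `host_mediated_copies`, and `minimal_copies` are hypothetical, and the model assumes that CPU functions operate directly on host memory while each GPU function operates on its own device memory.

```python
from dataclasses import dataclass

HOST = "cpu"  # assumption: CPU functions read and write host memory directly


@dataclass
class Stage:
    name: str
    device: str  # "cpu" or "gpu" in this sketch


def host_mediated_copies(stages):
    """List the copies made when every handoff goes through host memory."""
    copies = []
    for producer, consumer in zip(stages, stages[1:]):
        # The producer's result is first written back to host memory ...
        if producer.device != HOST:
            copies.append((producer.device, HOST))
        # ... and then copied from host memory to the consumer's memory.
        if consumer.device != HOST:
            copies.append((HOST, consumer.device))
    return copies


def minimal_copies(stages):
    """List the copies needed if the data flow between stages were known."""
    copies = []
    for producer, consumer in zip(stages, stages[1:]):
        # A copy is only required when the data changes memory device.
        if producer.device != consumer.device:
            copies.append((producer.device, consumer.device))
    return copies


if __name__ == "__main__":
    pipeline = [Stage("A", "cpu"), Stage("B", "gpu"), Stage("C", "gpu")]
    print("host-mediated:", host_mediated_copies(pipeline))
    print("data-flow aware:", minimal_copies(pipeline))
```

In this model, a three-stage CPU, GPU, GPU pipeline incurs three copies when host memory is always involved, whereas only the initial CPU-to-GPU copy is required by the data flow itself; the round trip between the two GPU stages is the kind of unnecessary transfer described above.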