In processing devices with decoupled architectures, memory access and computation are performed by separate (decoupled) hardware modules. For general-purpose computing, the hardware may be further decoupled by introducing a control processing module in addition to the memory access and computation modules.
Streaming applications produce interim data, that is, data that is produced by one hardware accelerator and consumed by another. Prior approaches use a memory-mapped buffer, attached as a peripheral, to store interim data. Such a buffer is generic to all interim data and is therefore not efficient for any particular access pattern. Alternative approaches use external memory (DRAM) to store interim data. All of these approaches require extra bus ports, which lead to lower bus speeds and larger gate counts.
In one prior design approach for devices with a decoupled architecture, a data-flow graph (DFG) is used to define the computation and a set of stream descriptors is used to define data access patterns. With this approach, hardware can be generated automatically from the DFG and stream descriptors. In addition, some efforts have been made to develop tools that allow programs written in high-level languages, such as C/C++, to be converted into hardware (for example, by programming the gates of an FPGA). The generated hardware tends to be inefficient unless the high-level language includes features, such as memory access threads and computation threads, that are flexible enough to describe both the computational task and the data movement. These features allow streaming data access to memory and/or to other hardware accelerators in a computation pipeline.
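The split between stream descriptors and a DFG can be illustrated with a minimal C sketch. It is not taken from any particular prior design: the descriptor fields (`start`, `stride`, `span`, `skip`, `count`), the type `stream_desc_t`, and the functions `stream_index`, `stream_read` and `dfg_dot` are hypothetical names chosen for illustration, and addresses are assumed to be element indices into a flat array. The descriptor alone determines the access pattern, while the computation sees only a contiguous stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stream descriptor; all field names are illustrative assumptions. */
typedef struct {
    uint32_t start;   /* index of the first element in memory */
    uint32_t stride;  /* step between consecutive elements within a span */
    uint32_t span;    /* number of elements read before a skip is applied */
    uint32_t skip;    /* step applied instead of stride after each span */
    uint32_t count;   /* total number of elements in the stream */
} stream_desc_t;

/* Memory index of the n-th stream element (n starts at 0). */
static uint32_t stream_index(const stream_desc_t *sd, uint32_t n)
{
    assert(sd != NULL && sd->span > 0 && n < sd->count);
    uint32_t full_spans = n / sd->span;  /* spans completed before element n */
    uint32_t within     = n % sd->span;  /* position inside the current span */
    /* Each completed span advances by (span - 1) strides plus one skip. */
    uint32_t span_step = (sd->span - 1u) * sd->stride + sd->skip;
    return sd->start + full_spans * span_step + within * sd->stride;
}

/* Memory access side: gather the described stream into a contiguous buffer. */
static void stream_read(const int32_t *mem, const stream_desc_t *sd, int32_t *out)
{
    for (uint32_t n = 0; n < sd->count; ++n)
        out[n] = mem[stream_index(sd, n)];
}

/* Computation side: a two-node DFG (multiply feeding an accumulate). */
static int64_t dfg_dot(const int32_t *a, const int32_t *b, uint32_t count)
{
    int64_t acc = 0;
    for (uint32_t i = 0; i < count; ++i)
        acc += (int64_t)a[i] * b[i];
    return acc;
}
```

With `start = 0`, `stride = 1`, `span = 2`, `skip = 4` and `count = 4`, `stream_index` visits memory indices 0, 1, 5 and 6. Changing the descriptor changes the access pattern without touching `dfg_dot`, which is the separation of concerns that the decoupled approach relies on.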
The use of high-level languages for hardware programming aids software engineers who lack system architecture or hardware expertise but are familiar with high-level languages (HLLs), such as C/C++, that are used to program embedded systems with DSPs or microcontrollers.
Stream descriptors have been used to access data in memory as streams and have also been used to generate stream data interface logic. In contrast, storage of interim data moving between computational modules has been handled by memory-mapped buffers and/or first-in, first-out (FIFO) buffers.
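For reference, a FIFO buffer of the kind mentioned above can be modeled in software as a fixed-depth ring buffer. The following is a minimal sketch under stated assumptions, not a description of any specific prior hardware: the depth `FIFO_DEPTH`, the type `interim_fifo_t`, and the functions `fifo_init`, `fifo_push` and `fifo_pop` are hypothetical. Note that the buffer delivers data strictly in arrival order, so it cannot reorder or skip elements to suit the consumer's access pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4u  /* illustrative depth; an assumption for this sketch */

/* Hypothetical interim-data FIFO between a producer and a consumer module. */
typedef struct {
    int32_t  slots[FIFO_DEPTH];
    uint32_t head;  /* next slot to read */
    uint32_t tail;  /* next slot to write */
    uint32_t used;  /* number of occupied slots */
} interim_fifo_t;

static void fifo_init(interim_fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->tail = 0;
    f->used = 0;
}

/* Producer side: returns false when the FIFO is full (producer must stall). */
static bool fifo_push(interim_fifo_t *f, int32_t value)
{
    assert(f != NULL);
    if (f->used == FIFO_DEPTH)
        return false;
    f->slots[f->tail] = value;
    f->tail = (f->tail + 1u) % FIFO_DEPTH;  /* wrap around at the end */
    f->used++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must wait). */
static bool fifo_pop(interim_fifo_t *f, int32_t *value)
{
    assert(f != NULL && value != NULL);
    if (f->used == 0)
        return false;
    *value = f->slots[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;  /* wrap around at the end */
    f->used--;
    return true;
}
```

In this model a producer calls `fifo_push` until it returns `false`, and a consumer calls `fifo_pop` to drain values in the same order in which they were written.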
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the elements in the figures may be simplified to aid understanding of embodiments of the present invention.