The present invention relates to the field of parallel processing devices composed of multiple processing elements, where each processing element executes multiple threads. More particularly, the present invention relates to improved or, ideally, maximized memory throughput on parallel processing devices for streaming computations.
Streaming computations, in which the elements of a stream can be processed independently of each other, are especially well suited to a parallel processing device. Each processing element, and each thread within it, can read in stream elements, process them, and store the results without communicating with other threads or other processing elements. In a streaming computation on a parallel processing device, each data element of a stream is read from a memory coupled to the processing device, combined with other data through appropriate logic or arithmetic operations, and the result is stored back into memory. Examples of such streaming computations are operations in BLAS (Basic Linear Algebra Subprograms), an industry-standard interface for vector and matrix operations, subdivided into BLAS-1 (vector-vector), BLAS-2 (matrix-vector), and BLAS-3 (matrix-matrix) functions.
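As a minimal sketch of the read, combine, and store pattern described above, the following simulates a BLAS-1 style SAXPY operation (y = a*x + y) divided among threads. The function names, the `num_threads` parameter, and the strided assignment of elements to threads are illustrative assumptions, not details taken from the text; a real device would run the threads concurrently rather than in a loop.

```python
# Sketch only: a BLAS-1 style SAXPY (y = a*x + y) split across simulated threads.
# The names and the strided partitioning are assumptions made for illustration.

def saxpy_thread(tid, num_threads, a, x, y):
    """Work of one simulated thread: every num_threads-th element, starting at tid."""
    for i in range(tid, len(x), num_threads):
        xi = x[i]            # read one stream element from memory
        yi = y[i]            # read the data it is combined with
        y[i] = a * xi + yi   # arithmetic operation, then store the result back


def saxpy(a, x, y, num_threads=4):
    """Run every simulated thread in turn.

    The threads touch disjoint elements and share no state, so the order in
    which they run does not matter; reverse order is used here to show that.
    """
    for tid in reversed(range(num_threads)):
        saxpy_thread(tid, num_threads, a, x, y)
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
result = saxpy(2.0, x, y)
```

Because no thread reads an element written by another thread, no communication or synchronization between threads is needed in this sketch.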
Streaming computations can be subdivided into two classes: those that benefit from caching and those that do not. A matrix multiply computation, for example, can benefit from caching because each input element contributes to many output elements; private memories on chip, for instance, can provide a software-controlled cache for the computation.
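The benefit of caching for matrix multiply can be illustrated by counting loads from memory. The sketch below is an assumption-laden model, not a device implementation: `CountedMatrix`, `matmul_direct`, and `matmul_tiled` are hypothetical names, and ordinary Python lists stand in for the private on-chip memory that holds a staged tile.

```python
# Sketch only: compare loads from "main memory" with and without staging tiles
# in a local buffer. All names here are hypothetical and chosen for illustration.

class CountedMatrix:
    """Square matrix whose element loads from main memory are counted."""

    def __init__(self, rows):
        self.rows = rows
        self.loads = 0

    def load(self, i, j):
        self.loads += 1
        return self.rows[i][j]


def matmul_direct(A, B, n):
    """Every multiply-add loads both operands from main memory."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += A.load(i, k) * B.load(k, j)
            C[i][j] = acc
    return C


def matmul_tiled(A, B, n, t):
    """Stage t-by-t tiles in local buffers, then reuse each staged value t times."""
    assert n % t == 0, "this sketch assumes the tile size divides n"
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Software-controlled cache: one load per tile element.
                a_tile = [[A.load(i0 + i, k0 + k) for k in range(t)] for i in range(t)]
                b_tile = [[B.load(k0 + k, j0 + j) for j in range(t)] for k in range(t)]
                for i in range(t):
                    for j in range(t):
                        acc = C[i0 + i][j0 + j]
                        for k in range(t):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[i0 + i][j0 + j] = acc
    return C


n, t = 4, 2
data_a = [[i * n + j for j in range(n)] for i in range(n)]
data_b = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]

a_direct, b_direct = CountedMatrix(data_a), CountedMatrix(data_b)
c_direct = matmul_direct(a_direct, b_direct, n)

a_tiled, b_tiled = CountedMatrix(data_a), CountedMatrix(data_b)
c_tiled = matmul_tiled(a_tiled, b_tiled, n, t)
```

In this model the direct version performs n^3 loads per input matrix, while the tiled version performs n^3 / t, since each staged element is reused t times from the local buffer.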
On the other hand, some streaming computations, e.g., MPEG stream computations, do not benefit from caching because each piece of data is used essentially once in the computation. Generally, the computation involves reading first data for a first portion of the stream from memory, performing a computation on the first data, storing a result back to memory, reading second data for a second portion of the stream, computing on the second data, storing results for the second portion, and so on. In this computational model, there is no data reuse, which negates any advantage that could be had from caching.
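The portion-by-portion model above can be sketched as follows, with a counter recording how often each input element is read. The function name `process_stream`, the portion size, and the per-element operation are assumptions made for illustration only.

```python
# Sketch only: process a stream one portion at a time and record how many
# times each input element is read. Names and sizes are hypothetical.

def process_stream(memory_in, portion_size, op):
    memory_out = [None] * len(memory_in)
    read_counts = [0] * len(memory_in)
    for start in range(0, len(memory_in), portion_size):
        stop = min(start + portion_size, len(memory_in))
        # Read the data for this portion of the stream from memory.
        portion = []
        for i in range(start, stop):
            read_counts[i] += 1
            portion.append(memory_in[i])
        # Perform the computation on this portion only.
        results = [op(value) for value in portion]
        # Store the results for this portion back to memory.
        memory_out[start:stop] = results
    return memory_out, read_counts


stream = list(range(10))
output, read_counts = process_stream(stream, 4, lambda v: v * v + 1)
```

Every entry of `read_counts` ends up equal to one, so a cache holding previously read elements would never be consulted again in this model.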