Standard image and signal processing algorithms generally process data within a sample window (e.g. a Gaussian blur). If this window “slides” neatly in raster order, producing this sample window is relatively straightforward and, to a large degree, does not become markedly more difficult as the sample window increases in size (e.g. larger radius Gaussian blurs).
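By way of illustration only, the raster-order case can be sketched in a few lines of Python. The function name `sliding_windows` and the column-deque representation are assumptions made for this sketch, not part of any described implementation. It shows that each step to the right reuses the previous window and reads only one new column of samples.

```python
from collections import deque


def sliding_windows(image, k):
    """Yield (top, left, window) for every k x k window in raster order.

    `image` is a list of equal-length rows. This is a hypothetical helper
    for illustration, not a described implementation.
    """
    height, width = len(image), len(image[0])
    for top in range(height - k + 1):
        # Hold the current window as its k most recent columns.
        cols = deque(maxlen=k)
        for x in range(width):
            # Only one new column (k samples) is read per step;
            # the oldest column drops out of the deque automatically.
            cols.append([image[top + dy][x] for dy in range(k)])
            if len(cols) == k:
                window = [[cols[dx][dy] for dx in range(k)] for dy in range(k)]
                yield top, x - k + 1, window
```

In this sketch the number of new samples read per step grows with the window height rather than with the window area, which is one way of seeing why larger sliding windows remain manageable.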
While this works for brute-force algorithms, algorithms that work on large data sets often need to perform more specific and targeted computations for the sake of computational efficiency. In this case, the sample windows that are computed upon are generally arbitrarily located within a larger search space. A similar situation occurs if the analysis is data-dependent. Producing sample windows for these more advanced algorithms is a much more difficult problem.
One solution stores the data representing the larger search space in a buffer and uses a series of random memory accesses into that buffer to gradually construct the required sample. However, this solution creates a bottleneck at the memory, limiting the speed of computation. This is because real memory has limited I/O capacity (i.e. a limited number of simultaneous read ports), so the speed at which “full” samples can be accessed is inversely related to the size of the sample window (e.g., algorithms with very large sample windows simply cannot run as quickly as those with smaller windows). Another issue is that, unless the sample is very small, the entire sample cannot be accessed at once. This strongly limits the amount of parallel computation that can be performed. Notwithstanding these drawbacks, this is the solution that is implemented on a CPU.
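The single-buffer bottleneck can be made concrete with a small simulation. The following Python sketch is illustrative only: the names `cycles_per_window` and `gather_window`, and the assumption that each read port delivers exactly one sample per cycle, are simplifications introduced here rather than taken from any particular memory.

```python
import math


def cycles_per_window(window_w, window_h, read_ports):
    """Cycles needed to read one full window from a single buffer,
    assuming each read port returns one sample per cycle."""
    return math.ceil(window_w * window_h / read_ports)


def gather_window(buffer, x0, y0, window_w, window_h, read_ports):
    """Build the window at (x0, y0) by random accesses into `buffer`,
    issuing at most `read_ports` reads per simulated cycle.

    Returns (window_rows, cycles_used).
    """
    coords = [(y0 + dy, x0 + dx)
              for dy in range(window_h) for dx in range(window_w)]
    samples, cycles = [], 0
    for i in range(0, len(coords), read_ports):
        # One cycle: every available port services one address.
        samples.extend(buffer[y][x] for y, x in coords[i:i + read_ports])
        cycles += 1
    rows = [samples[r * window_w:(r + 1) * window_w] for r in range(window_h)]
    return rows, cycles
```

Under these assumptions a dual-port buffer needs 5 cycles for a 3×3 window and 41 cycles for a 9×9 window, so the rate at which full windows become available falls as the window grows.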
An alternative solution avoids the memory bottleneck by creating multiple identical copies of the larger search space using multiple buffers. In this way, multiple data items within the requested window can be read in parallel, limited only by the number of parallel buffers. This solution is amenable to a direct hardware implementation. However, having multiple copies comes at a significant resource cost: for a constant performance requirement, the number of memories needed is essentially proportional to the window size.
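The resource cost of replication can be estimated with a simple model. The Python sketch below is a rough illustration: the function names, the one-sample-per-port-per-cycle assumption, and the `target_cycles` parameter (the number of cycles allowed for delivering one full window) are all hypothetical and introduced only for this example.

```python
import math


def buffers_needed(window_w, window_h, ports_per_buffer, target_cycles):
    """Identical buffer copies required to deliver one full window
    within `target_cycles`, assuming one sample per port per cycle."""
    samples = window_w * window_h
    return math.ceil(samples / (ports_per_buffer * target_cycles))


def total_storage(search_w, search_h, copies):
    """Total samples stored when the whole search space is replicated."""
    return search_w * search_h * copies


# Example: full window every cycle from single-port buffers.
small = buffers_needed(3, 3, ports_per_buffer=1, target_cycles=1)
large = buffers_needed(9, 9, ports_per_buffer=1, target_cycles=1)
```

With these assumptions, holding throughput at one window per cycle requires 9 copies for a 3×3 window and 81 copies for a 9×9 window, so the storage for a 64×64 search space grows from 36,864 to 331,776 samples.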