Conventional computer vision implementations are designed to use as little bandwidth and hardware as practical while maximizing the output data rate. In practice, many conventional designs suffer from hardware drawbacks or limitations. A common design approach is to fetch the input data and sum the data values for each window separately. These separate fetches and summations are inefficient. When windows overlap, the same data values are fetched once for every window that contains them, so memory bandwidth is wasted on redundant fetches. Summing each window separately also results in a low output data rate, especially with large overlaps and large window sizes. To mitigate the bandwidth issue, other common designs use a row buffer to avoid re-fetching the same data. However, row buffers consume significant hardware, which limits flexibility in the supported window sizes.
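As a rough illustration of the redundant-fetch problem, the following Python sketch models the conventional per-window approach in software and counts how many element fetches it performs. The function name `naive_window_sums`, the 2D list input, and the stride parameters are hypothetical and chosen only for illustration; each array access stands in for one memory fetch.

```python
from typing import List, Tuple


def naive_window_sums(
    image: List[List[int]], win_h: int, win_w: int, stride_y: int, stride_x: int
) -> Tuple[List[List[int]], int]:
    """Sum each window independently (hypothetical model of the conventional design).

    Returns the window sums and the total number of element fetches performed.
    """
    rows, cols = len(image), len(image[0])
    fetches = 0
    sums: List[List[int]] = []
    for top in range(0, rows - win_h + 1, stride_y):
        row_sums: List[int] = []
        for left in range(0, cols - win_w + 1, stride_x):
            total = 0
            # Every element of the window is fetched again, even if an
            # overlapping window already fetched it.
            for y in range(top, top + win_h):
                for x in range(left, left + win_w):
                    total += image[y][x]
                    fetches += 1
            row_sums.append(total)
        sums.append(row_sums)
    return sums, fetches


if __name__ == "__main__":
    image = [[1] * 8 for _ in range(8)]  # 64 distinct input values
    _, overlapping = naive_window_sums(image, 4, 4, 1, 1)
    _, disjoint = naive_window_sums(image, 4, 4, 4, 4)
    print(overlapping, disjoint)  # 400 64
```

With 4x4 windows on an 8x8 input, stride 1 (heavily overlapping windows) performs 400 fetches to read only 64 distinct values, while stride 4 (no overlap) performs exactly 64. The redundancy grows with window size and overlap, which matches the behavior described above.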
It would be desirable to implement window-sum calculations for overlapping windows that avoid redundant memory fetches without the hardware cost of row buffers.