Computer vision applications are becoming increasingly important as machine intelligence is being used to solve complex problems in fields ranging from remote sensing to medical data analysis to law enforcement and security. There is a great need for a scalable high-performance framework for processing computer vision workloads, with scalable and efficient algorithms for processing images, videos, and other types of unstructured data.
Integral image computation, sometimes referred to as computing a summed-area table, is a critical component of computer vision computations and is used in several computer vision applications. Determining the integral image of an input image facilitates other computer vision computations involving stereo vision, feature tracking, edge detection, image filtering, and object detection, among others. Hence, improving performance in computing the integral image has a direct impact on the performance of other computer vision applications.
Integral image computation involves determining, for each pixel location of an input image, the cumulative sum of all the pixels in the rectangle extending from the top left pixel of the image to that location, so that the value at the bottom right pixel is the sum of all the pixels of the image. One approach to improving performance in computing the integral image involves parallelizing the computations.
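By way of illustration only, the cumulative-sum definition above can be sketched in a few lines of Python. This is a minimal sequential sketch, not the parallel approach described herein; the function names `integral_image` and `rect_sum` and the nested-list image representation are assumptions made for this example.

```python
def integral_image(img):
    """Return the integral image (summed-area table) of a 2D list of numbers.

    Entry [y][x] of the result is the sum of all img values in the rectangle
    from the top left pixel (0, 0) to (y, x), inclusive.
    """
    if not img:
        return []
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(width):
            row_sum += img[y][x]
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = row_sum + above
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of the pixels in the inclusive rectangle rows top..bottom, cols left..right.

    Uses at most four lookups into the integral image ii.
    """
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
```

Once the table is available, the sum over any axis-aligned rectangle of the input image can be obtained with at most four lookups, which is why the computations listed above benefit from a fast integral image.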
Prior efforts to efficiently parallelize integral image computations use specialized embedded systems or graphics processing units (GPUs). The approaches using GPUs either optimize memory accesses while ignoring the degree of parallelism achieved, or aim for work-efficiency. The performance of GPU-based approaches is also mostly unaffected by regular non-sequential accesses across threads (i.e., strided memory accesses). This is because GPU hardware inherently coalesces such memory accesses, and thereby provides the same benefits as sequential accesses. Modern commodity CPU hardware, however, still experiences performance deterioration with regular non-sequential memory accesses. The challenge in integral image computation is therefore to minimize non-sequential memory accesses, increase the degree of parallelism, and maintain a certain level of work efficiency.
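To make the notion of strided memory accesses concrete, the following Python sketch shows a generic two-pass decomposition of the integral image over a row-major buffer. It is offered only as an illustration of the access patterns at issue, not as the approach described herein or as any particular prior approach; the function name `two_pass_integral` and the flat-list layout are assumptions made for this example.

```python
def two_pass_integral(flat, height, width):
    """Integral image of a row-major flat list, computed in two passes."""
    out = list(flat)

    # Pass 1: prefix sum along each row. The rows are independent of one
    # another, and each row touches consecutive indices (stride 1).
    for y in range(height):
        base = y * width
        for x in range(1, width):
            out[base + x] += out[base + x - 1]

    # Pass 2: prefix sum along each column. The columns are independent of
    # one another, but consecutive elements of a column lie `width` entries
    # apart in the buffer, i.e., a regular non-sequential (strided) access.
    for x in range(width):
        for y in range(1, height):
            out[y * width + x] += out[(y - 1) * width + x]

    return out


result = two_pass_integral([1, 2, 3, 4, 5, 6, 7, 8, 9], 3, 3)
```

In this sketch each pass exposes one independent unit of work per row or per column, but the second pass walks the buffer with a stride equal to the image width, which is the kind of access pattern that the discussion above identifies as costly on commodity CPU hardware.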
Described herein are approaches for performing integral image computation in parallel across a large number of processor cores with a maximum degree of parallelism and without compromising work-efficiency, while fully utilizing available memory bandwidth and keeping non-sequential memory accesses to a minimum.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.