Field of the Invention
Embodiments of the present invention relate generally to computer graphics and, more specifically, to performing object detection operations via a graphics processing unit (GPU).
Description of the Related Art
Automated real-time detection of objects (e.g., faces, pets, logos, pedestrians, etc.) in images is a well-known mid-level operation in computer vision that enables many higher-level computer vision operations. For instance, object detection is a precursor to tracking, scene understanding and interpretation, content-based image retrieval, etc. Many computer systems configured to implement conventional object detection rely on a central processing unit (CPU). To detect whether a particular object is included in an image, the CPU typically performs two general steps. First, in a training step, the CPU uses “positive” images of the object and “negative” images of non-objects to train a statistical pattern classifier. Second, in an execution step, the CPU applies the trained pattern classifier to each pixel of an input image to determine whether a window (i.e., region) surrounding the pixel corresponds to the object. Further, to find the object at multiple scales, the CPU scales the input image to different sizes and applies the pattern classifier to each scaled image. Consequently, the CPU performs the same set of object-detection operations on a very large number of pixels across multiple scaled images.
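By way of illustration only, the following Python sketch estimates how many window positions a detector visits when it scans an image at multiple scales. The function name, the fixed square window, the stride, and the geometric scale step are all assumptions made for this example and do not come from any particular implementation; the sketch also counts only windows that lie entirely inside each scaled image.

```python
def count_window_positions(width: int, height: int, window: int,
                           scale_step: float, stride: int = 1) -> int:
    """Count classifier window positions over a pyramid of downscaled images.

    Hypothetical helper: the image is shrunk by `scale_step` per level until
    the window no longer fits, and every fully contained window is counted.
    """
    if scale_step <= 1.0:
        raise ValueError("scale_step must be greater than 1")
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")

    total = 0
    scale = 1.0
    while True:
        scaled_w = int(width / scale)
        scaled_h = int(height / scale)
        if scaled_w < window or scaled_h < window:
            break  # the window no longer fits at this scale
        positions_x = (scaled_w - window) // stride + 1
        positions_y = (scaled_h - window) // stride + 1
        total += positions_x * positions_y
        scale *= scale_step
    return total
```

For a 4x4 image, a 2x2 window, and a scale step of 2.0, the sketch counts 9 positions at full size plus 1 position at half size, for 10 in total. Realistic image sizes yield counts that are orders of magnitude larger, which is why the per-position cost of the classifier dominates detection time.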
To optimize the performance of object detection, many CPUs are configured to implement an algorithm known as a cascaded adaptive boosting classifier algorithm (CABCA). In the CABCA approach, a cascaded classifier includes a series of smaller classifiers, often of sequentially increasing complexity, that the CPU applies to each pixel in a series of discrete stages. At each stage, if the CPU determines that a particular pixel does not correspond to the object, then the CPU stops processing the pixel and begins processing the next pixel. As a result of “early terminations,” the number of smaller classifiers that the CPU applies to each pixel is reduced for pixels that are not associated with the object.
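The early-termination behavior described above can be sketched as follows. This is a minimal illustration, not a trained detector: each stage is modeled as a hypothetical threshold test on a single score, and the names `run_cascade` and `example_stages` are assumptions introduced for this example.

```python
from typing import Callable, Sequence, Tuple

# A stage returns True if the sample may still be the object.
Stage = Callable[[float], bool]


def run_cascade(stages: Sequence[Stage], sample: float) -> Tuple[bool, int]:
    """Apply the stages in order and stop at the first rejection.

    Returns (accepted, number_of_stages_evaluated).
    """
    for evaluated, stage in enumerate(stages, start=1):
        if not stage(sample):
            return False, evaluated  # early termination
    return True, len(stages)


# Hypothetical stages with increasingly strict thresholds.
example_stages = [lambda score, t=t: score >= t for t in (0.1, 0.3, 0.5, 0.7)]
```

With these stand-in stages, `run_cascade(example_stages, 0.05)` is rejected after one stage, `run_cascade(example_stages, 0.4)` after three, and `run_cascade(example_stages, 0.9)` passes all four. The amount of work therefore varies from sample to sample.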
Increasingly, advanced computer systems include one or more graphics processing units (GPUs), capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such parallel processing subsystems usually allows these subsystems to efficiently perform certain tasks using a high volume of concurrent computational and memory operations. Because object detection involves performing a high volume of object-detection operations that may be executed concurrently across pixels and images, many advanced computer systems leverage the GPU to perform these operations. However, due to the sequential nature of the cascaded classifier and the differing number of classifiers applied to each pixel, the CABCA approach to object detection does not fully leverage the processing capabilities of GPUs.
For example, suppose that a first pixel of an image were not associated with the object, but a second pixel of the image were associated with the object. Further, suppose that the GPU were to process the image using a cascaded classifier that included 16 smaller classifiers. Finally, suppose that a first processing unit within the GPU were to determine that the first pixel was not associated with the object based on the first smaller classifier. In such a scenario, the first processing unit would cease processing the first pixel and, consequently, would be idle until the processing unit assigned to the second pixel applied the 15 remaining smaller classifiers included in the cascaded classifier to the second pixel. Since the number of processing units included in the GPU is limited, idle processing units reduce the efficiency of the GPU and limit the speed at which the computer system performs object detection.
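The cost of such idle time can be estimated with a simple model. The sketch below assumes, purely for illustration, that a group of processing units advances in lockstep and cannot accept new work until its slowest member finishes; the function name and this scheduling model are assumptions, not a description of any specific GPU.

```python
from typing import Sequence


def lockstep_utilization(stages_per_pixel: Sequence[int]) -> float:
    """Fraction of stage slots doing useful work in one lockstep group.

    `stages_per_pixel[i]` is the number of cascade stages actually applied
    to the pixel assigned to processing unit i.
    """
    if not stages_per_pixel:
        raise ValueError("at least one pixel is required")
    if max(stages_per_pixel) == 0:
        raise ValueError("at least one stage must be applied")

    # The group is occupied for as long as its slowest unit is busy.
    group_steps = max(stages_per_pixel)
    useful_slots = sum(stages_per_pixel)
    total_slots = group_steps * len(stages_per_pixel)
    return useful_slots / total_slots
```

For the two-pixel scenario above, where one pixel is rejected after 1 stage and the other is processed through all 16 stages, `lockstep_utilization([1, 16])` evaluates to 17/32, or about 53 percent. Under this model, nearly half of the available stage slots go unused.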
Accordingly, what is needed in the art is a more effective technique for performing object detection operations via parallel processing architectures.