Modern computer systems usually include, without limitation, a central processing unit (CPU), a graphics processing unit (GPU) and several input/output devices. Typically, the CPU is designed for executing general-purpose software, whereas the GPU is specifically optimized for performing 3D rendering computations such as texture mapping or geometric transformations. State-of-the-art CPU microarchitectures include one or more processing cores with the purpose of exploiting both coarse-grain and fine-grain thread-level parallelism (TLP). Coarse-grain parallelism is achieved by concurrently executing several computing tasks or processes on the available general-purpose CPU cores. This load-balancing distribution is implemented in the operating system (OS) scheduler, which assigns to each process execution time slots on the available cores. On the other hand, fine-grain TLP is implemented in the hardware of each core and tries to minimize the underutilization of the functional units by simultaneously fetching and executing instructions from different threads. This technique is known as simultaneous multithreading and is extensively used in current CPU designs for maximizing the performance of parallel applications and OS processes.
Examples of workloads that can benefit from parallelism are computer vision algorithms, particularly object detection methods. These techniques determine the location of specific objects such as traffic signs, handwritten characters or even human faces within an image or video frame. One of the most widely used methods for performing object detection relies on a boosted cascade of classifiers. This cascade arranges a set of weak classifiers sequentially with the purpose of building an aggregated strong classifier. This approach facilitates the rejection of negative candidates at early stages of the cascade, thus quickly discarding image regions that are not likely to contain the desired objects. Even though the sequential nature of the boosted cascade prevents any attempt at inter-stage parallelization, it is still possible to evaluate different image regions in parallel by assigning them to different CPU threads.
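By way of illustration only, the early rejection performed by a cascade and the per-region parallelism described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the detection method of any particular system: a "window" is assumed to be a flat list of pixel intensities, and the helper names (mean_above, max_above, TOY_CASCADE, evaluate_cascade, detect), the weights and the thresholds are all hypothetical stand-ins for trained weak classifiers.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical weak classifiers: each returns 1.0 (vote "object") or 0.0.
def mean_above(limit):
    return lambda window: 1.0 if sum(window) / len(window) > limit else 0.0


def max_above(limit):
    return lambda window: 1.0 if max(window) > limit else 0.0


# Hypothetical cascade: each stage is (list of (classifier, weight), threshold).
TOY_CASCADE = [
    ([(mean_above(50), 1.0)], 1.0),                          # cheap first stage
    ([(mean_above(100), 0.6), (max_above(200), 0.6)], 1.0),  # stricter second stage
]


def stage_passes(stage, window):
    """A stage passes when the weighted sum of its weak votes reaches its threshold."""
    weak_classifiers, threshold = stage
    score = sum(weight * classify(window) for classify, weight in weak_classifiers)
    return score >= threshold


def evaluate_cascade(cascade, window):
    """Stages run strictly in order; the first failing stage rejects the window."""
    for stage in cascade:
        if not stage_passes(stage, window):
            return False  # early rejection: later stages are never evaluated
    return True


def detect(cascade, windows, max_workers=4):
    """Evaluate independent windows concurrently, one task per window."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        accepted = list(pool.map(lambda w: evaluate_cascade(cascade, w), windows))
    return [w for w, ok in zip(windows, accepted) if ok]


if __name__ == "__main__":
    windows = [[10, 20, 30], [120, 130, 250], [60, 70, 80]]
    print(detect(TOY_CASCADE, windows))
```

The sketch only shows the decomposition: stages within one window are sequential, while distinct windows are independent tasks. In CPython, the global interpreter lock limits the speedup that threads can give to pure-Python arithmetic, so a practical implementation would rely on native code for the per-window work.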
Since these threads need to perform a huge number of arithmetic operations, the overall object detection latency could be dramatically reduced if the number of arithmetic and logic units (ALUs) within each CPU core were increased. Unfortunately, the flat memory access model and complex out-of-order execution engines offered by CPUs tend to consume large portions of the chip die with big caches, buffers and speculation logic, thus reducing the available area for additional functional units. Unlike general-purpose CPUs, stream processor microarchitectures such as GPUs try to exploit data-level parallelism (DLP) and adopt a radically different memory hierarchy with small on-die shared memories in which data locality is managed by the programmer. The die area consumed by this approach is much smaller, and therefore more transistors are devoted to increasing the number of available ALUs. In order to maximize the utilization of these ALUs, modern GPUs implement DLP through Single Instruction Multiple Data (SIMD) instructions that are executed in an array of lightweight multithreaded cores. These cores are organized in clusters in such a manner that data locality and synchronization within the cluster are achieved by using a shared memory.
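To make the data-parallel style concrete, the following sketch emulates, sequentially and in plain Python, how a cascade could be evaluated in lockstep: the same stage is applied to every window that is still active, and only the survivors are carried to the next stage. It is an illustrative assumption rather than a description of any actual GPU implementation; the names TOY_STAGES and detect_lockstep, the scoring functions and the thresholds are hypothetical, and a "window" is assumed to be a flat list of pixel intensities.

```python
# Hypothetical stages: (score function, threshold). A window survives a stage
# when its score reaches the threshold.
TOY_STAGES = [
    (lambda window: sum(window) / len(window), 50),  # stage 1: mean intensity
    (lambda window: max(window), 200),               # stage 2: peak intensity
]


def detect_lockstep(stages, windows):
    """Apply each stage to all still-active windows before moving to the next stage.

    Returns the indices of accepted windows and the survivor count after each stage.
    """
    active = list(range(len(windows)))  # indices of windows not yet rejected
    survivors_per_stage = []
    for score_fn, threshold in stages:
        # Same operation over many data elements: on SIMD hardware these
        # evaluations would be issued as one instruction stream over many lanes.
        scores = [score_fn(windows[i]) for i in active]
        # Keep only the surviving windows so later stages see a dense work list.
        active = [i for i, score in zip(active, scores) if score >= threshold]
        survivors_per_stage.append(len(active))
        if not active:
            break
    return active, survivors_per_stage


if __name__ == "__main__":
    windows = [[10, 20, 30], [120, 130, 250], [60, 70, 80], [110, 120, 130]]
    print(detect_lockstep(TOY_STAGES, windows))
```

The survivor counts show why utilization matters: as windows are rejected, fewer data elements remain for each subsequent stage, so lanes that are not refilled with new work would sit idle.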
Heterogeneous microarchitectures combine the characteristics of both CPUs and GPUs, usually in the same chip die. These designs offer a massively parallel multithreaded execution engine that is tightly coupled with one or more general-purpose out-of-order processing cores. With the emergence of such technology, there is a need for a parallel object detection method that fully exploits the computing capabilities of the underlying hardware. The efficient use of both coarse-grain and fine-grain parallelism within all the steps involved in the object detection process would maximize the occupancy of the available SIMD processing units, thus decreasing the latency of image analysis. This increased detection throughput enables the real-time processing of high-resolution images and video frames that feature a large number of objects (e.g. human faces) in scenarios such as highly crowded environments.