This invention relates to a SIMD (Single Instruction Multiple Data) processor and method for detecting an object, such as a face, in an image.
Many modern methods for performing automatic face detection are based on the Viola-Jones object detection framework, which is described in the paper by P. Viola and M. Jones: “Robust Real-Time Face Detection”, International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004. The Viola-Jones framework operates on a set of image regions or “subwindows” defined for an image, each subwindow having a different location, scale or angle of rotation within the image so as to allow faces at different locations, or of different sizes and angles of rotation, to be detected. A cascaded set of binary classifiers operates on each subwindow so as to detect whether the subwindow is likely to bound a face in the image. Each binary classifier is a test performed on a subwindow in order to determine whether the subwindow satisfies one or more simple visual features (often referred to as “Haar-like features”). If a subwindow satisfies the one or more simple visual features, the binary classifier passes the subwindow and processing moves on to the next binary classifier in the cascade. When all of the binary classifiers of a cascade pass a subwindow, that subwindow becomes a candidate for a face in the image being searched. If any of the binary classifiers in a cascade rejects a subwindow, then no further processing is performed on that subwindow, the cascade terminates and cascade processing begins again on the next subwindow.
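By way of illustration only, the early-rejection behaviour of a classifier cascade described above might be sketched as follows. This is a minimal sketch and not the implementation of the Viola-Jones paper: the names `run_cascade` and `find_candidates` are hypothetical, a subwindow is assumed to be a two-dimensional list of pixel values, and each binary classifier is assumed to be a function returning true when it passes the subwindow.

```python
from typing import Callable, List, Sequence

# Assumption: a subwindow is a 2-D sequence of pixel values.
Subwindow = Sequence[Sequence[int]]
# Assumption: a binary classifier returns True when it passes the subwindow.
Classifier = Callable[[Subwindow], bool]


def run_cascade(subwindow: Subwindow, cascade: Sequence[Classifier]) -> bool:
    """Return True only if every classifier in the cascade passes the subwindow."""
    for classifier in cascade:
        if not classifier(subwindow):
            # The first rejection terminates the cascade for this subwindow;
            # later classifiers are never evaluated.
            return False
    return True


def find_candidates(subwindows: Sequence[Subwindow],
                    cascade: Sequence[Classifier]) -> List[int]:
    """Return the indices of the subwindows that pass the whole cascade."""
    return [index for index, subwindow in enumerate(subwindows)
            if run_cascade(subwindow, cascade)]
```

In this sketch the amount of work done per subwindow varies from one classifier evaluation (immediate rejection) to the full length of the cascade (a face candidate).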
Four visual features 201-204 which are typically used in the Viola-Jones framework are shown in FIG. 2. Each of the features 201 to 204 shown in FIG. 2 visually represents how to process the pixel values of a subwindow in order to test that subwindow. For example, feature 201 might represent a component of a first binary classifier in a cascade of classifiers and is calculated by subtracting the sum of the pixel values of the subwindow lying in the shaded area of the feature (the right-hand side of the subwindow) from the sum of the pixel values of the subwindow lying in the unshaded area of the feature (the left-hand side of the subwindow). If the feature evaluates to a value which exceeds a predefined threshold (typically established by training the face detector on test images), the subwindow is deemed to satisfy the visual feature and the binary classifier passes the subwindow. Typically the binary classifiers operate on subwindows of an image which has been processed so as to represent only luminance or brightness information (e.g. the pixel values can be luma values). In this manner, the binary classifiers of a cascade act so as to identify particular patterns of contrast in an image which are indicative of facial features.
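The calculation described for feature 201 might, for example, be sketched as follows. The function names are hypothetical, the subwindow is assumed to be a two-dimensional list of luma values, and the boundary between the unshaded and shaded areas is assumed to lie at the horizontal midpoint of the subwindow. The sums are computed directly for clarity.

```python
from typing import Sequence

# Assumption: a subwindow is a 2-D sequence of luma values, one inner
# sequence per row of pixels.
Subwindow = Sequence[Sequence[int]]


def two_rectangle_feature(subwindow: Subwindow) -> int:
    """Sum of the left-hand (unshaded) area minus the right-hand (shaded) area."""
    # Assumption: the two areas meet at the horizontal midpoint.
    half = len(subwindow[0]) // 2
    unshaded = sum(sum(row[:half]) for row in subwindow)
    shaded = sum(sum(row[half:]) for row in subwindow)
    return unshaded - shaded


def passes_feature(subwindow: Subwindow, threshold: int) -> bool:
    """The subwindow satisfies the feature if the value exceeds the threshold."""
    return two_rectangle_feature(subwindow) > threshold
```

For a subwindow whose left half is bright and whose right half is dark, such as two rows of `[10, 10, 1, 1]`, the feature evaluates to 40 - 4 = 36, so the subwindow would pass a threshold of 30 and fail a threshold of 40. The threshold values here are arbitrary; in practice they would be established by training.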
In order to improve the performance of a face detection system, the binary classifier operations performed according to the Viola-Jones object detection framework can be performed in parallel at a graphics processing unit (GPU) by allocating groups of threads to the GPU. However, this approach can lead to the parallel processing elements of the GPU being idle for a significant proportion of time. This is because the parallel processing of a group of threads will generally not complete until the processing of every one of its threads has completed. Any given thread operating on a subwindow could terminate almost immediately if the subwindow fails the first binary classifier of the cascade, or it could complete processing of all of the binary classifiers of the cascade should the subwindow represent a face candidate. The processing elements of threads which terminate early therefore remain idle until the longest-running thread of the group has finished. This underutilisation of processing resources presents a hurdle to performing face detection in real-time using the Viola-Jones framework, especially on mobile and low-power platforms.
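The scale of the underutilisation can be illustrated with a simplified model. The sketch below assumes, purely for illustration, that a group of threads executes in lockstep, that each thread processes one subwindow, that every binary classifier takes one time step, and that the group occupies its processing elements until its longest-running thread completes. The function name `group_utilisation` is hypothetical.

```python
from typing import Sequence


def group_utilisation(classifiers_evaluated: Sequence[int]) -> float:
    """Fraction of processing-element time steps spent on useful work.

    classifiers_evaluated[i] is the number of binary classifiers that
    thread i evaluates before its subwindow is rejected or accepted.
    """
    if not classifiers_evaluated or min(classifiers_evaluated) < 1:
        raise ValueError("each thread must evaluate at least one classifier")
    # The whole group is held until its longest-running thread completes.
    longest = max(classifiers_evaluated)
    available_steps = longest * len(classifiers_evaluated)
    useful_steps = sum(classifiers_evaluated)
    return useful_steps / available_steps
```

Under these assumptions, a group of eight threads in which seven subwindows fail the first classifier and one subwindow runs through a 20-classifier cascade performs 27 useful steps out of 160 available, a utilisation of about 17%. The figures are illustrative only and are not measurements.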
Previous efforts to address this issue have attempted to break up the performance of binary classifiers into stages, such as the Nvidia CUDA implementation described at pages 534-541 of “GPU Computing Gems” by Wen-mei W. Hwu, Elsevier Inc., 2011. However, this only partly addresses the issue, introduces an additional overhead for compacting data between stages, and has the disadvantage that it is inefficient during the early stages of processing.