US 2002/0102024, hereinafter Viola-Jones, discloses a method for detecting a region of interest (ROI) comprising an object such as a face within an acquired image, usually an image frame in a video stream. In brief, Viola-Jones first derives an integral image from the acquired image. Each element of the integral image is calculated as the sum of intensities of all points above and to the left of the corresponding point in the image. The total intensity of any sub-window in an image can then be derived from the integral image values at the four corners of the sub-window: the values for the bottom right and top left points are added and the values for the top right and bottom left points are subtracted. Intensities for adjacent sub-windows can be efficiently compared using particular combinations of integral image values from points of the sub-windows.
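By way of illustration only, the integral image and the sub-window sum might be sketched in Python as follows. The function names, the zero-padded first row and column (so that each element excludes the point itself) and the two-rectangle comparison are assumptions made for this sketch and are not taken from the cited reference. In the general case the sub-window sum combines integral image values at all four corners of the sub-window.

```python
def integral_image(img):
    """Build an integral image with one extra row and column of zeros.

    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    The padding is an assumption of this sketch; it avoids edge checks.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, top, left, height, width):
    """Total intensity of a sub-window from four integral image lookups."""
    bottom, right = top + height, left + width
    return (ii[bottom][right] - ii[top][right]
            - ii[bottom][left] + ii[top][left])


def two_rect_feature(ii, top, left, height, width):
    """Compare two adjacent sub-windows: left half minus right half.

    A hypothetical Haar-like feature; width is assumed to be even.
    """
    half = width // 2
    return (window_sum(ii, top, left, height, half)
            - window_sum(ii, top, left + half, height, half))
```

Once the integral image has been built, each sub-window sum costs four lookups regardless of the size of the sub-window, which is what makes comparing adjacent sub-windows inexpensive.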
Object detectors based on Viola-Jones use a chain (cascade) of, for example, 32 pre-trained classifiers based on rectangular (and increasingly refined) Haar features, applying the classifiers to a candidate sub-window within the integral image. For a complete analysis of a scan area within an acquired image, this sub-window is shifted incrementally across the integral image until the scan area has been covered.
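By way of illustration only, scanning a sub-window across a scan area and applying a classifier chain might be sketched in Python as follows. The representation of each classifier as a callable returning True or False, the early rejection at the first failing classifier, and the names `passes_cascade`, `scan`, `step` and `scan_area` are assumptions made for this sketch; they do not reproduce any particular trained detector.

```python
def passes_cascade(stages, ii, top, left, size):
    """Apply the chain of classifiers to one candidate sub-window.

    Each stage is a hypothetical callable (ii, top, left, size) -> bool.
    The sub-window is rejected as soon as any stage returns False.
    """
    for stage in stages:
        if not stage(ii, top, left, size):
            return False
    return True


def scan(ii, stages, size, step, scan_area):
    """Shift a square sub-window across the scan area in increments of step.

    scan_area is (top, left, height, width) in integral image coordinates.
    Returns the (top, left, size) of every sub-window accepted by all stages.
    """
    top0, left0, height, width = scan_area
    hits = []
    for top in range(top0, top0 + height - size + 1, step):
        for left in range(left0, left0 + width - size + 1, step):
            if passes_cascade(stages, ii, top, left, size):
                hits.append((top, left, size))
    return hits
```

Detecting objects of several sizes would repeat the scan for each sub-window size, which is the source of the processing cost discussed below.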
It will be appreciated that applying Viola-Jones analysis to every portion of an image for every size of object to be detected can still be quite processor intensive, and this could prevent a system from operating quickly enough to detect and track an object across a stream of images in real time.
Thus, many improvements of this basic technique have been developed. For example, PCT Application WO2008/018887 (Ref: FN-143), the disclosure of which is incorporated by reference, discloses an image processing apparatus for tracking faces in an image stream. Each acquired image of the stream is sub-sampled at a specified resolution to provide a sub-sampled image. Fixed size face detection is applied to at least a portion of an integral image calculated from the sub-sampled image to provide a set of candidate face regions. Responsive to the set of candidate face regions produced and any previously detected candidate face regions, the resolution is adjusted for sub-sampling a subsequent acquired image.
There remains a need, however, for a more efficient mechanism for tracking one or more objects across a stream of images.