The present invention relates to a method of and an apparatus for detecting a face-like region of a colour image. Such a method may be used in association with other methods for detecting a face in an image and for capturing a target image, for instance during the initialisation stage of an image tracking system which may be associated with an observer tracking autostereoscopic display. Such methods and apparatuses have a wide range of applications, for instance in skin colour detection, face detection and recognition, security surveillance, video and image compression, video conferencing, multimedia database searching and computer games.
The present invention also relates to an observer tracking display, for instance of the autostereoscopic type.
Autostereoscopic displays enable a viewer to see two separate images forming a stereoscopic pair by viewing such displays with the eyes in two viewing windows. Examples of such displays are disclosed in EP 0 602 934, EP 0 656 555, EP 0 708 351, EP 0 726 482 and EP 0 829 743. An example of a known type of observer tracking autostereoscopic display is illustrated in FIG. 1 of the accompanying drawings.
The display comprises a display system 1 co-operating with a tracking system 2. The tracking system 2 comprises a tracking sensor 3 which supplies a sensor signal to a tracking processor 4. The tracking processor 4 derives from the sensor signal an observer position data signal which is supplied to a display control processor 5 of the display system 1. The processor 5 converts the position data signal into a window steering signal and supplies this to a steering mechanism 6 of a tracked 3D display 7. The viewing windows for the eyes of the observer are thus steered so as to follow movement of the head of the observer and, within the working range, to maintain the eyes of the observer in the appropriate viewing windows.

EP 0 877 274 and GB 2 324 428 disclose an observer video tracking system which has a short latency time, a high update frequency and adequate measurement accuracy for observer tracking autostereoscopic displays. FIG. 2 of the accompanying drawings illustrates an example of the system, which differs from that shown in FIG. 1 of the accompanying drawings in that the tracking sensor 3 comprises a Sony XC999 NTSC video camera operating at a 60 Hz field rate and the tracking processor 4 is provided with a mouse 8 and comprises a Silicon Graphics entry level machine of the Indy series equipped with an R4400 processor operating at 150 MHz and a video digitiser and frame store having a resolution of 640×240 picture elements (pixels) for each field captured by the camera 3. The camera 3 is disposed on top of the display 7 and points towards the observer, who sits in front of the display. The normal distance between the observer and the camera 3 is about 0.85 metres, at which distance the observer has a freedom of movement in the lateral or X direction of about 450 mm. The distance between two pixels in the image formed by the camera corresponds to about 0.67 mm and 1.21 mm in the X and Y directions, respectively.
The Y resolution is halved because each interlaced field is used individually.
FIG. 3 of the accompanying drawings illustrates in general terms the tracking method performed by the processor 4. The method comprises an initialisation stage 9 followed by a tracking stage 10. During the initialisation stage 9, a target image or "template" is captured by storing a portion of an image from the camera 3. The target image generally contains the observer eye region as illustrated at 11 in FIG. 4 of the accompanying drawings. Once the target image or template 11 has been successfully captured, observer tracking is performed in the tracking stage 10.
A global target or template search is performed at 12 so as to detect the position of the target image within the whole image produced by the camera 3. Once the target image has been located, motion detection is performed at 13, after which a local target or template search is performed at 14. The template matching steps 12 and 14 are performed by cross-correlating the target image in the template with each image sub-section overlaid by the template. The best correlation value is compared with a predetermined threshold in a step 15 to check whether tracking has been lost. If so, control returns to the global template matching step 12. Otherwise, control returns to the step 13.
The motion detection 13 and the local template matching 14 form a tracking loop which is performed for as long as tracking is maintained. The motion detection step supplies position data by a differential method which determines the movement of the target image between consecutive fields and adds this to the position found by local template matching in the preceding step for the earlier field.
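By way of illustration only (this sketch is not part of the disclosed system, and the function name and array conventions are assumptions), the cross-correlation used in the template matching steps may be realised as an exhaustive normalised cross-correlation search over the image:

```python
import numpy as np

def best_match(image, template):
    """Slide the template over every sub-section of the image and return
    the position (x, y) and score of the best normalised cross-correlation
    match. Both inputs are 2D greyscale arrays."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    best_score, best_pos = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (t * t).sum())
            if denom == 0:
                continue  # flat patch: correlation undefined, skip
            score = (p * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score
```

In practice the global search 12 scans the whole frame in this way, while the local search 14 restricts the scan to a small neighbourhood of the previously found position.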
The initialisation stage 9 obtains a target image or a template of the observer before tracking starts. The initialisation stage disclosed in EP 0 877 274 and GB 2 324 428 uses an interactive method in which the display 7 displays the incoming video images and an image generator, for example embodied in the processor 4, generates a border image or graphical guide 16 on the display as illustrated in FIG. 5 of the accompanying drawings. A user-operable control, for instance forming part of the mouse 8, allows manual actuation of capturing of the image region within the border image.
The observer views his own image on the display 7 together with the border image which is of the required template size. The observer aligns the midpoint between his eyes with the middle line of the graphical guide 16 and then activates the system to capture the template, for instance by pressing a mouse button or a keyboard key. Alternatively, this alignment may be achieved by dragging the graphical guide 16 to the desired place using the mouse 8.
An advantage of such an interactive template capturing technique is that the observer is able to select the template with acceptable alignment accuracy. This involves the recognition of the human face and the selection of the interesting image regions, such as the eye regions. Whereas human vision renders this process trivial, such template capture would be difficult for a computer, given all possible types of people with different ages, sexes, eye shapes and skin colours under various lighting conditions.
Suwa et al, "A Video Quality Improvement Technique for Video Phone and Video Conference Terminal", IEEE Workshop on Visual Signal Processing and Communications, Sep. 21-22 1993, Melbourne, Australia disclose a technique for detecting a facial region based on a statistical model of skin colour. This technique assumes that the colour and brightness in the facial region lie within a defined domain and that the face will occupy a predetermined amount of space in a video frame. By searching for a colour region which consists of image pixels whose colours are within the domain and whose size is within a known range, a face region may be located. However, the colour space domain for the skin colour changes with changes in lighting source, direction and intensity. The colour space also varies for different skin colours. Accordingly, this technique requires calibration of the skin colour space for each particular application and system and is thus of limited applicability.
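The kind of domain-and-size test described above may be sketched as follows (an illustrative reading of the Suwa et al approach, not their implementation; the per-channel bounds and fraction limits stand in for the calibrated colour domain):

```python
import numpy as np

def skin_mask(rgb, domain):
    """Mark pixels whose colour lies within a calibrated skin-colour
    domain, here given as (low, high) per-channel bounds."""
    low, high = domain
    return np.all((rgb >= low) & (rgb <= high), axis=-1)

def face_region_present(rgb, domain, min_frac, max_frac):
    """Report a face as present when the skin-coloured area occupies a
    predetermined fraction of the frame."""
    frac = skin_mask(rgb, domain).mean()
    return min_frac <= frac <= max_frac
```

The limitation noted above is visible here: `domain` must be recalibrated whenever the lighting or the skin colour of the observer changes.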
Swain et al, "Color Indexing", International Journal of Computer Vision, 7:1, pages 11 to 32, 1991 disclose the use of colour histograms of multicoloured objects to provide colour indexing in a large database of models. A technique known as "histogram back projection" is then used to locate the position of a known object such as a facial region, for instance as disclosed by Sako et al, "Real-Time Facial-Feature Tracking based on Matching Techniques and its Applications", Proceedings of the 12th IAPR International Conference on Pattern Recognition, Jerusalem, Oct. 6-13 1994, vol II, pages 320 to 324. However, this technique requires knowledge of the desired target, such as a colour histogram of a face, and only works if sufficient pixels of the target image are different from pixels of other parts of the image. It is therefore necessary to provide a controlled background, and additional techniques are required to cope with changes of lighting.
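The core of histogram back projection may be sketched as follows (a minimal reading of the Swain et al ratio-histogram idea, assuming colours have already been quantised to histogram bin indices):

```python
import numpy as np

def back_project(image_bins, model_hist, image_hist):
    """Replace each pixel by the ratio of model to image histogram counts
    for its colour bin, capped at 1. High values mark pixels whose colour
    is characteristic of the model (e.g. a face)."""
    ratio = np.minimum(model_hist / np.maximum(image_hist, 1e-9), 1.0)
    return ratio[image_bins]
```

The object is then located by finding the peak of the back-projected image; as noted above, this only works when the model colours are rare elsewhere in the scene.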
Chen et al, "Face Detection by Fuzzy Pattern Matching", IEEE (0-8186-7042-8), pages 591 to 596, 1995 disclose a technique for detecting a face-like region in an input image using a fuzzy pattern matching method which is largely based on the extraction of skin colours using a model known as the "skin colour distribution function" (SCDF). This technique first converts the RGB data into a Farnsworth colour space as disclosed in Wyszecki et al, "Color Science", John Wiley and Sons Inc, 1982. The SCDF is built by gathering a large set of sample images containing human faces and having human viewers select the skin regions in the images. A learning program is then applied to investigate the frequency with which each colour in the colour space appears in the skin regions. The SCDF is then normalised and is used to estimate how closely a colour resembles skin colour. Once a region is extracted as a likely skin region, it is subjected to further analysis based on pre-established face shape models, each containing 10×12 square cells. However, a problem with this technique is that the SCDF can vary as the lighting conditions change.
According to a first aspect of the invention, there is provided a method of detecting a face-like region of a colour image, comprising reducing the resolution of the colour image by averaging the saturation to form a reduced resolution image and searching for a region of the reduced resolution image having, in a predetermined shape, a substantially uniform saturation which is substantially different from the saturation of the portion of the reduced resolution image surrounding the predetermined shape.
The colour image may comprise a plurality of picture elements and the resolution may be reduced such that the predetermined shape is from two to three reduced resolution picture elements across.
The colour image may comprise a rectangular array of M×N picture elements, the reduced resolution image may comprise (M/m) by (N/n) picture elements, each of which corresponds to m×n picture elements of the colour image, and the saturation of each picture element of the reduced resolution image may be given by:

P = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} f(i,j)

where f(i,j) is the saturation of the picture element of the ith column and the jth row of the m×n picture elements. The method may comprise storing the saturations in a store.
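The block averaging defined above may be sketched as follows (an illustrative NumPy realisation, assuming M and N are exact multiples of m and n; the function name is not from the specification):

```python
import numpy as np

def reduce_saturation(sat, m, n):
    """Average m-by-n blocks of an M-by-N saturation image, so that each
    reduced-resolution picture element is P = (1/mn) * sum of f(i, j)
    over its block."""
    M, N = sat.shape
    assert M % m == 0 and N % n == 0, "illustrative sketch: exact tiling only"
    return sat.reshape(M // m, m, N // n, n).mean(axis=(1, 3))
```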
A uniformity value may be ascribed to each of the reduced resolution picture elements by comparing the saturation of each of the reduced resolution picture elements with the saturation of at least one adjacent reduced resolution picture element.
Each uniformity value may be ascribed a first value if
(max(P) − min(P))/max(P) ≤ T
where max(P) and min(P) are the maximum and minimum values, respectively, of the saturations of the reduced resolution picture element and the or each adjacent picture element and T is a threshold, and a second value different from the first value otherwise. T may be substantially equal to 0.15.
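The uniformity test may be sketched as follows (an illustrative function; the choice of 1 and 0 for the first and second values, and the handling of an all-zero neighbourhood, are assumptions not fixed by the specification):

```python
def uniformity_value(p, neighbours, T=0.15):
    """Ascribe the first value (1) if the relative spread of saturations
    across the reduced-resolution pixel and its adjacent pixels is within
    the threshold T, and the second value (0) otherwise."""
    values = [p] + list(neighbours)
    mx, mn = max(values), min(values)
    if mx == 0:
        return 1  # assumption: all-zero saturations count as uniform
    return 1 if (mx - mn) / mx <= T else 0
```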
The or each adjacent reduced resolution picture element may not have been ascribed a uniformity value and each uniformity value may be stored in the store in place of the corresponding saturation.
The resolution may be reduced such that the predetermined shape is two or three reduced resolution picture elements across and the method may further comprise indicating detection of a face-like region when a uniformity value of the first value is ascribed to any of one reduced resolution picture element, two vertically or horizontally adjacent reduced resolution picture elements and a rectangular two-by-two array of picture elements and when a uniformity value of the second value is ascribed to each surrounding reduced resolution picture element.
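The detection rule above, in which a face-like region appears as a small isolated group of uniform pixels surrounded by non-uniform ones, may be sketched as follows (an illustrative scan over a map of uniformity values, with 1 as the first value and 0 as the second; return format is an assumption):

```python
import numpy as np

def find_face_candidates(u):
    """Scan a 2D map of uniformity values (1 = uniform) for isolated
    groups of one pixel, two horizontally or vertically adjacent pixels,
    or a 2x2 block, all of whose surrounding pixels are non-uniform.
    Returns (row, col, height, width) tuples."""
    H, W = u.shape
    candidates = []
    for h, w in [(1, 1), (1, 2), (2, 1), (2, 2)]:
        for r in range(H - h + 1):
            for c in range(W - w + 1):
                block = u[r:r + h, c:c + w]
                if not np.all(block == 1):
                    continue
                # sum over the block plus its one-pixel border, clipped
                # at the image edges; subtract the block itself
                r0, r1 = max(r - 1, 0), min(r + h + 1, H)
                c0, c1 = max(c - 1, 0), min(c + w + 1, W)
                surround = u[r0:r1, c0:c1].sum() - block.sum()
                if surround == 0:
                    candidates.append((r, c, h, w))
    return candidates
```

Note that a uniform pixel inside a larger uniform group is rejected automatically, because its border then contains other first-value pixels.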
Detection may be indicated by storing a third value different from the first and second values in the store in place of the corresponding uniformity value.
The method may comprise repeating the resolution reduction and searching at least once with the reduced resolution picture elements shifted with respect to the colour image picture elements.
The saturation may be derived from red, green and blue components as
(max(R,G,B) − min(R,G,B))/max(R,G,B)
where max(R,G,B) and min(R,G,B) are the maximum and minimum values, respectively, of the red, green and blue components.
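This saturation formula may be sketched directly (the zero-black convention is an assumption, since the formula is undefined when max(R,G,B) is zero):

```python
def saturation(r, g, b):
    """Saturation as (max - min) / max of the red, green and blue
    components; defined here as zero for pure black, where max is zero."""
    mx = max(r, g, b)
    mn = min(r, g, b)
    return 0.0 if mx == 0 else (mx - mn) / mx
```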
The method may comprise capturing the colour image. The colour image may be captured by a video camera and the resolution reduction and searching may be repeated for different video fields or frames from the video camera. A first colour image may be captured while illuminating an expected range of positions of a face, a second colour image may be captured using ambient light, and the second colour image may be subtracted from the first colour image to form the colour image.
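The flash/ambient differencing described above may be sketched as follows (an illustrative 8-bit subtraction with clipping; the dtype handling is an assumption):

```python
import numpy as np

def difference_image(lit, ambient):
    """Subtract an ambient-light capture from an actively illuminated
    capture of the same scene, clipping negatives to zero, so that the
    mostly unlit background is suppressed and the illuminated face
    region dominates the result."""
    diff = lit.astype(np.int16) - ambient.astype(np.int16)
    return np.clip(diff, 0, 255).astype(np.uint8)
```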
According to a second aspect of the invention, there is provided an apparatus for detecting a face-like region of a colour image, comprising a data processor arranged to reduce the resolution of the colour image by averaging the saturation to form a reduced resolution image and to search for a region of the reduced resolution image having, in a predetermined shape, a substantially uniform saturation which is substantially different from the saturation of the portion of the reduced resolution image surrounding the predetermined shape.
According to a third aspect of the invention, there is provided an observer tracking display including an apparatus according to the second aspect of the invention.
It is known that human skin tends to be of uniform saturation. The present method and apparatus make use of this property and provide an efficient method of finding candidates for faces in colour images. A wider range of lighting conditions can be accommodated without the need for colour calibration so that this technique is more reliable and convenient than the known techniques. By reducing the resolution of the saturation of the image, computational requirements are substantially reduced and a relatively simple method may be used. Averaging increases the uniformity of saturation in a face region so that this technique is capable of recognising candidates for faces in images of people of different ages, sexes and skin colours and can even cope with the wearing of glasses of light colour. Because this technique is very efficient, it can be implemented in real time and may be used in low cost commercial applications.
This technique may be used in the initialisation stage 9 shown in FIG. 3 of the accompanying drawings for the image tracking system disclosed in EP 0 877 274 and GB 2 324 428. Further, this technique may be used as the first part of a two stage face detection and recognition technique as disclosed, for instance, in U.S. Pat. Nos. 5,164,992 and 5,012,522, Turk et al, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol 3, No 1, pages 70 to 86, 1991, Yuille et al, "Feature Extraction from Faces using Deformable Templates", International Journal of Computer Vision, 8(2), pages 99 to 111, 1992 and Yang et al, "Human Face Detection in Complex Background", Pattern Recognition, vol 27, No 1, pages 53 to 63, 1994. In such two stage techniques, the first stage locates the approximate position of the face and the second stage provides further analysis of each candidate face region to confirm the existence of the face and to extract accurate facial features such as eyes, nose and lips. The first stage does not require high accuracy and so may be implemented with fast algorithms. The number of image regions which have to be analysed in the second stage is limited by the first stage. This is advantageous because the second stage generally requires more sophisticated algorithms and is thus more computing-intensive.