1. Field of the Invention
The present disclosure relates to a method and systems for detecting one or more faces within a digital image.
2. Description of the Related Art
Detection of faces or heads in images is a useful capability in video conferencing systems and other types of video systems. In systems having a video image capturing device (e.g., video cameras, video conferencing equipment, web cameras, or the like), face detection facilitates functionality such as, but not limited to, optimum view definition, area targeting for focusing purposes (to ensure that the individuals in the video are in focus), color optimization (to ensure correct face colors), or the like.
Face detection requires the face detecting device (or logic) to examine or process thousands, if not millions, of candidate windows within one digital image in an effort to locate the portion(s) of a video frame (or image) that may contain a human face. Conventional techniques call for the image data within the candidate windows to be manipulated and examined at various positions and/or scales. Such processing can lead to slow detection speeds.
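By way of illustration only, the following sketch counts the candidate windows visited by an exhaustive scan over positions and scales. The function name and every parameter (a 24-pixel base window, a scale factor of 1.25, and a one-pixel step) are assumptions chosen for the example and are not taken from any reference cited herein.

```python
def count_candidate_windows(width, height, base=24, scale=1.25, step=1):
    """Count the square windows visited by an exhaustive multi-scale scan.

    All defaults are illustrative assumptions, not values from a specific
    detector: `base` is the smallest window side in pixels, `scale` is the
    growth factor between passes, and `step` is the stride in pixels.
    """
    total = 0
    size = float(base)
    while int(size) <= min(width, height):
        side = int(size)
        # Number of window positions along each axis at this scale.
        positions_x = (width - side) // step + 1
        positions_y = (height - side) // step + 1
        total += positions_x * positions_y
        size *= scale
    return total


if __name__ == "__main__":
    # A 640x480 frame scanned with a one-pixel step.
    print(count_candidate_windows(640, 480))
```

Under these assumptions, the smallest scale alone contributes hundreds of thousands of windows for a 640x480 frame, which is consistent with the processing burden described above.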
A variety of face detection techniques are known and continue to be adapted and improved upon. The following is a list of some exemplary references, the entire contents of which are hereby incorporated by reference:
[1] P. Viola and M. Jones. “Robust real time object detection.” IEEE ICCV Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, Jul. 13, 2001.
[2] A. Pentland, B. Moghaddam, and T. Starner. “View-based and Modular Eigenspaces for Face Recognition.” Proc. of IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 84-91, June 1994. Seattle, Wash.
[3] M. Bichsel and A. P. Pentland. “Human face recognition and the face image set's topology.” CVGIP: Image Understanding, 59:254-261, 1994.
[4] R. E. Schapire. “The boosting approach to machine learning: An overview.” MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[5] T. Serre, et al. “Feature selection for face detection.” AI Memo 1697, Massachusetts Institute of Technology, 2000.
[6] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998.
[7] Y. Freund and R. E. Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting.” Journal of Computer and System Sciences, 55(1):119-139, August 1997.
Several approaches exist for the detection of faces in images. One of the faster methods today, developed for single-frame analysis, is the cascaded classifier using Haar-like features, found in P. Viola and M. Jones, “Robust real time object detection,” listed above. Viola and Jones use a series (termed a cascade) of trained classifiers. These cascades are trained on large sets of images, both with and without faces (termed positive and negative samples), to learn distinguishing features of a face. When applied to an image (in this case, a single frame from a video), each classifier from the cascade is applied to regions (or windows) of the image, where the size of the window increases for each iteration. In the Viola and Jones method, the detector is based on local geometric features in a gray-level image of a scene. One typical feature, for example, is dark eye sockets compared to brighter surroundings, or the like. However, the method of Viola and Jones only considers facial features for each window (or region), and needs to process each region for facial features before it can determine whether the region contains a face. As a result, there is a high processing load on the system performing the method because detailed analysis must be performed on an image even in regions where, for example, color may suggest that no face exists.
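By way of illustration only, the following sketch shows how a gray-level rectangle feature of the kind described above (a darker upper band over a brighter lower band) can be evaluated with a summed-area table, and how a cascade rejects a window at the first failing stage. It is a simplified sketch and not the method of Viola and Jones as published: each stage here is a single hand-written feature with a hypothetical threshold, whereas a trained cascade combines many boosted weak classifiers per stage. All function names and threshold values are assumptions.

```python
def integral_image(gray):
    """Build a summed-area table with an extra leading row and column of zeros."""
    h, w = len(gray), len(gray[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the w-by-h rectangle whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def dark_upper_band_feature(ii, x, y, w, h):
    """Two-rectangle feature: lower-half sum minus upper-half sum.

    The value is large and positive when the upper band is darker than the
    lower band, loosely analogous to dark eye regions above brighter cheeks.
    """
    half = h // 2
    upper = rect_sum(ii, x, y, w, half)
    lower = rect_sum(ii, x, y + half, w, half)
    return lower - upper


def run_cascade(ii, x, y, size, stages):
    """Return True only if the window passes every stage.

    `stages` is a list of (feature_function, threshold) pairs; the thresholds
    are hypothetical and would normally come from training.
    """
    for feature, threshold in stages:
        if feature(ii, x, y, size, size) < threshold:
            return False  # early rejection: later stages are never evaluated
    return True


if __name__ == "__main__":
    # 4x4 toy window: dark top half, bright bottom half.
    window = [[10] * 4, [10] * 4, [200] * 4, [200] * 4]
    ii = integral_image(window)
    print(run_cascade(ii, 0, 0, 4, [(dark_upper_band_feature, 1000)]))
```

Note that every window still requires at least one feature evaluation before it can be rejected, which is the source of the processing load described above.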
A conventional method of reducing the number of candidate windows that need to be processed and examined is to define a set of colors (i.e., face colors) presumed to be colors found in regions of the image representing a face. Thus, the face detection unit may only process and examine the parts of the image containing pixels having colors corresponding with the defined set of face colors. However, numerous video systems, such as, but not limited to, video conferencing equipment, are typically placed in a variety of different environments with very different illumination and lighting conditions. Video conferencing endpoints are often placed on desktops near windows (giving varied illumination even if the system remains stationary), in well-lit and poorly lit meeting rooms, in large lecture halls, in conference rooms with skin-toned furniture or walls, or the like.
Therefore, despite the value of the region color for classifying faces, the variability in measured skin color between different illuminations is quite large, making color difficult to use reliably. Additionally, in images containing skin-colored walls or furniture, the face detection logic would still spend computational time on large areas not containing faces. Moreover, because the measured color depends on the illumination, it is impossible to know the actual color of skin in an image before a reliable face detection is achieved.
Further, in conventional systems, the explicit use of skin color for face detection, together with the dependency of the registered color on the illumination used, makes it difficult to build a robust detector.
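By way of illustration only, the following sketch shows a fixed face-color pre-filter of the conventional kind described above, together with the two weaknesses noted: a dimly lit skin pixel falls outside the fixed color set, while a skin-toned wall pixel falls inside it. The RGB thresholds, the sample pixel values, and the function names are all assumptions chosen for the example and do not represent any particular system.

```python
def is_face_color(r, g, b):
    """Fixed RGB rule standing in for a predefined set of face colors.

    The thresholds are illustrative assumptions only.
    """
    return (r > 95 and g > 40 and b > 20
            and r > g and r > b
            and (r - min(g, b)) > 15)


def face_color_fraction(pixels):
    """Fraction of (r, g, b) pixels in a window that match the face-color rule."""
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_face_color(*p)) / len(pixels)


def keep_window(pixels, min_fraction=0.3):
    """Pass a window on to the detector only if enough pixels look face-colored.

    `min_fraction` is a hypothetical tuning parameter.
    """
    return face_color_fraction(pixels) >= min_fraction


if __name__ == "__main__":
    skin_well_lit = (200, 140, 120)   # hypothetical skin pixel in good light
    skin_dim = (80, 56, 48)           # the same surface under weak illumination
    beige_wall = (210, 180, 140)      # hypothetical skin-toned wall pixel

    print(is_face_color(*skin_well_lit))  # accepted by the fixed rule
    print(is_face_color(*skin_dim))       # rejected: a face would be missed
    print(is_face_color(*beige_wall))     # accepted: time wasted on a wall
```

In this sketch the same rule both misses a face under weak illumination and admits a non-face region, which illustrates why a fixed color set is difficult to make robust.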