This invention relates in general to the field of digital image processing and computer vision, and more particularly to methods for detecting human figures in a digital image.
Digital images are widely used in image communication. One opportunity opened by digital images is that computer vision techniques can be used to detect objects of interest in images. Among all conceivable objects found in images, human figures stand out as the ones of highest general interest.
There have been extensive research and development activities over the past two decades on human face detection. For example, U.S. Pat. No. 5,835,616, issued Nov. 11, 1998 to Lobo, discloses a two step process for automatically finding a human face in an electronically digitized image (for example, one taken by handheld digital cameras and digital video cameras such as camcorders), and for confirming the existence of the face by examining facial features. Step 1 is to detect the human face and is accomplished in stages that include enhancing the digital image with a blurring filter and edge enhancer in order to better set forth the unique facial features such as the wrinkles and curves of a facial image. After prefiltering, preselected curves, sometimes referred to as snakelets, are dropped on the image, where they become aligned to the natural wrinkles and curves of a facial image. Step 2 is to confirm the existence of the human face in seven stages by finding facial features of the digital image encompassing the chin, sides of the face, virtual top of the head, eyes, mouth and nose of the image. Ratios of the distances between these found facial features can be compared to previously stored reference ratios for recognition. This method for detecting facial features in an image can be used in applications such as, but not limited to, detecting human faces for the gathering of population age statistics from patrons at entertainment/amusement parks and television network viewer-rating studies. Such gathering can include counting the patrons, distinguishing certain age and gender groups, and/or identifying specific people. Computer vision with this capability can further have application in such fields as automated surveillance systems, demographic studies, automated photography for point-and-shoot cameras and human-computer interaction. Automated photography can eliminate the manual adjustment problems that result in poor quality due to out-of-focus subjects. Computer systems can utilize this system to recognize and respond to the specific needs of a user, and further translate for human users.
The value of "face detection" in various applications is already known. However, "person detection" or "human figure detection" could potentially give yet more information, for two reasons: person detection encompasses more than just the face, and person detection can be successful in situations where face detection is not.
Main subject detection, exposure compensation for subject, and image compositing would all benefit from knowledge of person-regions instead of only face regions. In a picture of a person, the main subject is usually not just the face, but the whole person. For digital editing, it is also quite reasonable to insist that the whole person be treated as a unit in compositing or zooming and cropping rather than working with disembodied faces. And in exposure compensation, it may be argued that proper compensation for a subject should include consideration not only of the face but also the associated hair and clothing.
Face detection can be expected to fail when the face in the photograph is "too small," perhaps on the order of a 10 pixel eye-to-eye distance, or out of focus. For such pictures some types of person-detector may still succeed. In that case person-detection may be considered a replacement for face-detection in applications where face detection would otherwise be helpful, such as frame orientation determination and main subject detection.
The ideal person detector would label each pixel of an image according to whether or not it is part of a person, and if so which person it is associated with. Pixels associated with a person include the body and hair and worn clothing, basically anything that moves as a unit with the person. Person detection should be successful regardless of pose, posture, cropping, occlusion, costume, or other atypical circumstances. Objects held in the hands (an umbrella, a bag, a baby) are a gray area and may be included or excluded depending on the specific application.
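By way of illustration only, the output of such an ideal detector might be represented as a label map in which each pixel holds either a background value or the identifier of the person it belongs to. The following is a minimal sketch under that assumption; the names BACKGROUND, person_pixels, and bounding_box, and the use of 0 for background, are introduced here for illustration and are not taken from the prior art.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BACKGROUND = 0  # assumed label for pixels that belong to no person

Pixel = Tuple[int, int]  # (row, column)


def person_pixels(label_map: List[List[int]]) -> Dict[int, List[Pixel]]:
    """Collect the pixel coordinates assigned to each person identifier."""
    pixels: Dict[int, List[Pixel]] = defaultdict(list)
    for row_index, row in enumerate(label_map):
        for col_index, label in enumerate(row):
            if label != BACKGROUND:
                pixels[label].append((row_index, col_index))
    return dict(pixels)


def bounding_box(pixels: List[Pixel]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, for a list of pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))


# A toy 3 x 5 label map containing two persons (labels 1 and 2).
label_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 0, 2, 2],
]
persons = person_pixels(label_map)
boxes = {person_id: bounding_box(p) for person_id, p in persons.items()}
```

A per-pixel map of this kind carries strictly more information than a face bounding box, since the person region for cropping or exposure compensation can be derived from it directly.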
It should be apparent that this is a hard problem. It ranges from close-up "head and shoulder" views of one or two persons, to medium-range group pictures of seated persons partially occluding standing persons, to distant crowds composed of many mostly-occluded persons, possibly with their backs turned to the camera.
A few approaches known in the prior art and dealing with similar problems include the following:
Oren et al. disclosed a method for pedestrian detection using wavelet-based templates in the Proceedings of Computer Vision and Pattern Recognition, 1997. The method is based on template matching, which refers to applying a predetermined intensity pattern ("template") across the image for all locations and possible sizes of the actual object ("pedestrian"). Wavelet templates are used to reduce the sensitivity to variations in subject clothing and lighting conditions. The method is suitable only for "pedestrians", i.e., low-detail figures in a walking pose. It is also computationally expensive because of the exhaustive search over all locations and sizes.
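For illustration, the exhaustive search over locations and sizes described above might be sketched as follows. This is a minimal sketch and not the method of Oren et al.: it substitutes a plain intensity template, scored by the mean absolute difference, for the wavelet templates. The names resize_nearest, abs_diff_sum, and exhaustive_match, and the use of integer scale factors, are assumptions introduced here. The returned count of windows tested indicates why the search is costly.

```python
from typing import List, Optional, Sequence, Tuple

Image = List[List[float]]


def resize_nearest(template: Image, height: int, width: int) -> Image:
    """Rescale a template to (height, width) by nearest-neighbour sampling."""
    src_h, src_w = len(template), len(template[0])
    return [
        [template[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def abs_diff_sum(image: Image, patch: Image, top: int, left: int) -> float:
    """Sum of absolute differences between the patch and one image window."""
    return sum(
        abs(image[top + r][left + c] - patch[r][c])
        for r in range(len(patch))
        for c in range(len(patch[0]))
    )


def exhaustive_match(
    image: Image, template: Image, scales: Sequence[int]
) -> Tuple[Optional[Tuple[float, int, int, int]], int]:
    """Try every location at every scale.

    Returns ((score, top, left, scale), windows_tested); lower scores are better.
    """
    img_h, img_w = len(image), len(image[0])
    base_h, base_w = len(template), len(template[0])
    best = None
    tested = 0
    for scale in scales:
        h, w = base_h * scale, base_w * scale
        if h > img_h or w > img_w:
            continue  # the scaled template does not fit in the image
        patch = resize_nearest(template, h, w)
        for top in range(img_h - h + 1):
            for left in range(img_w - w + 1):
                tested += 1
                score = abs_diff_sum(image, patch, top, left) / (h * w)
                if best is None or score < best[0]:
                    best = (score, top, left, scale)
    return best, tested


# Toy example: an 8 x 8 image containing the template once at row 3, column 2.
image = [[0.0] * 8 for _ in range(8)]
image[3][2], image[3][3] = 1.0, 2.0
image[4][2], image[4][3] = 3.0, 4.0
template = [[1.0, 2.0], [3.0, 4.0]]
best, windows_tested = exhaustive_match(image, template, scales=[1, 2])
```

The number of windows grows with the product of image area and the number of sizes tried, and each window costs a full template comparison, which is the source of the expense noted above.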
Forsyth et al. disclosed a method for "naked people" detection using skin detection and limb grouping (David Forsyth, Margaret Fleck, and Chris Bregler, "Finding Naked People", 1996 European Conference on Computer Vision, Volume II, pp. 592-602). They first locate images containing large skin-colored regions, and then find elongated regions and group them into possible human limbs and connected groups of limbs. The assumptions are:
humans are made of parts whose shape is relatively simple;
there are few ways to assemble these parts;
the kinematics of the assembly ensures that many configurations are impossible; and
when one can measure motion, the dynamics of these parts are limited.
They use the following model:
skin regions lack texture and have a limited range of hues and saturation;
grouping rules to assemble simple groups (body segments) into complex groups (limb-segment girdles), incorporating constraints on the relative positions of 2D features, induced by geometric and kinematic constraints on 3D body parts;
grouping is performed on edge segments: pairs of edge points with a near-parallel local symmetry and no other edges in between; sets of points forming regions with roughly straight axes ("ribbons");
pairs of ribbons whose ends lie close together, and whose cross-sections are similar in length, are grouped together to make limbs;
limbs are grouped together into putative girdles; and
segments are grouped to form a spine-thigh group.
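As an illustration of the ribbon-pairing rule in the model above, the following minimal sketch pairs ribbons whose ends lie close together and whose cross-sections are similar in length. It is not the implementation of Forsyth et al.; the Ribbon type, the function names, and the thresholds max_gap and max_width_ratio are hypothetical values chosen only for the example.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Ribbon:
    start: Point   # one end of the ribbon axis, as (x, y)
    end: Point     # the other end of the ribbon axis
    width: float   # cross-section length, assumed to be positive


def closest_end_gap(a: Ribbon, b: Ribbon) -> float:
    """Smallest distance between an end of ribbon a and an end of ribbon b."""
    return min(
        math.dist(p, q) for p in (a.start, a.end) for q in (b.start, b.end)
    )


def pair_into_limbs(
    ribbons: List[Ribbon],
    max_gap: float = 5.0,          # hypothetical threshold, in pixels
    max_width_ratio: float = 1.5,  # hypothetical similarity threshold
) -> List[Tuple[Ribbon, Ribbon]]:
    """Return ribbon pairs with nearby ends and similar cross-sections."""
    limbs = []
    for a, b in combinations(ribbons, 2):
        width_ratio = max(a.width, b.width) / min(a.width, b.width)
        if closest_end_gap(a, b) <= max_gap and width_ratio <= max_width_ratio:
            limbs.append((a, b))
    return limbs


upper_arm = Ribbon(start=(0.0, 0.0), end=(0.0, 20.0), width=6.0)
forearm = Ribbon(start=(1.0, 22.0), end=(15.0, 35.0), width=5.0)
wide_nearby = Ribbon(start=(0.0, -2.0), end=(-30.0, -2.0), width=20.0)
far_away = Ribbon(start=(100.0, 100.0), end=(100.0, 150.0), width=6.0)

limbs = pair_into_limbs([upper_arm, forearm, wide_nearby, far_away])
```

In this example only the upper arm and forearm are paired: the wide ribbon touches the upper arm but fails the cross-section test, and the distant ribbon fails the proximity test.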
The problems with this method are:
some suggested grouping rules are not complete; and
clothed people are hard to segment because clothing is often marked with complex patterns, subject to distortion caused by changes in surface orientation.
Felzenszwalb and Huttenlocher disclosed a method for human figure matching using a deformable model in the Proceedings of Computer Vision and Pattern Recognition, 2000. This method is based on matching of a deformable model represented by spring-like connections between pairs of parts. The human figure model is the following: each part fits a rectangular box of the same intensity; each connection specifies the cost associated with deviations in each of the relative orientation, size, and joint alignment.
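A spring-like connection of the kind described above might, for illustration, be scored as follows. This is a simplified sketch that uses quadratic penalties and should not be read as the exact cost function of Felzenszwalb and Huttenlocher; the Part and Connection types, the weights, and the convention that a part is a segment with a center, an angle, and a length are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Part:
    x: float       # center x
    y: float       # center y
    angle: float   # orientation of the long axis, in radians
    length: float  # size of the part along its long axis

    def end_point(self, sign: int) -> Tuple[float, float]:
        """Far end (sign=+1) or near end (sign=-1) of the part's axis."""
        half = self.length / 2.0
        return (
            self.x + sign * half * math.cos(self.angle),
            self.y + sign * half * math.sin(self.angle),
        )


@dataclass
class Connection:
    rest_angle: float = 0.0       # preferred child angle minus parent angle
    rest_size_ratio: float = 1.0  # preferred child length / parent length
    w_angle: float = 1.0          # hypothetical spring weights
    w_size: float = 1.0
    w_joint: float = 1.0


def wrap_angle(a: float) -> float:
    """Map an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def connection_cost(parent: Part, child: Part, conn: Connection) -> float:
    """Quadratic penalty for deviations in orientation, size, and joint alignment."""
    d_angle = wrap_angle((child.angle - parent.angle) - conn.rest_angle)
    d_size = child.length / parent.length - conn.rest_size_ratio
    px, py = parent.end_point(+1)  # where the parent expects the joint
    cx, cy = child.end_point(-1)   # where the child places the joint
    d_joint_sq = (px - cx) ** 2 + (py - cy) ** 2
    return (
        conn.w_angle * d_angle ** 2
        + conn.w_size * d_size ** 2
        + conn.w_joint * d_joint_sq
    )


spring = Connection()
upper = Part(x=0.0, y=0.0, angle=0.0, length=10.0)
aligned = Part(x=10.0, y=0.0, angle=0.0, length=10.0)
shifted = Part(x=10.0, y=3.0, angle=0.0, length=10.0)
bent = Part(x=5.0, y=5.0, angle=math.pi / 2, length=10.0)

costs = {
    "aligned": connection_cost(upper, aligned, spring),
    "shifted": connection_cost(upper, shifted, spring),
    "bent": connection_cost(upper, bent, spring),
}
```

Matching then amounts to choosing part placements that minimize the sum of such connection costs together with a per-part appearance cost, which is why the model must be specified before the search.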
The problems with this method are:
detection is limited to matching a known human figure with known parts (the model has to be built for the expected person in the image);
parts are not obtained in a robust fashion (possible part locations are quantized into 50 buckets for each of the x and y positions, 10 buckets for size and 20 buckets for orientation); and
the matching result is a rough location of the person and is inadequate for many applications.
The method taught by Oren et al. assumes that the human figure is in a standing position (pedestrian) and is shown in full with little or no occlusion.
The method taught by Forsyth et al. is based on a number of assumptions, such as:
The human figure is naked;
All the human body parts can be detected as skin regions; and
All background regions are not detected as skin regions.
The method taught by Felzenszwalb and Huttenlocher is designed primarily for matching of a known human figure rather than detection of an unknown human figure. It is also based on a number of assumptions, such as:
The human figure model is pre-specified and does not change;
All the exposed human body parts can be detected as uniformly skin colored regions; and
All the clothing parts are detected as uniformly colored regions.
These assumptions, however, may not hold for many image applications. For example, in most applications, it is not feasible to build a model of the human figure before the search, or to restrict the pose to a standing position. Also, in most applications, people would wear some kind of clothing.
There is therefore a need for a more efficient algorithm that detects generic human figures in an image without making any assumption about the pose, posture, cropping, occlusion, costume, or other atypical circumstances. The only assumptions are that the image is of reasonable quality so that different regions can be discerned, and that the human figures are of reasonable sizes so that body parts can be segmented.
According to the present invention, there is provided a solution to the problems of the prior art. The need is met according to the present invention by providing a digital image processing method for detecting human figures in a digital color image having pixels representing RGB values, comprising the steps of: segmenting the image into non-overlapping regions of homogeneous color or texture; detecting candidate regions of human skin color; detecting candidate regions of human faces; and for each candidate face region, constructing a human figure by grouping regions in the vicinity of the face region according to a predefined graphical model of the human figure, giving priority to human skin color regions.
According to a feature of the present invention, there is provided a digital image processing method for detecting human figures in a digital color image having pixels representing RGB values, comprising the steps of:
providing a digital color image having pixels representing RGB values;
segmenting the digital color image into non-overlapping regions of homogeneous color or texture;
detecting candidate regions of human skin color;
detecting candidate regions of human faces; and
for each candidate face region, constructing a human figure by grouping regions in the vicinity of the face region according to a pre-defined graphical model of the human figure, giving priority to human skin color regions.
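By way of a hedged illustration only, the final grouping step recited above might be sketched as follows, assuming that segmentation, skin detection, and face detection have already produced a set of regions with adjacency information. The pre-defined graphical model of the human figure is not detailed in this section, so the sketch substitutes a simple stand-in: regions are collected outward from each candidate face up to a hypothetical hop limit (max_hops), and skin-colored regions are taken first. The Region type and all function names are introduced here for the example.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Region:
    name: str
    is_skin: bool = False   # assumed output of the skin color detection step
    is_face: bool = False   # assumed output of the face detection step
    neighbors: Set[int] = field(default_factory=set)


def connect(regions: Dict[int, Region], a: int, b: int) -> None:
    """Record that two segmented regions share a boundary."""
    regions[a].neighbors.add(b)
    regions[b].neighbors.add(a)


def build_figure(regions: Dict[int, Region], face_id: int, max_hops: int = 3) -> List[int]:
    """Group regions around one candidate face, taking skin regions first."""
    figure = [face_id]
    seen = {face_id}
    # Heap entries: (0 for skin or 1 otherwise, hops along the path taken, id).
    heap: List[Tuple[int, int, int]] = []

    def push_neighbors(region_id: int, hops: int) -> None:
        if hops >= max_hops:
            return
        for n in sorted(regions[region_id].neighbors):
            if n in seen or regions[n].is_face:
                continue  # already queued, or belongs to another candidate face
            seen.add(n)
            priority = 0 if regions[n].is_skin else 1
            heapq.heappush(heap, (priority, hops + 1, n))

    push_neighbors(face_id, 0)
    while heap:
        _, hops, region_id = heapq.heappop(heap)
        figure.append(region_id)
        push_neighbors(region_id, hops)
    return figure


def detect_figures(regions: Dict[int, Region]) -> Dict[int, List[int]]:
    """Construct one figure per candidate face region."""
    return {
        rid: build_figure(regions, rid)
        for rid, region in regions.items()
        if region.is_face
    }


regions = {
    0: Region("face", is_skin=True, is_face=True),
    1: Region("neck", is_skin=True),
    2: Region("hair"),
    3: Region("shirt"),
    4: Region("hand", is_skin=True),
    5: Region("trousers"),
    6: Region("ground"),
}
for a, b in [(0, 1), (0, 2), (1, 3), (3, 4), (3, 5), (5, 6)]:
    connect(regions, a, b)

figures = detect_figures(regions)
```

In this toy example the figure built from the face contains the neck, hair, shirt, hand, and trousers, while the ground region lies beyond the hop limit and is left out. A real graphical model would constrain the grouping by the expected arrangement of body parts rather than by hop count alone.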
The present invention has the advantage that clothed, unknown human figures can be more reliably detected in a digital image.