The present invention relates to a low bit-rate communication system for multimedia applications, such as a video teleconferencing system, and more particularly, to a method of, and system for, identifying skin areas in video images.
The storage and transmission of full-color, full-motion images is increasingly in demand. These images are used, not only for entertainment, as in motion picture or television productions, but also for analytical and diagnostic tasks such as engineering analysis and medical imaging.
There are several advantages to providing these images in digital form. For example, digital images are more susceptible to enhancement and manipulation. Also, digital video images can be regenerated accurately over several generations with only minimal signal degradation.
On the other hand, digital video requires significant memory capacity for storage and, equivalently, a high-bandwidth channel for transmission. For example, a single 512 by 512 pixel gray-scale image with 256 gray levels requires more than 256,000 bytes of storage. A full color image requires nearly 800,000 bytes. Natural-looking motion requires that images be updated at least 30 times per second. A transmission channel for natural-looking full color moving images must therefore accommodate approximately 190 million bits per second. However, modern digital communication applications, including videophones, set-top boxes for video-on-demand, and video teleconferencing systems, have transmission channels with bandwidth limitations, so that the number of bits available for transmitting video image information is less than 190 million bits per second.
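The figures quoted above can be reproduced with a short calculation. The sketch below assumes one byte per gray-scale pixel (256 gray levels) and three 8-bit components per full color pixel; the variable names are illustrative only.

```python
# Back-of-the-envelope storage and bandwidth figures for uncompressed video.
WIDTH, HEIGHT = 512, 512      # pixels per frame
BYTES_PER_GRAY_PIXEL = 1      # 256 gray levels fit in 8 bits
COLOR_COMPONENTS = 3          # assumption: three 8-bit components per pixel
FRAMES_PER_SECOND = 30        # update rate for natural-looking motion

gray_bytes = WIDTH * HEIGHT * BYTES_PER_GRAY_PIXEL
color_bytes = gray_bytes * COLOR_COMPONENTS
bits_per_second = color_bytes * 8 * FRAMES_PER_SECOND

print(gray_bytes)       # 262144    (more than 256,000 bytes)
print(color_bytes)      # 786432    (nearly 800,000 bytes)
print(bits_per_second)  # 188743680 (approximately 190 million bits per second)
```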
As a result, a number of image compression techniques such as, for example, discrete cosine transformation (DCT) have been used to reduce the information capacity required for the storage and transmission of digital video signals. These techniques generally take advantage of the considerable redundancy in any natural image, so as to reduce the amount of data used to transmit, record, and reproduce the digital video images. For example, if the video image to be transmitted is an image of the sky on a clear day, the discrete cosine transform (DCT) image data information has many zero data components since there is little or no variation in the objects depicted for such an image. Thus, the image information of the sky on a clear day is compressed by transmitting only the small number of non-zero data components.
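The clear-sky example can be illustrated with a small sketch. It assumes an 8 by 8 block size and an orthonormal two-dimensional DCT-II written out directly in Python; the helper names `dct_2d` and `count_significant` and the sample pixel values are hypothetical and are not taken from any particular coder.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out

def count_significant(coeffs, eps=1e-6):
    """Count coefficients whose magnitude exceeds a small tolerance."""
    return sum(1 for row in coeffs for c in row if abs(c) > eps)

# A uniform block (clear sky) versus a block with pixel-to-pixel variation.
flat_sky = [[200] * 8 for _ in range(8)]
textured = [[(37 * x + 91 * y * y) % 256 for y in range(8)] for x in range(8)]

print(count_significant(dct_2d(flat_sky)))  # 1 (only the DC component survives)
print(count_significant(dct_2d(textured)) > 1)  # True
```

Only the single non-zero component of the uniform block would need to be transmitted, which is the source of the compression gain described above.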
One problem associated with image compression techniques, such as discrete cosine transformation (DCT), is that they produce lossy images, since only partial image information is transmitted in order to reduce the bit rate. A lossy image is a video image which contains distortions in the objects depicted, when the decoded image content is compared with the original image content. Since most video teleconferencing or telephony applications are focused toward images containing persons rather than scenery, the ability to transmit video images without distortions is important. This is because a viewer will tend to focus his or her attention toward specific features (objects) contained in the video sequences, such as the faces, hands or other skin areas of the persons in the scene, instead of toward items such as, for example, clothing and background scenery.
In some situations, a very good rendition of facial features contained in a video sequence is paramount to intelligibility, such as in the case of hearing-impaired viewers who may rely on lip reading. For such an application, decoded video image sequences which contain distorted facial regions can be annoying to a viewer, since such image sequences are often depicted with overly smoothed-out facial features, giving the faces an artificial quality. For example, fine facial features such as wrinkles that are present on faces found in an original video image tend to be erased in a decoded version of a compressed and transmitted video image, thus hampering the viewing of the video image.
Several techniques for reducing distortions in skin areas of images that are transmitted have focused on extracting qualitative information about the content of the video images, including faces, hands and the other skin areas of the persons in the scene, in order to code such identified areas with less compression than the rest of the image. Thus, these identified areas are coded and transmitted using a larger number of bits per second, so that such areas contain fewer distorted features when the video images are decoded.
In one technique, a sequence of video images is searched for symmetric shapes. A symmetric shape is defined as a shape which is divisible into identical halves about an axis of symmetry. An axis of symmetry is a line segment which divides an object into equal parts. Examples of symmetrical shapes include squares, circles and ellipses. If the objects in a video image are searched for symmetrical shapes, some of the faces and heads shown in the video image are identifiable. Faces and heads that are depicted symmetrically typically approximate the shape of an ellipse and have an axis of symmetry vertically positioned between the eyes, through the center of the nose and halfway across the mouth. Each half-ellipse is symmetric because each contains one eye, half of the nose and half of the mouth. However, only those faces and heads that are symmetrically depicted in the video image are recognizable, precluding the identification of heads and faces when viewed in profile (turned to the left or turned to the right), since a face or head viewed in profile does not contain an axis of symmetry. Hands and other skin areas of the persons in the scene are similarly not symmetric objects and are also not recognizable using a symmetry-based technique.
Another technique searches the video images for specific geometric shapes such as, for example, ellipses, rectangles or triangles. Searching the video images for specific geometric shapes can often locate heads and faces, but still cannot identify hands and other skin areas of persons in the scene, since such areas are typically not represented by a specified geometric shape. Additionally, partially obstructed faces and heads which do not approximate a specified geometric shape are similarly not recognizable.
In yet another technique, a sequence of video images is searched using color (hue) to identify skin areas including heads, faces and hands. Color (hue) based identification is dependent upon using a set of specified skin tones to search the video sequences for objects which have matching skin colors. While the color (hue) based techniques are useful to identify some hands, faces or other skin areas of a scene, many other such areas cannot be identified since not all persons have the same skin tone. In addition, color variations in many skin areas of the video sequences will also not be detectable. This is because the use of a set of specified skin tones to search for matching skin areas precludes color-based techniques from compensating for unpredictable changes to the color of an object, such as variations attributable to background lighting and/or shading.
Accordingly, skin identification techniques that identify hands, faces and other skin areas of persons in a scene continue to be sought.
The present invention is directed to a skin area detector for identifying skin areas in video images and, in an illustrative application, is used in conjunction with the video coder of video encoding/decoding (Codec) equipment. The skin area detector identifies skin areas in video frames by initially analyzing the shape of all the objects in a video sequence to locate one or more objects that are likely to contain skin areas. Objects that are likely to contain skin areas are further analyzed to determine if the picture elements (pixels) of any such object or objects have signal energies characteristic of skin regions. The term signal energy as used herein refers to the sum of the squares of the luminance (brightness) parameter for a specified group of pixels in the video signal. The signal energy includes two components: a direct current (DC) signal energy and an alternating current (AC) signal energy. The color parameters of objects with picture elements (pixels) that have signal energies characteristic of skin regions are then sampled to determine a range of skin tone values for the object. This range of sampled skin tone values for the analyzed object is then compared with all the tones contained in the video image, so as to identify other areas in the video sequence having the same skin tone values. The identification of likely skin regions in objects based on shape analysis and a determination of the signal energies characteristic of skin regions is advantageous. This is because the subsequent color sampling of such identified objects to determine a range of skin tone values automatically compensates for color variations in the object, and thus skin detection is made dynamic with respect to the content of a video sequence.
In the present illustrative example, the skin area detector is integrated with but functions independently of the other component parts of the video encoding/decoding (Codec) equipment which includes an encoder, a decoder and a coding controller. In one embodiment, the skin area detector is inserted between the input video signal and the coding controller, to provide input related to the location of skin areas in video sequences, prior to the encoding of the video images.
In one example of the present invention, the skin area detector includes a shape locator and a tone detector. The shape locator analyzes input video sequences to identify the edges of all the objects in a video frame and determine whether such edges approximate the outline of a shape that is likely to contain a skin area. The shape locator is advantageously programmed to identify certain shapes that are likely to contain skin areas. For example, since human faces have a shape that is approximately elliptical, the shape locator is programmed to search for elliptically shaped objects in the video signal.
Since an entire video frame is too large to analyze globally, it is advantageous if the video frame of an input video sequence is first partitioned into image areas. For each image area, the edges of objects are then determined based on changes in the magnitude of the pixel (picture element) intensities for adjacent pixels. If the changes in the magnitude of the pixel intensities for adjacent pixels in each image area are larger than a specified magnitude, the location of such an image area is identified as containing an edge or a portion of the edge of an object.
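The partitioning and edge test described above might be sketched as follows. The area size, the threshold value, the choice of comparing each pixel with its right-hand and lower neighbors, and the function name `find_edge_areas` are all assumptions made for illustration; the text specifies only that intensity changes between adjacent pixels are compared against a specified magnitude.

```python
def find_edge_areas(frame, area_size=4, threshold=40):
    """Return the (row, col) origins of image areas that contain an edge.

    frame is a list of rows of pixel intensities. An area is flagged when
    any of its pixels differs from its right-hand or lower neighbor by more
    than the threshold (both values are illustrative assumptions).
    """
    height, width = len(frame), len(frame[0])

    def has_edge(top, left):
        for r in range(top, min(top + area_size, height)):
            for c in range(left, min(left + area_size, width)):
                if c + 1 < width and abs(frame[r][c] - frame[r][c + 1]) > threshold:
                    return True
                if r + 1 < height and abs(frame[r][c] - frame[r + 1][c]) > threshold:
                    return True
        return False

    return [(top, left)
            for top in range(0, height, area_size)
            for left in range(0, width, area_size)
            if has_edge(top, left)]

# An 8x8 frame: five dark columns followed by three bright columns.
frame = [[20] * 5 + [200] * 3 for _ in range(8)]
print(find_edge_areas(frame))  # [(0, 4), (4, 4)]
```

Only the two right-hand areas, which contain the dark-to-bright transition, are reported as holding a portion of an edge.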
Thereafter, identified edges or a portion of identified edges are further analyzed to determine if such edges, which represent the outline of an object, approximate a shape that is likely to contain a skin area. Since skin areas are usually defined by the softer curves of human shapes (e.g., the nape of the neck, and the curve of the chin), rigid angular borders are not typically indicative of skin areas. Thus, configurations that are associated with softer human shapes are usually selected as likely to contain skin areas. For example, since an ellipse approximates the shape of a person's face or head, the analysis of a video sequence to identify those outlines of objects which approximate ellipses advantageously determines some locations in the video sequence that are likely to contain skin areas. Also, in the context of video conferencing, at least one person is typically facing the camera, so if one or more persons are in the room, then it is likely that an elliptical shape will be identified.
Once objects likely to contain skin areas are located by the shape locator, the tone detector examines the picture elements (pixels) of each located object to determine if such pixels have signal energies that are characteristic of skin areas. The tone detector then samples the range of skin tones for such identified objects and compares the range of sampled skin tones with the tones in the entire frame to determine all matching skin tones. In the present embodiment, the signal energy components (DC and AC energy components) of the luminance parameter are advantageously determined using the discrete cosine transformation (DCT) technique.
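One simple way to judge whether an outline approximates an ellipse is sketched below. This is an assumed scoring method, not the matching procedure of the shape locator, which is not detailed here: the hypothetical function `ellipse_fit_error` compares edge points with the axis-aligned ellipse inscribed in their bounding box, so a score near zero indicates an elliptical outline and a larger score indicates an angular one.

```python
import math

def ellipse_fit_error(points):
    """Mean deviation of edge points from the axis-aligned ellipse inscribed
    in their bounding box (0.0 means every point lies on that ellipse)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx = (min(xs) + max(xs)) / 2.0   # ellipse center
    cy = (min(ys) + max(ys)) / 2.0
    a = (max(xs) - min(xs)) / 2.0    # horizontal semi-axis
    b = (max(ys) - min(ys)) / 2.0    # vertical semi-axis
    if a == 0 or b == 0:
        return float("inf")          # degenerate outline (a line or a point)
    errors = [abs(math.hypot((x - cx) / a, (y - cy) / b) - 1.0)
              for x, y in points]
    return sum(errors) / len(errors)

# Edge points sampled from a head-like ellipse and from a rectangular outline.
ellipse = [(50 + 30 * math.cos(2 * math.pi * k / 36),
            40 + 45 * math.sin(2 * math.pi * k / 36)) for k in range(36)]
rectangle = ([(x, 0) for x in range(0, 61, 5)]
             + [(x, 90) for x in range(0, 61, 5)]
             + [(0, y) for y in range(5, 90, 5)]
             + [(60, y) for y in range(5, 90, 5)])

print(ellipse_fit_error(ellipse) < 0.01)   # True
print(ellipse_fit_error(rectangle) > 0.1)  # True
```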
In the technique of the present invention, the discrete cosine transform (DCT) of the signal energy for a specified group of pixels in an object identified as likely to contain a skin area is calculated. Thereafter, the AC energy component of each pixel is determined by subtracting the DC energy component for each pixel from the discrete cosine transform (DCT). Based on the value of the AC energy component for each pixel, a determination is made as to whether the pixels have an AC signal energy characteristic of a skin area. If the AC signal energy for an examined pixel is less than a specified value, typically such pixels are identified as skin pixels. Thereafter, the tone detector samples the color parameters of such identified pixels and determines a range of color parameters indicative of skin tone that are contained within the region of the object.
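The energy test can be sketched at the level of a pixel block. The sketch relies on a known property of the orthonormal DCT: the sum of the squared coefficients equals the sum of the squared pixel values, and the squared DC coefficient equals the squared pixel sum divided by the pixel count. The AC energy is therefore obtained by subtraction without computing every coefficient. The threshold value and the function names are hypothetical, since the text states only that the AC energy is compared with a specified value.

```python
def block_energies(luma_block):
    """Return (total, dc, ac) luminance energy for a block of Y values."""
    pixels = [y for row in luma_block for y in row]
    total = sum(y * y for y in pixels)      # sum of squares of luminance
    dc = sum(pixels) ** 2 / len(pixels)     # squared orthonormal-DCT DC term
    ac = total - dc                         # energy carried by the AC terms
    return total, dc, ac

AC_THRESHOLD = 500.0  # hypothetical; the text says only "a specified value"

def looks_like_skin(luma_block, threshold=AC_THRESHOLD):
    """Low AC energy (a smooth region) is treated as characteristic of skin."""
    _, _, ac = block_energies(luma_block)
    return ac < threshold

# A nearly uniform block versus a high-contrast checkerboard.
smooth = [[150 + (x + y) % 2 for y in range(8)] for x in range(8)]
busy = [[0 if (x + y) % 2 else 255 for y in range(8)] for x in range(8)]

print(block_energies(smooth)[2])  # 16.0
print(looks_like_skin(smooth))    # True
print(looks_like_skin(busy))      # False
```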
The color parameters sampled by the tone detector are advantageously chrominance parameters, Cr and Cb. The term chrominance parameters as used herein refers to the color difference values of the video signal, wherein Cr is defined as the difference between the red color component and the luminance parameter (Y) of the video signal and Cb is defined as the difference between the blue color component and the luminance (Y) parameter of the video signal. The tone detector subsequently compares the range of identified skin tone values from the sampled object with the color parameters of the rest of the video frame to identify other skin areas.
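The sampling and comparison steps can be illustrated with a brief sketch. It uses the color difference definitions given above (Cr as red minus Y, Cb as blue minus Y) without any scaling or offset, and it assumes the ITU-R BT.601 luma weights for Y, which the text does not specify. The function names and pixel values are illustrative only.

```python
def chroma(rgb):
    """Return (Cr, Cb) as the plain color differences R - Y and B - Y."""
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b  # assumption: BT.601 luma weights
    return r - y, b - y

def sample_tone_range(skin_pixels):
    """Range of Cr and Cb values found among pixels identified as skin."""
    crs, cbs = zip(*(chroma(p) for p in skin_pixels))
    return (min(crs), max(crs)), (min(cbs), max(cbs))

def match_skin(frame, tone_range):
    """Boolean mask marking frame pixels whose Cr and Cb fall in the range."""
    (cr_lo, cr_hi), (cb_lo, cb_hi) = tone_range
    mask = []
    for row in frame:
        mask_row = []
        for pixel in row:
            cr, cb = chroma(pixel)
            mask_row.append(cr_lo <= cr <= cr_hi and cb_lo <= cb <= cb_hi)
        mask.append(mask_row)
    return mask

# Pixels sampled from an object already identified as likely skin.
samples = [(224, 172, 150), (210, 160, 140), (198, 150, 128)]
tone_range = sample_tone_range(samples)

# A 2x2 frame: skin-like, blue, gray, and skin-like pixels.
frame = [[(215, 165, 143), (30, 90, 200)],
         [(128, 128, 128), (210, 160, 140)]]
print(match_skin(frame, tone_range))  # [[True, False], [False, True]]
```

Because the range is taken from the frame itself rather than from a fixed table of skin tones, the comparison follows whatever lighting and skin tone are actually present in the sampled object.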
The skin area detector of the present invention thereafter analyzes the next frame of the video sequence to determine the range of skin tone values and identify skin areas in the next video frame. The skin area detector optionally uses the range of skin tone values identified in one frame of a video sequence to identify skin areas in subsequent frames of the video sequence.
The skin area detector optionally includes an eyes-nose-mouth (ENM) region detector for analyzing some objects which approximate the shape of a person's face or head, to determine the location of an eyes-nose-mouth (ENM) region. In one embodiment, the ENM region detector is inserted between the shape locator and the tone detector to identify the location of an ENM region and use such a region as a basis for analysis by the tone detector. The eyes-nose-mouth (ENM) region detector utilizes symmetry based methods to identify an ENM region located within an object which approximates the shape of a person's face or head. It is advantageous for the eyes-nose-mouth (ENM) region to be identified since such a region of the face contains skin color parameters as well as color parameters other than skin tone parameters, including for example, eye color parameters, eyebrow color parameters, lip color parameters and hair color parameters. Also, the identification of the eyes-nose-mouth (ENM) region reduces computational complexity, since skin tone parameters are sampled from a small region of the identified object.
Other objects and features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims.