The present invention relates to a system for encoding facial regions of a video that incorporates a model of the human visual system to encode frames in a manner that provides a substantially uniform apparent quality.
In many systems the number of bits available for encoding a video, consisting of a plurality of frames, is fixed by the bandwidth available in the system. Typically encoding systems use an ad hoc control technique to select quantization parameters that will produce a target number of bits for the video while simultaneously attempting to encode the video frames with the highest possible quality. For example, in digital video recording, a group of frames must occupy the same number of bits for an efficient fast-forward/fast-rewind capability. In video telephones, the channel rate, communication delay, and the size of the encoder buffer determine the number of available bits for a frame.
There are numerous systems that address the problem of how to encode video to achieve high quality while controlling the number of bits used. The systems are usually known as rate, quantizer, or buffer control techniques and can be generally classified into three major classes.
The first class are systems that encode each block of the image several times with a set of different quantization factors, measure the number of bits produced for each quantization factor, and then attempt to select a quantization factor for each block so that the bits for all the blocks sum to a target number. While generally accurate, such a technique is not suitable for real-time encoding systems because of its high computational complexity.
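A minimal sketch of this first class follows. The measurement step, the greedy relaxation of the most expensive block, and all names (`encode_block`, `choose_quantizers`) are illustrative assumptions, not the method of any particular system described here:

```python
def choose_quantizers(blocks, encode_block, q_values, target_bits):
    """Exhaustively measure bits for every (block, quantizer) pair,
    then coarsen quantizers greedily until the total meets the budget."""
    # Encode each block once per quantization factor and record the cost.
    cost = [{q: len(encode_block(b, q)) for q in q_values} for b in blocks]
    sorted_q = sorted(q_values)
    # Start every block at the finest (lowest) quantizer ...
    choice = [sorted_q[0] for _ in blocks]

    def total():
        return sum(cost[i][q] for i, q in enumerate(choice))

    # ... and repeatedly coarsen the most expensive block until the
    # total bit count fits the target.
    while total() > target_bits:
        i = max(range(len(blocks)), key=lambda k: cost[k][choice[k]])
        idx = sorted_q.index(choice[i])
        if idx + 1 == len(sorted_q):
            break  # cannot coarsen further; target unreachable
        choice[i] = sorted_q[idx + 1]
    return choice
```

The multiple trial encodings in the `cost` table are precisely what makes this class accurate but computationally expensive.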
The second class are systems that measure the number of bits used in previously encoded image blocks, buffer fullness, block activity, and use all these measures to select a quantization factor for each block of the image. Such techniques are popular for real-time encoding systems because of their low computational complexity. Unfortunately, such techniques are quite inaccurate and must be combined with additional techniques to avoid bit or buffer overflows and underflows.
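The second, feedback-driven class can be sketched as a single-pass heuristic in the spirit of MPEG-2 Test Model 5 adaptive quantization. The scaling constants and clamping range below are illustrative assumptions, not taken from any system discussed here:

```python
def feedback_quantizer(base_q, buffer_fullness, buffer_size,
                       activity, mean_activity):
    """One-pass heuristic: raise the quantizer as the output buffer
    fills, and modulate it by local block activity (busy blocks hide
    artifacts, so they tolerate coarser quantization).
    All constants here are illustrative, not normative."""
    # Buffer term: quantize more coarsely when the buffer is nearly full.
    fullness = buffer_fullness / buffer_size
    q = base_q * (0.5 + fullness)
    # Activity term (TM5-style normalization around the frame mean).
    act = (2 * activity + mean_activity) / (activity + 2 * mean_activity)
    q *= act
    # Clamp to a legal quantizer range.
    return max(1, min(31, round(q)))
```

Because the quantizer reacts only after bits are spent, such feedback schemes are cheap but inaccurate, which is why they need overflow/underflow safeguards.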
The third class are systems that use a model to predict the number of bits necessary for encoding each of the image blocks in terms of the block's quantization factor and other simple parameters, such as block variances. These models are generally based on mathematical approximations or predefined tables. Such systems are computationally simple and are suitable for real-time systems, but unfortunately they are highly sensitive to inaccuracies in the model itself.
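A sketch of the third, model-based class follows. The rate model, its coefficients, and the frame-wide quantizer search are hypothetical simplifications; real systems use tables or fitted approximations:

```python
def model_bits(variance, q, a=1.0, b=0.0):
    """Simple rate model: predicted bits grow with block variance
    and shrink with the quantization factor q. Coefficients a and b
    are illustrative and would be fitted or tabulated in practice."""
    return variance * (a / q + b / (q * q))

def pick_q(variances, target_bits, q_values):
    """Choose one frame-wide quantizer whose predicted total bit
    count is closest to the target -- no trial encoding required."""
    def predicted(q):
        return sum(model_bits(v, q) for v in variances)
    return min(q_values, key=lambda q: abs(predicted(q) - target_bits))
```

Since no block is ever encoded during selection, the cost is negligible, but any mismatch between `model_bits` and the actual encoder translates directly into a bit-count error, which is the sensitivity noted above.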
Some rate control systems incorporate face detection. One such system, along with other systems that use face detection, is described below.
Zhou, U.S. Pat. No. 5,550,581, discloses a low bit rate audio and video communication system that dynamically allocates bits among the audio and video information based upon the perceptual significance of the audio and video information. For a video teleconferencing system Zhou suggests that the perceptual quality can be improved by allocating more of the video bits to encode the facial region of the person than the remainder of the scene. In addition, Zhou suggests that the mouth area, including the lips, jaw, and cheeks, should be allocated more video bits than the remainder of the face because of the motion of these portions. In order to encode the face and mouth areas more accurately Zhou uses a subroutine that incorporates manual initialization of the position of each speaker within a video screen. Unfortunately, the manual identification of the facial region is unacceptable for automated systems.
Kosemura et al., U.S. Pat. No. 5,187,574, disclose a system for automatically adjusting the field of view of a television door phone in order to keep the head of a person centered in the image frame. The detection system relies on detecting the top of the person's head by comparing corresponding pixels in successive images. The number of pixels is counted along a horizontal line to determine the location of the head. However, such a head detection technique is not robust.
Sexton, U.S. Pat. No. 5,086,480, discloses a video image processing system in which an encoder identifies the head of a person from a head-against-a-background scene. The system uses training sequences and fits a minimum rectangle to the candidate pixels. The underlying identification technique uses vector quantization. Unfortunately, the training sequences require the use of an anticipated image which will be matched to the actual image; if the actual image in the scene does not sufficiently match any of the training sequences, then the head will not be detected.
Lambert, U.S. Pat. No. 5,012,522, discloses a system for locating and identifying human faces in video scenes. A face finder module searches for facial characteristics, referred to as signatures, using a template. In particular, the signatures searched for are the eye and nose/mouth. Unfortunately, such a template based technique is not robust to occlusions, profile changes, and variations in the facial characteristics.
Ueno et al., U.S. Pat. No. 4,951,140, disclose a facial region detecting circuit that detects a face based on the difference between two frames of a video using a histogram based technique. The system allocates more bits to the facial region than to the remaining region. However, such a histogram based technique may not necessarily detect the face in the presence of significant motion.
Moghaddam et al., in a paper entitled “An Automatic System for Model-Based Coding of Faces,” IEEE Data Compression Conference, March 1995, disclose a system for two-dimensional image encoding of human faces. The system uses eigen-templates for template matching, which is computationally intensive.
Eleftheriadis et al., in a paper entitled “Automatic Face Location Detection and Tracking for Model-Assisted Coding of Video Teleconferencing Sequences at Low Bit-Rates,” Signal Processing: Image Communication 7 (1995), disclose a model-assisted coding technique which exploits the face location information of video sequences to selectively encode regions of the video to produce coded sequences in which the facial regions are clearer and sharper. In particular, the system initially differences two frames of a video to detect motion. Then the system attempts to locate the top of the head of a person by searching for a sequential series of non-zero horizontal pixels in the difference image, as shown in FIG. 11 of Eleftheriadis et al. A set of ellipses with various sizes and aspect ratios having their uppermost portion fixed at the potential location of the top of the head are fitted to the image data. Unfortunately, scanning the difference image for potential sequences of non-zero pixels is complex and time consuming. In addition, the system taught by Eleftheriadis et al. includes many design parameters that need to be selected for each particular system and video sequence, making it difficult to adapt the system for different types of video sequences and systems.
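A rough, hypothetical sketch of such a head-top search is given below; it is not Eleftheriadis et al.'s actual algorithm, and the threshold and minimum run length are invented design parameters of exactly the kind that make such systems hard to tune:

```python
def find_head_top(prev, curr, thresh=15, min_run=10):
    """Scan the frame difference top-down for the first row containing
    a long run of changed pixels; return (row, run-centre column), or
    None if no such row exists. thresh and min_run are invented
    parameters for illustration only."""
    height, width = len(curr), len(curr[0])
    for y in range(height):
        run = best = best_x = 0
        for x in range(width):
            changed = abs(curr[y][x] - prev[y][x]) > thresh
            run = run + 1 if changed else 0
            if run > best:
                best, best_x = run, x - run + 1
        if best >= min_run:
            return y, best_x + best // 2
    return None
```

Even this toy version visits every pixel of the difference image, illustrating why the scan is time consuming.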
Glenn, in a chapter entitled “Real-Time Display Systems, Present and Future,” from the book Visual Science Engineering, edited by O.H. Kelly, 1994, teaches a display system that varies the resolution of the image from the center to the edge, in the hope that the decrease in resolution would lead to a bandwidth reduction. The resolution decrease is accomplished by discarding pixel information to blur the image. The presumption in Glenn is that the observer is looking at the center of the display. The attempt was unsuccessful because although it was found that the observer's eyes tended to stay in the center one-quarter of the total image area, the resolution at the edges of the image could not be sufficiently reduced before the resulting blur was detectable.
Browder et al., in a paper entitled “Eye-Slaved Area-Of-Interest Display Systems: Demonstrated Feasible In The Laboratory,” process video sequences using gaze-contingent techniques. The gaze-contingent processing is implemented by adaptively varying image quality within each video field, such that image quality is maximal in the region most likely to be viewed while being reduced in the periphery. This image quality reduction is accomplished by blurring the image or by introducing quantization artifacts. The system includes an eye tracker with a computer graphic flight simulator. Two image sequences are created. One sequence has a narrow field of view (19 or 25 degrees) with high resolution and the other sequence has a wide field of view (76 or 140 degrees) with low resolution. The two image sequences are combined optically with the high resolution sequence slaved to the visual system's instantaneous center of gaze. To keep the boundary between the two regions from being distracting, an arbitrary linear roll-off (blending) from the high resolution inset image to the low resolution image is used. The use of an eye tracker in the system is unsuitable for inexpensive video telephones where such an eye tracker is not provided. In addition, the linear roll-off does not match the eye's sensitivity variation, resulting in either variable image quality, or unnecessary regions of high resolution.
Stelmach et al., in a paper entitled “Processing Image Sequences Based On Eye Movements,” disclose a video encoding system that employs the concept of varying the visual sensitivity as a function of expected eye position. The expected eye position is generated by measuring a set of observers' eye movements to specific video sequences. Then the averaged eye movements are calculated for the set of observers. However, such a system requires measurements of the eye position which may not be available for inexpensive teleconferencing systems. In addition, it is difficult, if not impossible, to extend the system to an unknown image sequence thus requiring observer measurements for any image sequence the system is going to encode. Moreover, variation of the resolution is not an efficient technique for bandwidth reduction.
What is desired, therefore, is a video encoding system that automatically locates facial regions within the video and encodes the video in a manner that provides a uniform quality of the video to a viewer.