In streaming video systems such as video conferencing, the video image can be regarded as a composition of a background image and a foreground image, wherein the background image consists of stationary objects and the foreground image consists of moving objects. In video conferencing in particular, the foreground image refers to the people in the conference, and the background image refers to the video image that the camera would capture if no people were in front of it.
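By way of illustration only, the composition described above can be sketched as a per-pixel selection between two images. The sketch assumes small grayscale images stored as nested lists and a boolean mask marking foreground pixels; the names `compose`, `background`, `foreground`, and `mask` are hypothetical and are not drawn from any particular system.

```python
from typing import List

Image = List[List[int]]   # grayscale pixel values, row-major (assumed layout)
Mask = List[List[bool]]   # True where a pixel belongs to the foreground


def compose(background: Image, foreground: Image, mask: Mask) -> Image:
    """Build a frame that takes masked pixels from the foreground
    and all remaining pixels from the background."""
    return [
        [f if m else b for b, f, m in zip(b_row, f_row, m_row)]
        for b_row, f_row, m_row in zip(background, foreground, mask)
    ]


# A 2x3 example: a stationary background and one column of foreground pixels.
background = [[10, 10, 10], [10, 10, 10]]
foreground = [[0, 200, 0], [0, 210, 0]]
mask = [[False, True, False], [False, True, False]]

frame = compose(background, foreground, mask)
```

In practice neither the mask nor the background is given in advance. Both must be estimated from the captured frames, which is the difficulty the remainder of this section describes.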
Identifying or registering the background image and detecting the faces of persons in a video stream are important for making appropriate intelligent coding decisions. For example, portions of a video image that correspond to a person's face are encoded differently from the background image. As another example, the background image may be used as a reference picture for encoding portions of the image that are uncovered as the foreground moves.
Many methods for face detection are known. However, these methods generally require knowledge of the entire picture at the time of processing, which is problematic for a video conference system. In video conference systems, low latency is desired and a distributed architecture is often used to process a video image. In a distributed coding architecture, different parts of an image may be processed simultaneously in different distributed elements, so that the entire image is not available to any single element. Additionally, because video conference systems generally transmit live video, the face detection architecture needs to identify facial regions of an image at the video rate.
Designers have attempted to resolve these issues with limited success. More particularly, methods based solely on the texture and color information of the video image have been found ineffective when the background is complex. Methods that incorporate temporal information of the video image still fail to perform reliably across the various possible motions of the foreground image.
Therefore, what is desired are systems and methods that overcome the challenges found in the art, including a method for constructing the background image and for detecting and tracking the face and torso portions of a video image with a complex background at the video rate.