1. Field of the Invention
The invention is generally related to digital image processing, and, in particular, is related to extracting objects from a first sequence of frames and using the extracted objects to create a new sequence of frames.
2. Description of Related Art
Video cameras are becoming more popular as they become more widely available at lower prices. A video camera records sequential images within “frames.” A frame is a representation of an image at an instant of time. Typically, each frame represents the image at a different instant in time. When several frames are recorded at sequential instants in time and are shown in quick succession, the human eye perceives motion in the video sequence (i.e., the sequence of frames). For example, video (i.e., moving pictures) normally contains motion, including object motion, such as a bird flying, and camera motion, such as camera panning, zooming, and tilting.
Video object segmentation refers to identifying objects, which may be moving, through frames of a video sequence. Efficient video object segmentation is an important topic for video processing. General video object segmentation is considered to be an ill-posed problem. An ill-posed problem is one that, theoretically, cannot be solved. For example, a two-dimensional photograph does not contain three-dimensional information (i.e., the three-dimensional information is lost when a two-dimensional photograph is taken of a three-dimensional scene). Therefore, converting a two-dimensional photograph to a three-dimensional image is considered to be an ill-posed problem. Likewise, segmenting objects in frames of a video sequence is considered to be an ill-posed problem. In particular, the object of interest may be different for the same scene depending on the user or the application. Automatic segmentation is therefore a problem without a general solution.
Real-time automatic segmentation techniques for specific applications, however, have been attempted. These techniques are directed to applications such as video surveillance, traffic monitoring, and video-conferencing. Video sequences in these applications are typically taken from fixed cameras in poor environments and are in need of robust object segmentation techniques.
For more information on video surveillance applications, see Ismail Haritaoglu, David Harwood, and Larry S. Davis, “W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People,” presented at Proceedings of the Third International Conference on Automatic Face and Gesture Recognition (FG'98), pp. 1–6, April 1998, and Paul L. Rosin, “Thresholding for Change Detection,” presented at Proceedings of International Conference on Computer Vision, pp. 1–6, 1998, each of which is entirely incorporated by reference herein.
For more information on traffic monitoring applications, see Nir Friedman and Stuart Russell, “Image Segmentation in Video Sequences: A Probabilistic Approach,” presented at Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 1–13, 1997, which is entirely incorporated by reference herein.
A typical approach to extracting objects from a static background scene is to subtract the background scene and label what remains as objects. The main idea of background subtraction is to compare a current frame with a reference background scene according to certain criteria. The background scene may be obtained, for example, by capturing a static scene with a video camera. Elements that differ from the background scene are designated as foreground objects.
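By way of illustration only, the comparison of a current frame against a reference background scene can be sketched as follows. The sketch assumes frames stored as nested lists of RGB tuples, a Euclidean color distance as the comparison criterion, and a hypothetical threshold of 30; none of these choices is taken from the cited references.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel in 0..255
Frame = List[List[Pixel]]      # rows of pixels


def color_distance(a: Pixel, b: Pixel) -> float:
    """Euclidean distance between two RGB pixels (an assumed criterion)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def subtract_background(frame: Frame, background: Frame,
                        threshold: float = 30.0) -> List[List[bool]]:
    """Return a mask that is True where a pixel differs from the background.

    The threshold value is hypothetical and would need tuning in practice.
    """
    return [
        [color_distance(p, q) > threshold for p, q in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# A 2x2 static background and a current frame with one changed pixel.
background = [[(100, 120, 90), (100, 120, 90)],
              [(100, 120, 90), (100, 120, 90)]]
frame = [[(100, 121, 91), (200, 30, 40)],
         [(99, 120, 90), (100, 120, 90)]]

mask = subtract_background(frame, background)
```

In this sketch, small deviations such as sensor noise stay below the threshold and remain background, while the pixel that changes substantially is designated as foreground.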
A number of systems performing video object segmentation use background subtraction techniques. However, many background subtraction techniques are sensitive to shadows or illumination changes, and so they are unable to accurately extract objects. Shadows occur when a light source on a scene is obstructed. Illumination refers to the amount of source light on a scene. Both shadow and illumination may change frame to frame due to, for example, movement of objects. Thus, these background subtraction techniques may incorrectly classify a pixel that is in the background in a first frame, but in shadow in a second frame, as a foreground pixel.
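The misclassification described above can be illustrated with a minimal sketch. It assumes a simple thresholded RGB distance as the background test and models a shadow, hypothetically, as a uniform darkening of all three channels by a factor of 0.6; real shadows are not necessarily this regular.

```python
def is_foreground(pixel, background_pixel, threshold=30.0):
    """Assumed background test: thresholded Euclidean RGB distance."""
    distance = sum((p - b) ** 2 for p, b in zip(pixel, background_pixel)) ** 0.5
    return distance > threshold


# A background pixel as seen in a first frame.
background_pixel = (180, 160, 140)

# The same surface in a second frame, now in shadow. The factor 0.6 is a
# hypothetical value chosen only for illustration.
shadow_factor = 0.6
shadowed_pixel = tuple(round(c * shadow_factor) for c in background_pixel)

unchanged_result = is_foreground(background_pixel, background_pixel)
shadowed_result = is_foreground(shadowed_pixel, background_pixel)
```

Here the shadowed pixel still belongs to the background, yet the darkening alone moves it far enough in RGB space that the test labels it as foreground.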
For more information on background subtraction, see N. Friedman and S. Russell, “Image Segmentation in Video Sequences: A Probabilistic Approach,” Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1997; Thanarat Horprasert, David Harwood, and Larry S. Davis, “A Statistical Approach for Real-time Robust Background Subtraction and Shadow Detection,” presented at Proc. IEEE ICCV'99 FRAME-RATE Workshop, Kerkyra, Greece, pp. 1–19, September 1999 (hereinafter referred to as “Horprasert”); Alexandre R. J. Francois and Gerard G. Medioni, “Adaptive Color Background Modeling for Real-Time Segmentation of Video Streams,” presented at Proc. of International Conference on Imaging Science, System, and Technology, pp. 1–6, 1999 (hereinafter referred to as “Francois”); and Ahmed Elgammal, David Harwood, and Larry Davis, “Non-parametric Model for Background Subtraction,” presented at FRAME-RATE: Framerate Applications, Methods and Experiences with Regularly Available Technology and Equipment, Corfu, Greece, pp. 1–17, Sep. 21, 1999; each of which is entirely incorporated by reference herein.
In some cases, objects are segmented in color spaces other than the RGB color space. For example, Wren detects changes in a YUV color space. For further information, see Christopher Wren, Ali Azarbayejani, Trevor Darrell, and Alex Pentland, “PFinder: Real-Time Tracking of the Human Body,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–7, 1997 (hereinafter referred to as “Wren”), which is entirely incorporated herein by reference. The YUV color space is defined by the Commission Internationale de l'Eclairage (CIE), which is an international committee for color standards. The YUV color space may be used, for example, in Phase Alternation Line (PAL) television (an analog television display standard), where the luminance (i.e., a measure of the amount of energy an observer perceives from a light source) and the chrominance (i.e., hue and saturation together) are treated as separate components. In YUV systems, a luminance signal is represented by “Y,” while chrominance signals are represented by “U” and “V.”
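For illustration, the separation of a pixel into one luminance component and two chrominance components can be sketched as below. The weights used are the commonly cited analog YUV coefficients associated with ITU-R BT.601; they are an assumption of this sketch, since neither the text above nor the description of Wren specifies a particular conversion.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to (Y, U, V).

    Assumed weights: Y = 0.299 R + 0.587 G + 0.114 B,
    U = 0.492 (B - Y), V = 0.877 (R - Y).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = 0.492 * (b - y)                     # blue-difference chrominance
    v = 0.877 * (r - y)                     # red-difference chrominance
    return y, u, v


# A neutral gray pixel carries luminance but essentially no chrominance.
gray_y, gray_u, gray_v = rgb_to_yuv(128, 128, 128)

# A pure red pixel has positive V (toward red) and negative U (away from blue).
red_y, red_u, red_v = rgb_to_yuv(255, 0, 0)
```

Under these assumed weights, a gray pixel yields U and V values near zero, which reflects the treatment of luminance and chrominance as separate components.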
Horprasert uses a color model consisting of brightness distortion and chromaticity distortion to represent a color image. Both Wren and Horprasert overcome the sensitivity to shadows and illumination changes to some extent but are not as robust as expected (i.e., they fail to segment objects under some conditions). Francois proposes to perform segmentation in an HSV color space, but because of improper use of hue and saturation components, Francois also runs unsteadily (i.e., sometimes classifies pixels incorrectly). The HSV color space refers to hue, saturation, and value. Hue refers to a pure color as perceived by the human eye (e.g., red, green, or blue). Hue identifies the color and indicates where along the color spectrum the color lies. This value wraps around so that the extreme high value and the extreme low value translate into the same color on the color scale. Saturation refers to the purity of a color, that is, how little white light is mixed in with the hue. Greater values (i.e., less white) in the saturation channel make the color appear stronger, while lower values (i.e., more white) make the color appear washed out. Value refers to how bright the color is. White has the maximum brightness, while black has little or no brightness at all.
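The roles of the three HSV components can be illustrated with a short sketch based on the `colorsys` module of the Python standard library. It assumes an idealized shadow that scales all three RGB channels by the same factor; the pixel values are hypothetical, and the sketch is not the method of Francois.

```python
import colorsys


def to_hsv(pixel):
    """Convert an 8-bit RGB pixel to (hue, saturation, value) in 0..1."""
    r, g, b = (c / 255.0 for c in pixel)
    return colorsys.rgb_to_hsv(r, g, b)


# Hypothetical pixel values: the same surface color, fully lit and at half
# brightness (an idealized, uniform darkening).
lit_pixel = (200, 100, 50)
shadowed_pixel = (100, 50, 25)

lit_h, lit_s, lit_v = to_hsv(lit_pixel)
shadow_h, shadow_s, shadow_v = to_hsv(shadowed_pixel)

# Differences per component between the lit and the shadowed pixel.
hue_change = abs(lit_h - shadow_h)
saturation_change = abs(lit_s - shadow_s)
value_change = abs(lit_v - shadow_v)
```

Under this idealized assumption, only the value component changes appreciably between the two pixels, while hue and saturation stay essentially the same. Real shadows and sensor noise deviate from this model, particularly for dark or nearly gray pixels, where hue and saturation become unstable.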