Image matting is the process of extracting an object from an image with some human guidance. Image matting may be an interactive process which relies on limited user input, usually in the form of a few scribbles, to mark foreground and background regions. Henceforth, “foreground” refers to the object to be extracted, whereas “background” refers to everything else in the image.
Video matting is an extension of image matting wherein the goal is to extract a moving object from a video sequence. Video matting can also be used in video processing devices (including video encoders). For instance, automatic matte extraction can be used to identify a particular region in a video scene (e.g. sky area), and then apply a given processing step only to that region (e.g. de-banding or false contour removal). Matte extraction can also be used to guide object detection and object tracking algorithms. For instance, a matte extraction technique could be used to detect the grass area in a soccer video (i.e. the playfield), which could then be used to constrain the search range in a ball tracking algorithm.
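As a rough illustration of how a matte might constrain a tracker's search range, the sketch below derives a rectangular search window from a binary matte. It assumes the matte is already available as a 2-D mask; the function name `search_window_from_matte` and the `margin` parameter are hypothetical and chosen only for this example.

```python
import numpy as np

def search_window_from_matte(matte, margin=0):
    """Return (top, bottom, left, right) bounding the nonzero matte pixels.

    `matte` is a 2-D boolean (or 0/1) array, e.g. a playfield mask.
    `margin` pads the box by that many pixels, clipped to the image.
    `bottom` and `right` are exclusive. Returns None for an empty matte.
    """
    matte = np.asarray(matte)
    rows = np.flatnonzero(matte.any(axis=1))  # rows containing matte pixels
    cols = np.flatnonzero(matte.any(axis=0))  # columns containing matte pixels
    if rows.size == 0:
        return None
    height, width = matte.shape
    top = max(int(rows[0]) - margin, 0)
    bottom = min(int(rows[-1]) + 1 + margin, height)
    left = max(int(cols[0]) - margin, 0)
    right = min(int(cols[-1]) + 1 + margin, width)
    return top, bottom, left, right
```

A tracker could then restrict its candidate positions to `frame[top:bottom, left:right]`. A bounding box is only the simplest choice; testing candidates against the matte itself would give a tighter constraint.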
In moviemaking and television, mattes have been used to composite foreground (e.g. actors) and background (e.g. landscape) images into a final image. The chroma keying (blue screen) technique is a widely used method for matting actors into a novel background. Many of the traditional techniques rely on a controlled environment during the image capture process. With digital images, however, it becomes possible to directly manipulate pixels, and thus matte out foreground objects from existing images with some human guidance. Digital image matting is used in many image and video editing applications for extracting foreground objects and possibly for compositing several objects into a final image.
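Digital compositing of this kind is commonly written as a per-pixel blend, I = alpha * F + (1 - alpha) * B, where alpha is the matte value in [0, 1]. The following is a minimal sketch of that blend, assuming the images are NumPy arrays of equal size; the function name `composite` and its argument names are illustrative only.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend per pixel: alpha * foreground + (1 - alpha) * background.

    `alpha` is a matte with values in [0, 1]; 1 keeps the foreground,
    0 keeps the background, and fractional values mix the two.
    """
    fg = np.asarray(foreground, dtype=np.float64)
    bg = np.asarray(background, dtype=np.float64)
    a = np.asarray(alpha, dtype=np.float64)
    if fg.ndim == 3 and a.ndim == 2:
        a = a[..., np.newaxis]  # broadcast one matte over all color channels
    return a * fg + (1.0 - a) * bg
```

Fractional alpha values matter mainly along object boundaries (hair, motion blur), where a pixel receives light from both foreground and background.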
As mentioned, image matting is usually an interactive process in which the user provides some input such as marking the foreground and possibly the background regions. The simpler the markings are, the more user-friendly the process is. Among the easier-to-use interfaces are those in which the user places a few scribbles with a digital brush marking the foreground and background regions (see FIG. 2A). An image matting process then determines the boundary of the foreground object using the image information along with the user input.
In several image matting methods, the user provides a rough, usually hand-drawn, segmentation called a trimap, wherein each pixel is labeled as a foreground, background, or unknown pixel. (See U.S. Pat. No. 6,135,345 to Berman et al., “Comprehensive method for removing from an image the background surrounding a selected object”; and Y. Y. Chuang et al., “A Bayesian approach to digital matting,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2001.) Other methods allow a more user-friendly scribble-based interaction in which the user places a few scribbles with a digital brush marking the foreground and background regions. (See J. Wang et al., “An iterative optimization approach for unified image segmentation and matting,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2005; and A. Levin et al., “A closed-form solution to natural image matting,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228-242, February 2008.)
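Both kinds of user input can be stored as the same three-valued label map: a trimap typically leaves only a band near the object boundary unknown, whereas scribbles leave most pixels unknown. The sketch below builds such a map from two scribble masks. The label values 0, 128, and 255 and the name `label_map_from_scribbles` are assumptions made for illustration and are not taken from the cited methods.

```python
import numpy as np

# Illustrative label values; any three distinct values would do.
BACKGROUND, UNKNOWN, FOREGROUND = 0, 128, 255

def label_map_from_scribbles(fg_scribble, bg_scribble):
    """Build a per-pixel label map from foreground/background scribble masks.

    Pixels under a foreground scribble get FOREGROUND, pixels under a
    background scribble get BACKGROUND, and all other pixels stay UNKNOWN
    for the matting process to resolve.
    """
    fg = np.asarray(fg_scribble, dtype=bool)
    bg = np.asarray(bg_scribble, dtype=bool)
    if fg.shape != bg.shape:
        raise ValueError("scribble masks must have the same shape")
    if np.any(fg & bg):
        raise ValueError("a pixel cannot be marked as both foreground and background")
    labels = np.full(fg.shape, UNKNOWN, dtype=np.uint8)
    labels[fg] = FOREGROUND
    labels[bg] = BACKGROUND
    return labels
```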
In all the above methods, the user input is provided for matting out the foreground from a single image. Video matting is a harder problem as it may involve a moving foreground object. In this case, the user input for one frame may not be accurate for subsequent frames. Moreover, it is labor-intensive to require the user to provide input for each frame in the video.
In the video matting method proposed by Chuang et al., a trimap is provided for each of several keyframes in the video, and the trimaps are interpolated to other frames using forward and backward optical flow. (Y. Y. Chuang et al., “Video matting of complex scenes,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 243-248, 2002.) Optical flow-based interpolation, however, is time-consuming, noise-sensitive, and unreliable, even for moderate motion levels. Furthermore, optical flow-based interpolation of user-provided scribbles results in the scribbles breaking up over time. Apostoloff et al. describe a method in which trimaps are implicitly propagated from frame to frame by imposing spatiotemporal consistency at edges. (N. E. Apostoloff et al., “Bayesian video matting using learnt image priors,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2004.) The complexity of this method, however, can be substantial due to the enforcement of spatiotemporal edge consistency between the original image and the alpha mattes.