Matting and compositing are frequently used in image and video editing, 3D photography, and film production. Matting separates a foreground region from an input image by estimating a color F and an opacity α for each pixel in the image. Compositing uses the matte to blend the extracted foreground with a novel background to produce an output image representing a novel scene. The opacity α measures a ‘coverage’ of the foreground region due to either partial spatial coverage or partial temporal coverage, i.e., motion blur. The set of all opacity values α is called the alpha matte, the alpha channel, or simply the ‘matte’.
The matting problem can be formulated as follows: An image of a foreground against an opaque black background in a scene is αF. An image of the background without the foreground is B. An alpha image or matte, where each pixel represents a partial coverage of that pixel by the foreground, is α. The image α is essentially an image of the foreground object ‘painted’ white, evenly lit, and held against the opaque background. The scale and resolution of the foreground and background images can differ due to perspective foreshortening.
The notions of an alpha matte, pre-multiplied alpha, and the algebra of composition have been formalized by Porter et al., “Compositing digital images,” in Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press, pp. 253-259, 1984. They showed that for a camera, the image αF in front of the background image B can be expressed by a linear interpolation: I = αF + (1 − α)B, where I is an image, αF is the pre-multiplied image of the foreground against an opaque background, and B is the image of the opaque background in the absence of the foreground.
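The interpolation above can be illustrated with a minimal sketch for a single color channel. The function name `composite` and the sample values are assumptions chosen for illustration; images are represented as nested lists of floats in [0, 1].

```python
def composite(alpha, premult_fg, background):
    """Apply I = alphaF + (1 - alpha) * B per pixel for one color channel.

    `premult_fg` already holds the pre-multiplied foreground (alpha * F),
    so it is added directly rather than being scaled by alpha again.
    """
    return [
        [af + (1.0 - a) * b for a, af, b in zip(row_a, row_af, row_b)]
        for row_a, row_af, row_b in zip(alpha, premult_fg, background)
    ]


# One row of three pixels: transparent, half covered, fully covered.
alpha = [[0.0, 0.5, 1.0]]
foreground = [[0.8, 0.8, 0.8]]
premult_fg = [[a * f for a, f in zip(alpha[0], foreground[0])]]
background = [[0.2, 0.2, 0.2]]

image = composite(alpha, premult_fg, background)
```

Where α is 0 the result is the background value, where α is 1 it is the foreground value, and in between the two are blended in proportion to the coverage.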
Matting is described generally by Smith et al., “Blue screen matting,” Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, ACM Press, pp. 259-268, 1996, and U.S. Pat. No. 4,100,569, “Comprehensive electronic compositing system,” issued to Vlahos on July 11, 1978.
Conventional matting, referred to as blue screen matting, requires a background with a known, constant color. If a digital camera is used, then a green background is preferred. Blue screen matting is the predominant technique in the film and broadcast industry. For example, broadcast studios use blue screen matting for presenting weather reports. The background is a blue screen, and the foreground region includes the presenter standing in front of the blue screen. The foreground is extracted, and then superimposed onto a weather map so that it appears that the presenter is actually standing in front of a map. However, blue screen matting is costly and not readily available to casual users. Even production studios would prefer a lower-cost and less intrusive alternative.
Ideally, one would like to extract a high-quality matte from an image or video with an arbitrary, i.e., unknown, background. This process is known as natural image matting. Recently, there has been substantial progress in this area; see Ruzon et al., “Alpha estimation in natural images,” CVPR, vol. 1, pp. 18-25, 2000; Hillman et al., “Alpha channel estimation in high resolution images and image sequences,” Proceedings of IEEE CVPR 2001, IEEE Computer Society, vol. 1, pp. 1063-1068, 2001; Chuang et al., “A Bayesian approach to digital matting,” Proceedings of IEEE CVPR 2001, IEEE Computer Society, vol. 2, pp. 264-271, 2001; Chuang et al., “Video matting of complex scenes,” ACM Trans. on Graphics 21, 3, pp. 243-248, July, 2002; and Sun et al., “Poisson matting,” ACM Trans. on Graphics, August 2004. The Poisson matting of Sun et al. solves a Poisson equation for the matte by assuming that the foreground and background are slowly varying. Their method relies on close user interaction, beginning from a manually constructed trimap. They also provide ‘painting’ tools to correct errors in the matte.
Unfortunately, all of those methods require substantial manual intervention, which becomes prohibitive for long image sequences and for non-professional users. The difficulty arises because matting from a single image is fundamentally under-constrained.
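The under-constrained nature of the problem can be seen by counting: each pixel of a color image supplies three measurements, while α, F, and B together contribute seven unknowns. The following minimal sketch, using hypothetical single-channel values, shows several different combinations of α, F, and B that produce exactly the same observed value, so the observation alone cannot distinguish among them.

```python
def observe(alpha, fg, bg):
    """Observed value under the compositing equation I = alpha*F + (1 - alpha)*B."""
    return alpha * fg + (1.0 - alpha) * bg


# Hypothetical (alpha, F, B) triples for one pixel and one color channel.
candidates = [
    (1.0, 0.5, 0.0),  # fully opaque mid-gray foreground
    (0.0, 0.9, 0.5),  # fully transparent, only the background is seen
    (0.5, 1.0, 0.0),  # white foreground half covering a black background
]

observed = [observe(a, f, b) for a, f, b in candidates]
```

All three explanations yield the same value of 0.5, which is why additional constraints, such as a trimap or extra views, are needed.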
It is desired to perform matting using non-intrusive techniques. That is, the scene does not need to be modified. It is also desired to perform the matting automatically. Furthermore, it is desired to provide matting for ‘rich’ natural images, i.e., images with a lot of fine, detailed structure.
Most natural image matting methods require manually defined trimaps to determine the distribution of color in the foreground and background regions. A trimap segments an image into background, foreground and unknown pixels. Using the trimaps, those methods estimate likely values of the foreground and background colors of unknown pixels, and use the colors to solve the matting equation.
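The role of a trimap can be sketched as follows. This is a deliberately simplified stand-in, not any of the cited methods: it assumes a single global foreground color and a single global background color, taken as the means over the known regions, and then solves the matting equation for α at each unknown pixel by projecting onto the line between those two colors. The labels and function names are hypothetical.

```python
FG, BG, UNKNOWN = "F", "B", "U"  # hypothetical trimap labels


def mean_color(pixels):
    """Average RGB color of a non-empty list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def solve_alpha(pixel, fg, bg):
    """Least-squares alpha for I = alpha*F + (1 - alpha)*B with F and B fixed."""
    diff = [f - b for f, b in zip(fg, bg)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.0  # F and B are identical, so alpha is not recoverable
    num = sum((i - b) * d for i, b, d in zip(pixel, bg, diff))
    return min(1.0, max(0.0, num / denom))


def matte_from_trimap(image, trimap):
    """Return an alpha matte; known pixels get 1 or 0, unknown ones are solved."""
    flat = [(p, t) for row_p, row_t in zip(image, trimap)
            for p, t in zip(row_p, row_t)]
    fg = mean_color([p for p, t in flat if t == FG])
    bg = mean_color([p for p, t in flat if t == BG])
    known = {FG: 1.0, BG: 0.0}
    return [
        [known[t] if t != UNKNOWN else solve_alpha(p, fg, bg)
         for p, t in zip(row_p, row_t)]
        for row_p, row_t in zip(image, trimap)
    ]


# One row: red foreground, an even red/blue mix, blue background.
image = [[(1.0, 0.0, 0.0), (0.5, 0.0, 0.5), (0.0, 0.0, 1.0)]]
trimap = [[FG, UNKNOWN, BG]]
matte = matte_from_trimap(image, trimap)
```

Practical methods estimate F and B locally for each unknown pixel rather than globally, which is where most of their complexity lies.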
Bayesian matting techniques, and their extension to image sequences, produce the best results in many applications. However, those methods require manually defined trimaps for key frames. This is tedious for a long image sequence. It is desired to provide a method that does not require user intervention, and that can operate in real-time as an image sequence is acquired.
Another matting system is described by Zitnick et al., “High-quality video view interpolation using a layered representation,” ACM Trans. on Graphics 23, 3, pp. 600-608, 2004. They acquire videos with a horizontal row of eight cameras spaced over about two meters. They measure depth discrepancies from stereo disparity using sophisticated region processing, and then construct a trimap from the depth discontinuities. The actual matting is determined by the Bayesian matting of Chuang et al. Their system is not real-time. The system requires off-line processing to determine both the depth and the alpha mattes.
It is desired to extract a matte without recovering the scene 3D structure so that mattes for complex, natural scenes can be extracted.
Difference matting, also known as background subtraction, solves for α and the alpha-multiplied foreground, αF, given background and trimap images, Qian et al., “Video background replacement without a blue screen,” Proceedings of ICIP, vol. 4, pp. 143-146, 1999. However, difference matting has limited discrimination at the borders of the foreground.
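A minimal sketch of the idea behind difference matting follows. It is an assumption-laden simplification rather than the method of Qian et al.: α is taken as a clamped linear ramp of the color distance between the observed image and the known background, with hypothetical thresholds `low` and `high`, and the trimap is omitted. The pre-multiplied foreground then follows from rearranging the compositing equation as αF = I − (1 − α)B.

```python
def difference_matte(image, background, low=0.05, high=0.25):
    """Estimate alpha and alpha*F from an image and a known background.

    `low` and `high` are hypothetical distance thresholds: at or below
    `low` a pixel is treated as background, at or above `high` as foreground.
    """
    alpha_rows, premult_rows = [], []
    for row_i, row_b in zip(image, background):
        alpha_row, premult_row = [], []
        for pix, bg in zip(row_i, row_b):
            # Euclidean color distance between observation and background.
            dist = sum((i - b) ** 2 for i, b in zip(pix, bg)) ** 0.5
            a = min(1.0, max(0.0, (dist - low) / (high - low)))
            alpha_row.append(a)
            # Rearranged compositing equation: alpha*F = I - (1 - alpha)*B.
            premult_row.append(tuple(i - (1.0 - a) * b for i, b in zip(pix, bg)))
        alpha_rows.append(alpha_row)
        premult_rows.append(premult_row)
    return alpha_rows, premult_rows


background = [[(0.2, 0.2, 0.2)] * 3]
image = [[(0.2, 0.2, 0.2), (0.35, 0.2, 0.2), (0.9, 0.9, 0.9)]]
alpha, premult_fg = difference_matte(image, background)
```

Pixels whose color is close to the background receive a low α even if they belong to the foreground, which illustrates the limited discrimination at foreground borders.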
Another method uses back lighting to determine the matte. Back lighting is a common segmentation method used in many computer vision systems. Back lighting has also been used in image-based rendering systems, Debevec et al., “A lighting reproduction approach to live action compositing,” ACM Transactions on Graphics 21, 3, pp. 547-556, 2002. That method has two drawbacks. First, active illumination is required, and second, incorrect results may be produced near object boundaries because some objects become highly reflective near grazing angles of the light.
Scene reconstruction is described by Favaro et al., “Seeing beyond occlusions (and other marvels of a finite lens aperture),” Proc. of the IEEE Intl. Conf. on Computer Vision and Pattern Recognition, p. 579, 2003. That method uses defocused images and gradient descent minimization of a sum-squared error. The method solves for coarse depth and a binary alpha.
Another method uses a depth-from-focus system to recover overlapping objects with fractional alphas, Schechner et al., “Separation of transparent layers using focus,” International Journal of Computer Vision, pp. 25-39, 2000. They position a motorized CCD axially behind a lens to acquire images with slightly varying points of focus. Depth is recovered by selecting the image plane location that has the best focused image. That method is limited to static scenes.
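The selection step of a depth-from-focus approach can be sketched as below. The focus measure used here, the magnitude of a discrete Laplacian, is a common choice and an assumption of this sketch; it is not taken from Schechner et al. The focal stack is a hypothetical list of grayscale images stored as nested lists, one per sensor position.

```python
def laplacian_magnitude(img, y, x):
    """Absolute 4-neighbor Laplacian at (y, x), with edge pixels replicated."""
    h, w = len(img), len(img[0])

    def at(r, c):
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    return abs(4.0 * at(y, x) - at(y - 1, x) - at(y + 1, x)
               - at(y, x - 1) - at(y, x + 1))


def best_focus_index(stack):
    """For each pixel, return the index of the sharpest image in the stack."""
    h, w = len(stack[0]), len(stack[0][0])
    return [
        [max(range(len(stack)),
             key=lambda k: laplacian_magnitude(stack[k], y, x))
         for x in range(w)]
        for y in range(h)
    ]


# Two 3x3 images of the same point: one sharp, one defocused.
sharp = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
blurred = [[0.25, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.25]]
depth_index = best_focus_index([blurred, sharp])
```

The per-pixel index serves as a coarse depth label, since each image in the stack corresponds to a known sensor position.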
Another method extracts mattes for scenes with unconstrained, dynamic backgrounds using three video streams, acquired by three cameras that share the same center of projection but differ in depth of field and focal plane, McGuire et al., “Defocus Video Matting,” ACM Transactions on Graphics 24, 3, 2005, and U.S. patent application Ser. No. 11/092,376, filed by McGuire et al. on Mar. 29, 2005, “System and Method for Image Matting.” However, their method takes a few minutes per frame.