Efficient and high-quality compositing of images is an important task in the special effects industry. Typically, movie scenes are composited from two different layers, foreground and background, where each layer can be computer-generated or real, and may be filmed at different locations. Often, the foreground content of a source video is used as the foreground layer in a composite video, which requires segmentation of foreground from background in the source video.
The process of segmenting an image into foreground and background is referred to as ‘pulling’ an alpha matte or ‘matting’. The most popular method for pulling alpha mattes is blue-screen matting, in which actors are imaged in front of a blue or green background. The limitation of blue-screen matting is that it can only be used in a studio or a similarly controlled environment and cannot be used in natural indoor or outdoor settings.
Natural video matting refers to pulling alpha mattes from a video acquired in a natural environment. With a single video stream, the problem of matte extraction can be posed as an equation in several unknowns: alpha (α), RGB foreground (F_RGB), and RGB background (B_RGB). The RGB video frame I at each pixel is
I_RGB = αF_RGB + (1 − α)B_RGB.  (1)
With a single image, this problem is highly underconstrained.
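To make the ambiguity concrete, the following minimal sketch evaluates equation (1) for a single pixel. Each pixel supplies three equations (one per color channel) in seven unknowns (α, three foreground channels, and three background channels). The function name `composite_pixel` and the sample colors are illustrative assumptions, not part of any cited method.

```python
def composite_pixel(alpha, fg, bg):
    """Blend one RGB pixel per equation (1): I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Two different (alpha, F, B) triples that explain the same observed color.
# Hypothetical values chosen only to illustrate the ambiguity.
observed_a = composite_pixel(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
observed_b = composite_pixel(1.0, (0.5, 0.0, 0.5), (0.0, 1.0, 0.0))
```

Both calls yield the same observed color, so the observation alone cannot distinguish a half-transparent red object over blue from an opaque purple object over green.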
The first matting methods and systems were described almost fifty years ago. Blue-screen matting was formalized by Smith and Blinn, “Blue screen matting,” Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 259-268, ACM Press, 1996. They showed that imaging a foreground against two different backgrounds gives a robust solution for both the alpha and the foreground color. That method has been extended to work with more complex light transport effects, e.g., refraction. However, those methods require active illumination and acquiring multiple images.
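The two-background idea can be sketched as follows. Writing equation (1) once per background and subtracting cancels the αF term, leaving I1 − I2 = (1 − α)(B1 − B2), which can be solved for α in the least-squares sense over the color channels. This is a simplified per-pixel sketch under the assumption that both backgrounds are known exactly; the function names are hypothetical and it is not presented as the formulation used by Smith and Blinn.

```python
def triangulate_alpha(i1, i2, b1, b2):
    """Recover alpha for one pixel imaged against two known backgrounds.

    Subtracting the two compositing equations cancels alpha * F:
        I1 - I2 = (1 - alpha) * (B1 - B2)
    so (1 - alpha) is the least-squares ratio over the color channels.
    """
    di = [p - q for p, q in zip(i1, i2)]
    db = [p - q for p, q in zip(b1, b2)]
    denom = sum(d * d for d in db)
    if denom == 0.0:
        raise ValueError("the two backgrounds must differ at this pixel")
    one_minus_alpha = sum(x * y for x, y in zip(di, db)) / denom
    # Clamp to the valid range to absorb noise and rounding.
    return min(1.0, max(0.0, 1.0 - one_minus_alpha))


def recover_foreground(i1, b1, alpha):
    """Solve I1 = alpha * F + (1 - alpha) * B1 for F; undefined when alpha is 0."""
    if alpha == 0.0:
        return None
    return tuple((i - (1.0 - alpha) * b) / alpha for i, b in zip(i1, b1))
```

Once α is known, either compositing equation gives the foreground color, except where α is zero and the foreground does not contribute to the pixel.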
Bayesian matting was initially developed for static scenes. It assumes a low-frequency background and a user-specified trimap. Generally, a trimap includes pixels labeled as foreground, pixels labeled as background, and pixels labeled as unknown. Matting requires that the unknown pixels are labeled correctly.
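As an illustration of what a trimap provides, the sketch below represents the three labels and shows that only the unknown pixels remain to be estimated: foreground pixels are fixed at α = 1 and background pixels at α = 0. The `Label` type, the helper names, and the tiny example grid are assumptions made for this sketch; they do not describe the data structures of any cited system.

```python
from enum import Enum


class Label(Enum):
    BACKGROUND = 0
    UNKNOWN = 1
    FOREGROUND = 2


def seed_alpha(trimap):
    """Map labels to initial alpha: 1.0 foreground, 0.0 background, None unknown."""
    fixed = {Label.FOREGROUND: 1.0, Label.BACKGROUND: 0.0, Label.UNKNOWN: None}
    return [[fixed[label] for label in row] for row in trimap]


def unknown_pixels(trimap):
    """List the (row, col) coordinates whose alpha still has to be estimated."""
    return [
        (r, c)
        for r, row in enumerate(trimap)
        for c, label in enumerate(row)
        if label is Label.UNKNOWN
    ]


# Hypothetical 2x3 trimap: background on the left, foreground on the right,
# and a one-pixel-wide unknown band between them.
example = [
    [Label.BACKGROUND, Label.UNKNOWN, Label.FOREGROUND],
    [Label.BACKGROUND, Label.UNKNOWN, Label.FOREGROUND],
]
```

A narrower unknown band leaves fewer pixels to solve for, which is one reason the quality of the trimap matters.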
That method was later extended to videos. However, trimaps still need to be specified manually for key frames. In another extension, a multi-camera system is used to reconstruct 3D scene geometry. High-quality alpha mattes are determined at depth discontinuities.
Poisson matting poses alpha matting as solving Poisson equations of the matte gradient field. It does not work directly on the alpha but on a derived measurement. Conventionally, it works on still images, requires some user intervention, and takes several minutes to process a single frame.
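The gradient-domain idea can be illustrated in one dimension. If the foreground and background are assumed locally constant, differentiating equation (1) gives dI/dx = (F − B) dα/dx, so the matte gradient can be estimated from the image gradient and then integrated from a pixel of known α. The sketch below is a deliberately simplified single-channel, single-scanline version under those assumptions; an actual Poisson solver operates on a 2D gradient field with boundary conditions, and the function name is hypothetical.

```python
def integrate_matte_1d(intensity, fg, bg, alpha_start):
    """Integrate d(alpha)/dx = (dI/dx) / (F - B) along one scanline.

    intensity   -- list of single-channel pixel values along the scanline
    fg, bg      -- assumed constant foreground and background values
    alpha_start -- known alpha at the first pixel (e.g. from a trimap)
    """
    if fg == bg:
        raise ValueError("foreground and background must differ")
    alpha = [alpha_start]
    for prev, cur in zip(intensity, intensity[1:]):
        step = (cur - prev) / (fg - bg)
        # Clamp so accumulated error cannot leave the valid range.
        alpha.append(min(1.0, max(0.0, alpha[-1] + step)))
    return alpha
```

Because the method works on gradients rather than on α itself, errors in the assumed F − B accumulate along the integration path, which is consistent with the need for user intervention noted above.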
Video matting for natural scenes is described by Wang et al., “Interactive video cutout,” ACM Transactions on Graphics, August 2005; and Li et al., “Video object cut and paste,” ACM Transactions on Graphics, August 2005. Wang et al. focus on providing an efficient user interface to achieve the task, while Li et al. use a novel 3D graph cut algorithm, followed by manual refinement to prepare the data for alpha matting.
Another method determines alpha mattes for natural video streams using three video streams that share a common center of projection but vary in depth of field and focal plane, see McGuire et al., “Defocus Video Matting,” ACM Transactions on Graphics, August 2005. While that method is automatic, its running time is many minutes per frame. In addition, the foreground object must be in focus.
Other methods consider bounded reconstruction and graph cuts, see Wexler et al., “Bayesian estimation of layers from multiple images,” Proceedings of the 7th European Conference on Computer Vision (ECCV); and Kolmogorov et al., “Bi-layer segmentation of binocular stereo video,” Proceedings of CVPR05, 2005. Wexler et al. pose the problem in a Bayesian framework and consider several different priors including bounded reconstruction, α-distribution and spatial consistency. They do not describe real-time aspects of their system. Kolmogorov et al., on the other hand, do not focus on alpha matting but rather describe a real-time system that uses graph cuts on a stereo video to perform the foreground and background segmentation.
Camera arrays have been used for a wide variety of applications in computer graphics and computer vision, see generally, Wilburn et al., “High performance imaging using large camera arrays,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 765-776, 2005.