Matting studies how to extract foreground objects with per-pixel transparency information from still images or video sequences. Generally speaking, it tries to solve the following ill-posed problem. Given a color image I, which contains both foreground and background objects, calculate the matte α, foreground color F, and background color B, so that the following alpha compositing equation is satisfied:
I − B = α(F − B)  (1)
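As a minimal sketch of equation (1), the following Python fragment composites a single RGB pixel and then recovers α in the special case where F and B are already known. The function names `composite` and `alpha_from_known_colors` and the sample colors are illustrative assumptions, not part of any cited work. In a real image F and B are unknown, so each pixel supplies three equations in seven unknowns, which is why the problem is ill-posed.

```python
def composite(alpha, fg, bg):
    """Apply equation (1) per channel: I = B + alpha * (F - B)."""
    return tuple(b + alpha * (f - b) for f, b in zip(fg, bg))


def alpha_from_known_colors(pixel, fg, bg):
    """Least-squares alpha when F and B are known (hypothetical helper).

    Projects (I - B) onto (F - B); alpha is undefined when F equals B.
    """
    diff = [f - b for f, b in zip(fg, bg)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("F and B are identical; alpha is not recoverable")
    num = sum((i - b) * d for i, b, d in zip(pixel, bg, diff))
    return num / denom


fg = (200.0, 120.0, 40.0)   # assumed foreground color F
bg = (20.0, 60.0, 180.0)    # assumed background color B
pixel = composite(0.25, fg, bg)
recovered = alpha_from_known_colors(pixel, fg, bg)
```

Without the known values of `fg` and `bg`, infinitely many (α, F, B) triples reproduce the same `pixel`, so practical matting methods must add priors or user input such as a trimap.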
A variety of techniques have been developed for still image matting, and several algorithms have also been proposed for handling video sequences. However, due to the high computational cost involved, these methods are not practical for real-time applications. To date, high quality real-time video matting for dynamic scenes can only be achieved under studio settings using specially designed optical devices and polarized lighting conditions, as shown in M. McGuire, W. Matusik, and W. Yerazunis, “Practical, Real-time Studio Matting using Dual Imagers,” Proc. Eurographics Symposium on Rendering, 2006 (McGuire et al.). There remains a need for a real-time video matting system and method based on color information only for high quality video composition.
Other existing techniques for separating foreground objects from live videos use bilayer segmentation, as shown in A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov, “Bilayer Segmentation of Live Video,” Proc. CVPR, 2007 (Criminisi et al.) and J. Sun, W. Zhang, X. Tang, and H.-Y. Shum, “Background Cut,” Proc. ECCV, pp. 628-641, 2006 (Sun et al. 2006). Using just color information, these algorithms can extract the moving foreground object in real-time, making bilayer segmentation a powerful technique for video conferencing and live broadcasting. However, bilayer segmentation cannot capture the fuzzy boundaries surrounding the foreground object caused by hair, fur, or even motion blur. Although the border matting technique, as described in C. Rother, V. Kolmogorov, and A. Blake, “‘GrabCut’: interactive foreground extraction using iterated graph cuts,” Proc. Siggraph, pp. 309-314, 2004 (Rother et al.), is applied to alleviate the aliasing problem along object boundaries, the strong constraint used in border matting limits its capability of handling objects with complex alpha mattes, such as the one shown in FIG. 1. As such, there remains a need for a system and method capable of real-time video matting to extract the alpha matte in so-called “fuzzy” areas within the video image.
The prior art for image matting techniques (as opposed to video matting techniques) was well summarized in J. Wang and M. Cohen, “Image and Video Matting: A Survey,” FTCGV, vol. 3, no. 2, 2007.
There are a number of non-real-time, offline techniques which require a posteriori knowledge, or future frames, to obtain an accurate trimap estimation of the foreground image, background image and boundary. In their Bayesian video matting approach, Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, “Video Matting of Complex Scenes,” Proc. Siggraph, pp. 243-248, 2002 (Chuang et al. 2002) require users to manually specify trimaps for some key frames. These trimaps are then propagated to all frames using the estimated bidirectional optical flows. Finally, the alpha matte for each frame is calculated independently using Bayesian matting, as shown in Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski, “A Bayesian Approach to Digital Matting,” Proc. CVPR, pp. 264-271, 2001 (Chuang et al. 2001). Trimaps may also be generated from binary segmentations, as in two known video object cutout approaches (Y. Li, J. Sun, and H.-Y. Shum, “Video Object Cut and Paste,” Proc. Siggraph, pp. 595-600, 2005 and J. Wang, P. Bhat, R. A. Colburn, M. Agrawala, and M. F. Cohen, “Interactive video cutout,” Proc. Siggraph, pp. 585-594, 2005). Individual frames are over-segmented into homogeneous regions, based on which a 3D graph is constructed. The optimal cut that separates foreground and background regions is found using 3D graph cuts. Pixels within a narrow band of the optimal cut are labelled as unknown regions, with their alpha values estimated using image matting techniques. In the geodesic matting algorithm described in X. Bai and G. Sapiro, “A Geodesic Framework for Fast Interactive Image and Video Segmentation and Matting,” Proc. ICCV, 2007 (Bai & Sapiro), no over-segmentation is required as the algorithm treats the video sequence as a 3D pixel volume. Each pixel is classified into foreground or background based on its weighted geodesic distances to the foreground and background scribbles that users specified for a few key frames.
The alpha values for pixels within a narrow band along the foreground/background boundaries are explicitly computed using geodesic distances. The above approaches are all designed to handle pre-captured video sequences offline, and all of them utilize temporal coherence (i.e. future information) for more accurate results.
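The step of labelling a narrow band around a binary segmentation boundary as unknown can be sketched as follows. This is a simplified, single-frame illustration only: the cited cutout approaches obtain the boundary from 3D graph cuts over a video volume, whereas here the mask is simply given. The function name `trimap_from_mask` and the `band` parameter are hypothetical.

```python
def trimap_from_mask(mask, band):
    """Label each pixel 'F' (foreground), 'B' (background) or 'U' (unknown).

    mask: 2D list of 0/1 values (1 = foreground) from a binary segmentation.
    band: half-width of the unknown region, in Chebyshev distance (assumed).
    A pixel is unknown if any pixel within `band` carries the opposite label.
    """
    h, w = len(mask), len(mask[0])
    trimap = []
    for y in range(h):
        row = []
        for x in range(w):
            label = mask[y][x]
            near_boundary = any(
                mask[ny][nx] != label
                for ny in range(max(0, y - band), min(h, y + band + 1))
                for nx in range(max(0, x - band), min(w, x + band + 1))
            )
            row.append("U" if near_boundary else ("F" if label else "B"))
        trimap.append(row)
    return trimap


# Toy 3x6 mask: left half background, right half foreground.
mask = [[0, 0, 0, 1, 1, 1] for _ in range(3)]
trimap = trimap_from_mask(mask, band=1)
```

Only the pixels labelled `U` then need an alpha estimate from an image matting technique; the rest are fixed at 0 or 1.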
Existing “online”/“real time” video matting techniques suffer from undesirable computational delay or require multiple cameras. For example, in the defocus matting technique of M. McGuire, W. Matusik, H. Pfister, J. F. Hughes, and F. Durand, “Defocus Video Matting,” Proc. Siggraph, 2005, the scene is captured using multiple optically aligned cameras with different focus/aperture settings and the trimap is automatically generated based on the focus regions of captured images. However, the alpha matte is then calculated by solving an error minimization problem, a computation which takes several minutes per frame.
Automatic video matting can also be done using a camera array, as shown in N. Joshi, W. Matusik, and S. Avidan, “Natural Video Matting using Camera Arrays,” Proc. Siggraph, 2006. The captured images are aligned so that the variance of pixels reprojected from the foreground is minimized whereas that of pixels reprojected from the background is maximized. The alpha values are calculated using a variance-based matting equation. The computational cost is linear with respect to the number of cameras and near-real-time processing speed is achieved. In M. McGuire, W. Matusik, and W. Yerazunis, “Practical, Real-time Studio Matting using Dual Imagers,” Proc. Eurographics Symposium on Rendering, 2006 (McGuire et al.), the background screen is illuminated with polarized light and the scene is captured by two cameras, each with a different polarizing filter. Since the background has different colors in the two captured images, simple blue screen matting can be applied to extract the alpha matte in real-time, but only in this controlled setting. These “online” matting approaches require images captured from multiple cameras and utilize additional information, such as focus, polarization settings or viewpoint changes. There is a need for an “online” (i.e. using only current and past frames) and real-time video matting system and method which may be implemented using one camera (i.e. one input video stream) and which can generate accurate alpha matte data in real-time using color information only.
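To illustrate why two views of the same foreground against differently colored backgrounds make alpha easy to compute, the sketch below uses the two-background form of blue screen matting, often called triangulation matting. It is an illustration under stated assumptions (both background colors are known and the foreground contribution is identical in the two images) and is not necessarily the exact formulation used by McGuire et al. The function names and sample colors are hypothetical.

```python
def triangulation_alpha(pixel1, pixel2, bg1, bg2):
    """Alpha from one foreground seen against two known backgrounds.

    From I1 = a*F + (1-a)*B1 and I2 = a*F + (1-a)*B2 it follows that
    I1 - I2 = (1 - a) * (B1 - B2), solved here in the least-squares sense.
    """
    db = [b1 - b2 for b1, b2 in zip(bg1, bg2)]
    denom = sum(d * d for d in db)
    if denom == 0:
        raise ValueError("the two backgrounds must differ")
    num = sum((p1 - p2) * d for p1, p2, d in zip(pixel1, pixel2, db))
    alpha = 1.0 - num / denom
    return min(1.0, max(0.0, alpha))  # clamp against noise


def mix(alpha, fg, bg):
    """Composite helper: I = alpha*F + (1 - alpha)*B per channel."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


fg = (200.0, 120.0, 40.0)          # assumed foreground color
bg1 = (0.0, 0.0, 255.0)            # assumed background seen by imager 1
bg2 = (0.0, 255.0, 0.0)            # assumed background seen by imager 2
alpha = triangulation_alpha(mix(0.6, fg, bg1), mix(0.6, fg, bg2), bg1, bg2)
```

The computation is a handful of multiply-adds per pixel, which is consistent with real-time operation, but it depends entirely on the controlled two-background capture.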
An existing method with a useful approach to the problem in image matting is the Poisson matting algorithm from J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum, “Poisson matting,” Proc. Siggraph, pp. 315-321, 2004 (Sun et al. 2004). However, in Sun et al. 2004, matting is performed on a single color channel k.
Poisson matting is computationally efficient and easy to implement. However, it tends to yield large errors when the trimap is imprecise and/or the background is not smooth. Sun et al. 2004 suggest that manual editing using local Poisson equations can be applied to correct the errors, but this approach is impractical when handling video sequences.
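The core idea of Poisson matting can be sketched in one dimension on a single channel. Taking the gradient of equation (1) and assuming F and B are smooth gives ∇α ≈ ∇I / (F − B); the matte is then the solution of a Poisson equation with boundary values fixed by the trimap. The fragment below is a toy illustration, not the implementation of Sun et al. 2004: it assumes scalar, constant `fg` and `bg` values and uses plain Gauss-Seidel iteration, and the function name `poisson_matte_1d` is hypothetical.

```python
def poisson_matte_1d(intensity, fg, bg, alpha_left, alpha_right,
                     iterations=500):
    """Solve alpha'' = d/dx( I' / (F - B) ) on a 1D scanline.

    intensity: single-channel samples across the unknown region.
    fg, bg: assumed constant foreground/background values (fg != bg).
    alpha_left, alpha_right: known alpha at the two ends (from the trimap).
    """
    n = len(intensity)
    # Approximate matte gradient between samples i and i+1.
    grad = [(intensity[i + 1] - intensity[i]) / (fg - bg)
            for i in range(n - 1)]
    alpha = [alpha_left] + [0.5] * (n - 2) + [alpha_right]
    for _ in range(iterations):
        for i in range(1, n - 1):
            divergence = grad[i] - grad[i - 1]
            # Discrete Poisson update: a[i-1] - 2a[i] + a[i+1] = divergence.
            alpha[i] = 0.5 * (alpha[i - 1] + alpha[i + 1] - divergence)
    return alpha


true_alpha = [0.0, 0.1, 0.3, 0.6, 0.8, 0.95, 1.0]
fg, bg = 220.0, 30.0
intensity = [a * fg + (1.0 - a) * bg for a in true_alpha]
estimate = poisson_matte_1d(intensity, fg, bg, 0.0, 1.0)
```

With constant `fg` and `bg` the recovered matte matches the true one. When the background is textured, its gradient leaks into `grad` and the estimate degrades, which is the failure mode noted above.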
In O. Wang, J. Finger, Q. Yang, J. Davis, and R. Yang, “Automatic Natural Video Matting with Depth,” Proc. PG, 2007, it was shown that additional depth information captured using a depth sensor helps to improve matting quality. However, for Poisson matting, the depth information had previously been used only for validation, since prior Poisson-based methods used only a single color channel.
There is a need for an improved system and method for video matting which does not require future video sequence information to perform matting on the current frame, and which is capable of robust boundary analysis. There is also a need for a video matting system which is able to operate in real time (i.e. during the inter-image capture period of the film process), using only color vector information, and, if available, depth information, to generate a commercially reliable foreground and background extraction. Real-time can be understood as a process which meets either of these criteria.