Interactive digital matting, the process of extracting a foreground object from an image based on limited user input, is an important task in image and video editing. From a computer vision perspective, this task is extremely challenging because it is massively ill-posed—at each pixel we must estimate the foreground and the background colors, as well as the foreground opacity (“alpha matte”) from a single color measurement. Current approaches either restrict the estimation to a small part of the image, estimating foreground and background colors based on nearby pixels where they are known, or perform iterative nonlinear estimation by alternating foreground and background color estimation with alpha estimation.
Natural image matting and compositing is of central importance in image and video editing. The goal is to extract a foreground object, along with an opacity map (alpha matte), from a natural image, based on a small amount of guidance from the user. FIG. 1 shows how matting is used to extract a foreground object from the image shown in FIG. 1(a) and composite it with a novel background, as shown in FIG. 1(e). Traditionally, this has been done using a trimap interface as shown in FIG. 1(b). As will be shown in the following description, the invention permits a high quality matte shown in FIG. 1(d) to be obtained with a sparse set of scribbles shown in FIG. 1(c).
What distinguishes matting and compositing from simple “cut and paste” operations on the image is the challenge of correctly handling “mixed pixels”. These are pixels in the image whose color is a mixture of the foreground and background colors. Such pixels occur, for example, along object boundaries or in regions containing shadows and transparency. While mixed pixels may represent a small fraction of the image, human observers are remarkably sensitive to their appearance, and even small artifacts could cause the composite to look fake. Formally, image matting methods take as input an image I, which is assumed to be a composite of a foreground image F and a background image B. The color of the i-th pixel is assumed to be a linear combination of the corresponding foreground and background colors,

I_i = α_i F_i + (1 − α_i) B_i    (1)

where α_i is the pixel's foreground opacity. In natural image matting, all quantities on the right hand side of the compositing equation (1) are unknown. Thus, for a 3-channel color image, at each pixel there are 3 equations and 7 unknowns.
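As an illustration only, the compositing equation (1) and the resulting ambiguity can be sketched in a few lines of Python. The function name `composite` and the sample colors are assumptions made for this example and do not appear in the cited methods.

```python
def composite(alpha, fg, bg):
    """Apply equation (1) per channel: I = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))

# A half-transparent pure red foreground over a pure blue background.
mixed = composite(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(mixed)  # (0.5, 0.0, 0.5)

# A fully opaque purple foreground yields exactly the same observed color,
# so the single measurement cannot tell the two explanations apart.
opaque = composite(1.0, (0.5, 0.0, 0.5), (0.0, 0.0, 0.0))
print(opaque == mixed)  # True

# Per-pixel counting for a 3-channel image.
channels = 3
equations = channels            # one equation per color channel
unknowns = 2 * channels + 1     # F and B colors, plus alpha
print(equations, unknowns)      # 3 7
```

Two different choices of α, F and B reproduce the same observed color, which is why additional constraints from the user are needed.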
This is a severely under-constrained problem, and user interaction is required to extract a good matte. Most recent methods expect the user to provide a trimap [1, 2, 4, 5, 12, 14] as a starting point; an example is shown in FIG. 2(e). The trimap is a rough (typically hand-drawn) segmentation of the image into three regions: foreground (shown in white), background (shown in black) and unknown (shown in gray). Given the trimap, these methods typically solve for F, B and α simultaneously, usually by iterative nonlinear optimization that alternates the estimation of F and B with that of α. In practice, this means that for good results the unknown regions in the trimap must be as small as possible. As a consequence, trimap-based approaches typically experience difficulty handling images with a significant portion of mixed pixels or when the foreground object has many holes [15]. In such challenging cases a great deal of experience and user interaction may be necessary to construct a trimap that would yield a good matte. Another problem with the trimap interface is that the user cannot directly influence the matte in the most important part of the image: the mixed pixels. It would clearly be preferable to provide more direct control over these mixed regions.
The requirement of a hand-drawn segmentation becomes far more limiting when one considers image sequences. In these cases the trimap needs to be specified over key frames and interpolated between them. While good results have been obtained by intelligent use of optical flow [4], the amount of interaction grows rapidly with the number of frames.
As noted above, the trimap interface also gives the user no direct influence over the matte in the mixed pixels. When the matte exhibits noticeable artifacts there, the user can only refine the trimap and hope that this improves the results in the mixed region.
As noted above, most existing methods for natural image matting require the input image to be accompanied by a trimap [1, 2, 4, 5, 12, 14], labeling each pixel as foreground, background, or unknown. The goal of the method is to solve the compositing equation (1) for the unknown pixels. This is typically done by exploiting some local regularity assumptions on F and B to predict their values for each pixel in the unknown region. In the Corel KnockOut algorithm [2], F and B are assumed to be smooth and the prediction is based on a weighted average of known foreground and background pixels (closer pixels receive higher weight). Some algorithms [5, 12] assume that the local foreground and background come from a relatively simple color distribution. Perhaps the most successful of these algorithms is the Bayesian matting algorithm [5], where a mixture of oriented Gaussians is used to learn the local distribution and then α, F and B are estimated as the most probable ones given that distribution. Such methods work well when the color distributions of the foreground and the background do not overlap, and the unknown region in the trimap is small. Conversely, as demonstrated in FIG. 2(b), when only a sparse set of constraints is given, these methods could produce a completely erroneous matte.
The Bayesian matting approach has been extended to video in two recent papers. Chuang et al. [4] use optical flow to warp the trimaps between key frames and to dynamically estimate a background model. Apostoloff and Fitzgibbon [1] minimize a global, highly nonlinear cost function over α, F and B for the entire sequence. Their cost function includes the mixture of Gaussians log likelihood for foreground and background along with a term biasing α towards 0 and 1, and a learnt spatiotemporal consistency prior on α. The algorithm can either receive a trimap as input, or try to automatically determine a coarse trimap using background subtraction.
The Poisson matting method [14] also expects a trimap as part of its input, and computes the alpha matte in the mixed region by solving a Poisson equation with the matte gradient field and Dirichlet boundary conditions. In the global Poisson matting method the matte gradient field is approximated as ∇I/(F−B) by taking the gradient of the compositing equation, and neglecting the gradients in F and B. The matte is then found by solving for a function whose gradients are as close as possible to the approximated matte gradient field. Whenever F and B are not sufficiently smooth inside the unknown region, the resulting matte might not be correct, and additional local manipulations may need to be applied interactively to the matte gradient field in order to obtain a satisfactory solution. This interactive refinement process is referred to as local Poisson matting.
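For reference, the approximation of the matte gradient field can be written out as a short derivation. This is a sketch that follows directly from the compositing equation (1), written per color channel with the pixel index omitted; the symbol Ω for the unknown region and the exact form of the final Poisson equation are notation assumed here for illustration.

```latex
% Gradient of the compositing equation I = \alpha F + (1 - \alpha) B:
\nabla I = (F - B)\,\nabla\alpha + \alpha\,\nabla F + (1 - \alpha)\,\nabla B

% Neglecting \nabla F and \nabla B (F and B assumed smooth):
\nabla\alpha \approx \frac{\nabla I}{F - B}

% The matte whose gradient is closest, in the least-squares sense, to this
% field satisfies a Poisson equation inside the unknown region \Omega,
% with \alpha fixed on its boundary (Dirichlet conditions):
\Delta\alpha = \operatorname{div}\!\left(\frac{\nabla I}{F - B}\right)
\quad \text{in } \Omega
```

The dropped terms α∇F and (1−α)∇B make explicit why the result degrades when F and B are not smooth inside the unknown region.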
Recently, several successful approaches for extracting a foreground object from its background have been proposed [3, 9, 11]. These approaches translate simple user-specified constraints (such as scribbles, or a bounding rectangle) into a min-cut problem. Solving the min-cut problem yields a hard binary segmentation, rather than a fractional alpha matte (FIG. 2(c)). The hard segmentation could be transformed into a trimap by erosion, but this could still miss some fine or fuzzy features (FIG. 2(d)). Although Rother et al. [11] perform border matting by fitting a parametric alpha profile in a narrow strip around the hard boundary, this is more akin to feathering than to full alpha matting, since wide fuzzy regions cannot be handled in this manner.
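The conversion of a hard segmentation into a trimap by erosion can be sketched as follows. This is a minimal pure-Python illustration, not the procedure of any cited method: the function names, the square structuring element, the `radius` parameter and the 1.0 / 0.5 / 0.0 label encoding are all assumptions made for this example.

```python
def erode(mask, radius):
    """Binary erosion with a square window; neighbors outside the image are ignored."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                mask[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ))
    return out

def trimap_from_segmentation(fg_mask, radius=1):
    """Label pixels 1.0 (foreground), 0.0 (background) or 0.5 (unknown)."""
    bg_mask = [[1 - v for v in row] for row in fg_mask]
    sure_fg = erode(fg_mask, radius)  # shrink the foreground region
    sure_bg = erode(bg_mask, radius)  # shrink the background region
    return [[1.0 if f else 0.0 if b else 0.5 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(sure_fg, sure_bg)]

# A single-row binary segmentation: 1 = foreground, 0 = background.
segmentation = [[0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]]
trimap = trimap_from_segmentation(segmentation, radius=1)
print(trimap[0])
# [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0]
```

The unknown band is only as wide as the erosion radius on each side of the hard boundary, so any fine or fuzzy feature that the binary segmentation missed beyond that band remains labeled as background.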
Both the colorization method of Levin [7] and the random walk alpha matting method of Grady [6] propagate scribbled constraints to the entire image by minimizing a quadratic cost function. Another scribble-based interface for interactive matting was recently proposed by Wang and Cohen [15]. Starting from a few scribbles indicating a small number of background and foreground pixels, they use belief propagation to iteratively estimate the unknowns at every pixel in the image. While this approach has produced some impressive results, it has the disadvantage of employing an expensive iterative non-linear optimization process, which might converge to different local minima.
Wang's iterative matte optimization attempts to determine for each pixel all of the unknown attributes (F, B and α) and to reduce the uncertainty of these values. Initially, all user-marked pixels have uncertainty of 0 and their α and F or B colors are known. For all other pixels, the uncertainty is initialized to 1 and α is set to 0.5. The approach proceeds iteratively: in each iteration, pixels adjacent to ones with previously estimated parameters are considered and added to the estimated set. The process stops once there are no more unconsidered pixels left and the uncertainty cannot be reduced any further. Belief Propagation (BP) is used in each iteration. The optimization goal in each iteration is to minimize a cost function consisting of a data term and a smoothness term. The data term describes how well the estimated parameters fit the observed color at each pixel. The smoothness term is claimed to penalize “inconsistent alpha value changes between two neighbors”, but in fact it just penalizes any strong change in alpha, because it only looks at the alpha gradient, ignoring the underlying image values. This is an iterative non-linear optimization process, so depending on the initial input scribbles it might converge to a wrong local minimum. Finally, the cost of this method is quite high: 15-20 minutes for a 640 by 480 image.
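The initialization and the growth of the estimated set described above can be sketched as follows. This is a hedged illustration only: the belief propagation step itself is omitted, the input format (sets of pixel coordinates) and the function names are hypothetical, and the use of 4-connected adjacency is an assumption, since the description does not specify the neighborhood.

```python
def initialize(height, width, fg_scribble, bg_scribble):
    """Set alpha = 0.5 and uncertainty = 1 everywhere except user-marked pixels."""
    alpha = [[0.5] * width for _ in range(height)]
    uncertainty = [[1.0] * width for _ in range(height)]
    for (r, c) in fg_scribble:
        alpha[r][c], uncertainty[r][c] = 1.0, 0.0  # known foreground
    for (r, c) in bg_scribble:
        alpha[r][c], uncertainty[r][c] = 0.0, 0.0  # known background
    return alpha, uncertainty

def frontier(estimated, height, width):
    """Pixels adjacent (4-connected, assumed) to the estimated set but not yet in it."""
    result = set()
    for (r, c) in estimated:
        for (nr, nc) in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < height and 0 <= nc < width and (nr, nc) not in estimated:
                result.add((nr, nc))
    return result

# One foreground and one background scribble pixel on a 3 x 4 image.
alpha, uncertainty = initialize(3, 4, fg_scribble={(0, 0)}, bg_scribble={(2, 3)})
estimated = {(0, 0), (2, 3)}
print(sorted(frontier(estimated, 3, 4)))
# [(0, 1), (1, 0), (1, 3), (2, 2)]
```

In each iteration the frontier pixels would be added to the estimated set and their parameters updated, which is the step where the expensive nonlinear optimization takes place.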