Segmenting foreground objects from the background in videos is of great interest in many imaging applications. In video conferencing, once the foreground and background are separated, the background can be replaced by another image, which can beautify the video and protect privacy. The segmented foreground can be compressed and transmitted more efficiently, for example, by using object-based video coding. As an advanced video editing tool, segmentation also allows people to combine multiple objects from different videos and create new and artistic results.
The foreground/background segmentation problem for monocular videos, that is, videos captured with a single camera, has been studied in depth. One area of interest is video captured by static or hand-held cameras of large, moving, non-rigid foreground objects. For instance, the foreground can be the head and shoulders of a talking person, or a dancing character. A static background is typically not assumed in such a scenario, because the camera can shake and there can be moving objects in the background. On the other hand, the background objects are typically assumed to stay roughly where they are, which excludes videos captured by a continuously panning hand-held camera. Such sequences pose three main challenges. First, parts of the foreground and background objects may share similar colors. Second, the foreground objects are typically large, so there can be substantial occlusions between the foreground and background objects. Third, the foreground objects are non-rigid: their motion patterns can be very complex or rapid, which can easily confuse a segmentation algorithm if they are not modeled correctly.
Unlike approaches that use depth information reconstructed from a stereo camera pair, monocular video foreground/background segmentation is under-constrained. One additional assumption that makes the problem fully constrained is that the background scene is static and known a priori. In this case the segmentation problem becomes a background modeling and subtraction problem, which can be solved effectively by pixel-level modeling using Gaussian distributions, mixtures of Gaussians, non-parametric kernel density estimators, or three-state hidden Markov models (HMMs). A separate region-level or even object-level model can be added to improve the background modeling quality in dynamic scenes. Nevertheless, video segmentation based on background modeling can still be confused by moving background objects or motionless foreground objects.
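As an illustration of the simplest pixel-level approach named above, the following is a minimal sketch of background subtraction with a single Gaussian per pixel. It is not taken from any particular prior work: the function names `fit_background` and `segment`, the threshold `k`, the noise floor `min_std`, and the toy grayscale frames are all assumptions made for this example.

```python
import math


def fit_background(frames):
    """Estimate a per-pixel mean and standard deviation from background-only frames."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[y][x] for f in frames) / n for x in range(w)] for y in range(h)]
    std = [[math.sqrt(sum((f[y][x] - mean[y][x]) ** 2 for f in frames) / n)
            for x in range(w)] for y in range(h)]
    return mean, std


def segment(frame, mean, std, k=3.0, min_std=2.0):
    """Mark a pixel as foreground (1) if it lies more than k sigmas from its mean.

    min_std is an assumed noise floor that keeps the threshold from
    collapsing to zero at pixels that never varied during training.
    """
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[y][x] - mean[y][x]) > k * max(std[y][x], min_std) else 0
             for x in range(w)] for y in range(h)]


# Toy 2x3 grayscale sequence: a static background with slight sensor noise.
background_frames = [
    [[10, 10, 200], [10, 10, 200]],
    [[12, 9, 201], [11, 10, 199]],
    [[11, 11, 199], [9, 10, 201]],
]
mean, std = fit_background(background_frames)

# A new frame in which a bright object covers the top-left pixel.
new_frame = [[120, 10, 200], [10, 11, 200]]
mask = segment(new_frame, mean, std)

# A frame in which a bright background object has moved away from the
# top-right pixel: the model flags it as foreground as well.
moved_frame = [[10, 10, 10], [10, 10, 200]]
moved_mask = segment(moved_frame, mean, std)
```

In this sketch the model has no notion of objects, so any pixel that departs from its learned statistics is labeled foreground. This is why a moving background object is misclassified in `moved_mask`, and a foreground object that stays still long enough to be absorbed into the training frames would be missed.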
Another popular assumption made in video segmentation is that the foreground and background objects have different motion patterns. Research in this category, termed layer-based motion segmentation (LBMS), has received much interest in past years. In LBMS, the general objective is to automatically extract motion-coherent regions together with relative depth estimates. Because they primarily target the general motion segmentation problem, existing LBMS approaches are either computationally very expensive, requiring off-line learning and processing, or tend to generate many over-segmented regions. Thus, it is hard to form semantically meaningful objects using LBMS.