Video cameras, such as Pan-Tilt-Zoom (PTZ) cameras, are omnipresent nowadays, and are often used for surveillance purposes. The cameras capture more data (video content) than human viewers can process. Automatic analysis of video content is therefore needed.
The terms foreground objects and foreground refer to transient objects that appear in a scene captured on video. Such transient objects may include, for example, moving humans. The remaining part of the scene is considered to be background, even if the remaining part includes movement, such as water ripples or grass moving in the wind.
An important step in the processing of video content is the separation of the content of video frames into foreground objects and a background scene, or background. This process is called foreground/background separation. Such separation allows for further analysis, including, for example, the detection of specific foreground objects, or the tracking of moving objects. Such further analysis may assist, for example, in a decision to send an alert to a security guard.
One approach to foreground/background separation is background subtraction. In one example, a pixel value in a background model, also known as a scene model, is compared against a current pixel value at the corresponding position in an incoming frame. If the current pixel value is similar to the background model pixel value, then the pixel is considered to belong to the background; otherwise, the pixel is considered to belong to a foreground object. A challenge for such approaches is to perform accurate foreground/background separation in scenes that contain background that has a changing appearance. A common source of change in appearance relates to unstable textures, such as shaking trees, waving bushes, and rippling water. These phenomena are also known as dynamic backgrounds.
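By way of illustration only, the per-pixel comparison described above may be sketched as follows. The function name `subtract_background`, the use of single-channel intensity values, and the similarity threshold of 20 are assumptions made for this example and are not taken from any particular system.

```python
def subtract_background(frame, background, threshold=20):
    """Return a mask that is True where a pixel is classified as foreground.

    A pixel is treated as background when its value is within `threshold`
    of the background model value at the same position (assumed similarity test).
    """
    mask = []
    for frame_row, model_row in zip(frame, background):
        mask.append([abs(p - b) > threshold for p, b in zip(frame_row, model_row)])
    return mask


# Hypothetical 2x3 background model and incoming frame (grey-level values).
background = [[100, 100, 100],
              [100, 100, 100]]
frame = [[102, 180, 99],
         [100, 101, 30]]

mask = subtract_background(frame, background)
print(mask)  # [[False, True, False], [False, False, True]]
```

A fixed threshold of this kind is exactly what struggles with dynamic backgrounds: a pixel on a rippling water surface can differ from the model by more than the threshold in most frames without belonging to a foreground object.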
One foreground/background separation technique uses the aggregate brightness and the weighted sum of selected coefficients from Discrete Cosine Transform (DCT) blocks for foreground/background classification. In this technique, a block is considered to be foreground if: (i) the difference of the aggregate brightness between the background and the input is large enough, and/or (ii) the difference of the weighted sum of selected AC coefficients between the background and the input is large enough.
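The two block-level tests may be sketched as follows. This is a minimal illustration under stated assumptions: the DC coefficient stands in for the aggregate brightness, each block is given as a list of DCT coefficients with the DC term at index 0, and the selected AC indices, their weights, and both thresholds are hypothetical values chosen for the example.

```python
# Hypothetical choices: which AC coefficients are selected, and their weights.
AC_WEIGHTS = {1: 1.0, 2: 1.0, 3: 0.5}
DC_THRESHOLD = 40.0   # assumed meaning of "large enough" for brightness
AC_THRESHOLD = 25.0   # assumed meaning of "large enough" for the AC sum


def weighted_ac_sum(coeffs):
    """Weighted sum of the selected AC coefficients of one DCT block."""
    return sum(weight * coeffs[index] for index, weight in AC_WEIGHTS.items())


def classify_block(input_coeffs, background_coeffs):
    """Return (is_foreground, brightness_triggered, ac_triggered) for one block."""
    dc_diff = abs(input_coeffs[0] - background_coeffs[0])
    ac_diff = abs(weighted_ac_sum(input_coeffs) - weighted_ac_sum(background_coeffs))
    brightness_triggered = dc_diff > DC_THRESHOLD
    ac_triggered = ac_diff > AC_THRESHOLD
    return (brightness_triggered or ac_triggered, brightness_triggered, ac_triggered)


background_block = [800.0, 10.0, -5.0, 2.0]

unchanged = classify_block([810.0, 12.0, -4.0, 2.0], background_block)
brighter = classify_block([900.0, 11.0, -5.0, 2.0], background_block)
textured = classify_block([805.0, 60.0, 10.0, 2.0], background_block)

print(unchanged)  # (False, False, False)
print(brighter)   # (True, True, False)
print(textured)   # (True, False, True)
```

Returning the cause of each foreground decision alongside the decision itself is a design choice of this sketch; it keeps the information that a later stage would need in order to distinguish brightness-driven blocks from texture-driven blocks.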
Blocks that are determined to be foreground are grouped together in a connected component step to form one or more “blobs”. A blob is reclassified as a background area with unstable textures if that blob has a high ratio of: (i) foreground blocks due to the difference of the weighted sum of the selected AC coefficients, relative to (ii) foreground blocks due to the difference in the aggregate brightness. Such background area blobs are removed in a post-processing step. However, an incorrect decision will cause entire detections to be incorrectly removed, if the detected object is incorrectly identified as an area with unstable textures. Alternatively, entire detections may incorrectly be kept, if the blob is not correctly identified as background. This leads to misdetections. Furthermore, if a connected component containing a real object (e.g., a human) merges with a connected component containing unstable textures (e.g., rippling water), then the post-processing step can only make a ratio-decision affecting the entire component; the post-processing step cannot filter the part of the merged blob that is due to the unstable textures from the part that is due to foreground. This results in a lost detection, or an incorrectly sized detection.
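The blob-level post-processing may be sketched as follows. The grid encoding ('D' for a block that is foreground due to the brightness difference, 'A' for a block that is foreground due to the AC coefficient difference, '.' for background), the use of 4-connectivity, and the ratio limit of 2.0 are all assumptions made for this example.

```python
from collections import deque


def find_blobs(grid):
    """Group 4-connected foreground blocks (any cell other than '.') into blobs."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == '.' or (r, c) in seen:
                continue
            queue = deque([(r, c)])
            seen.add((r, c))
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] != '.' and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def is_unstable_texture(blob, grid, max_ratio=2.0):
    """True when AC-triggered blocks outnumber brightness-triggered blocks
    by more than max_ratio (written as a product to avoid dividing by zero)."""
    ac_blocks = sum(1 for y, x in blob if grid[y][x] == 'A')
    dc_blocks = sum(1 for y, x in blob if grid[y][x] == 'D')
    return ac_blocks > max_ratio * dc_blocks


grid = [
    "DD...AA",
    "DA...AA",
    ".....AD",
]

blobs = find_blobs(grid)
kept = [blob for blob in blobs if not is_unstable_texture(blob, grid)]
print(len(blobs), len(kept))  # 2 1
```

Because the decision is made once per blob, the single 'D' block in the right-hand blob is discarded together with its 'A' neighbours, and the single 'A' block in the left-hand blob is retained. This is the all-or-nothing behaviour described above.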
Another method divides the image into equal-sized blocks of pixels, and within each block, clusters of homogeneous pixels (pixel clusters) are modelled with Gaussian distributions. Pixel homogeneity is defined by the colour of the pixels in the cluster. Each block is associated with one or more pixel clusters. When matching an input pixel at a block location to a pixel cluster, an attempt is first made to match the pixel visually to the pixel clusters that overlap with the position of the pixel. If the pixel does not visually match any of the pixel clusters that overlap with the position of the pixel, then an attempt is made to match the pixel to the neighbouring pixel clusters within the block that do not overlap with the position of the pixel. If the pixel matches a neighbouring cluster, then it is assumed that the pixel is a dynamic background detection, such as swaying trees or rippling water, and the pixel is considered to be part of the background. This technique is more computationally expensive than the DCT block-based technique described above.
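The two-stage matching order may be sketched as follows. This is a simplified illustration: a single grey-level value replaces colour, each cluster is a one-dimensional Gaussian, and a pixel is deemed to match when it lies within 2.5 standard deviations of the cluster mean. The class `PixelCluster`, the function `match_pixel_to_clusters`, and the matching rule are assumptions made for the example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class PixelCluster:
    mean: float            # mean intensity of the cluster (colour in the real method)
    variance: float        # variance of the Gaussian modelling the cluster
    positions: frozenset   # (y, x) positions within the block that the cluster covers

    def matches(self, value, k=2.5):
        """Assumed visual match: value within k standard deviations of the mean."""
        return abs(value - self.mean) <= k * sqrt(self.variance)


def match_pixel_to_clusters(value, position, clusters):
    """Classify one input pixel against the pixel clusters of its block."""
    # Stage 1: clusters that overlap with the position of the pixel.
    if any(c.matches(value) for c in clusters if position in c.positions):
        return "background"
    # Stage 2: the remaining (neighbouring) clusters in the same block.
    if any(c.matches(value) for c in clusters if position not in c.positions):
        return "dynamic background"
    return "foreground"


# Hypothetical 2x2 block containing a dark "leaves" cluster and a bright "sky" cluster.
leaves = PixelCluster(mean=60.0, variance=16.0, positions=frozenset({(0, 0), (0, 1)}))
sky = PixelCluster(mean=200.0, variance=25.0, positions=frozenset({(1, 0), (1, 1)}))
block_clusters = [leaves, sky]

print(match_pixel_to_clusters(62.0, (0, 0), block_clusters))   # background
print(match_pixel_to_clusters(198.0, (0, 0), block_clusters))  # dynamic background
print(match_pixel_to_clusters(130.0, (0, 0), block_clusters))  # foreground
```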
Another method uses neighbourhood-matching techniques combined with the background subtraction of pixels. If an input pixel does not match the corresponding background model pixel, then a neighbourhood of pixels in the background model is searched to determine if the input pixel matches any pixel in that neighbourhood. The neighbourhood area searched is centred around the corresponding background model pixel. If the input pixel matches a background model pixel in the neighbourhood, then the pixel is assumed to be a dynamic background detection and the pixel is considered to be part of the background. Such techniques are computationally expensive, as these techniques increase the number of matches performed for each input pixel.
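A minimal sketch of such a neighbourhood search is given below. The function name `classify_with_neighbourhood`, the square neighbourhood of radius 1, and the similarity threshold of 20 are assumptions made for the example.

```python
def classify_with_neighbourhood(frame, background, y, x, threshold=20, radius=1):
    """Classify the input pixel at (y, x) as 'background', 'dynamic background'
    or 'foreground' using a square neighbourhood in the background model."""
    value = frame[y][x]
    # First try the corresponding background model pixel.
    if abs(value - background[y][x]) <= threshold:
        return "background"
    # Otherwise search the neighbourhood centred on that model pixel.
    rows, cols = len(background), len(background[0])
    for ny in range(max(0, y - radius), min(rows, y + radius + 1)):
        for nx in range(max(0, x - radius), min(cols, x + radius + 1)):
            if abs(value - background[ny][nx]) <= threshold:
                return "dynamic background"
    return "foreground"


# Hypothetical model: dark foliage (50) next to bright sky (200).
background = [[50, 50, 200],
              [50, 50, 200],
              [50, 50, 200]]
frame = [[50, 50, 200],
         [50, 200, 200],   # the sky value has shifted into the foliage position
         [50, 50, 120]]    # a value that matches neither foliage nor sky

print(classify_with_neighbourhood(frame, background, 0, 0))  # background
print(classify_with_neighbourhood(frame, background, 1, 1))  # dynamic background
print(classify_with_neighbourhood(frame, background, 2, 2))  # foreground
```

With a radius of r, each unmatched input pixel requires up to (2r + 1)^2 comparisons instead of one, which is the source of the additional computational cost noted above.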
Another method models unstable textures by the behaviour of the dynamic areas, instead of by their appearance. To use such a method for background subtraction, a metric is required to test whether an incoming pixel conforms to the learned behaviour in the corresponding area of the model. This method uses an autoregressive moving average (ARMA) model of the scene behaviour, compiling a mathematical model of how pixel values at any point in time are affected by other pixel values across the frame. Such techniques are very computationally expensive and can only be used after a captured length of video is complete.
Thus, a need exists to provide an improved method and system for separating foreground objects from a background in a scene with unstable textures.