A video is a sequence of images. The images may also be referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence, or a single frame of a video.
An image is made up of pixels where each pixel is represented by one or more values representing the visual properties at that pixel. For example, in one scenario three (3) values are used to represent Red, Green and Blue colour intensity at the pixel.
Scene modelling, which covers both background modelling and foreground modelling, involves modelling visual content of a scene, based on an image sequence depicting the scene. A usage of scene modelling is foreground segmentation by background subtraction. Foreground segmentation may also be described by its inverse (i.e., background segmentation). Examples of foreground segmentation applications include activity detection, unusual object or behaviour detection, and scene analysis.
Foreground segmentation allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through scene modelling of the non-transient background, and a differencing operation between that background and incoming frames of video. Foreground segmentation can be performed by using scene modelling and identifying portions of the modelled scene which are either moving, or recently changed/added, or both.
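The differencing operation described above can be sketched as follows. This is a minimal illustration only, assuming greyscale images stored as nested lists and a fixed, hypothetical threshold of 30 intensity levels; the function name `segment_foreground` is likewise an assumption and does not come from any particular system.

```python
def segment_foreground(background, frame, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background."""
    mask = []
    for bg_row, fr_row in zip(background, frame):
        # A pixel is foreground when its absolute difference from the
        # background model exceeds the (assumed) threshold.
        mask.append([1 if abs(f - b) > threshold else 0
                     for b, f in zip(bg_row, fr_row)])
    return mask


# Hypothetical 2x3 background model and incoming frame.
background = [[100, 100, 100], [100, 100, 100]]
frame = [[102, 180, 99], [100, 175, 101]]
mask = segment_foreground(background, frame)
```

In this example only the middle column differs from the background by more than the threshold, so only that column would be marked as foreground.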
To model a scene captured by a video camera, for example, the content of a captured image is often divided into one or more visual elements, and a model of the appearance of each visual element is determined. A scene model may maintain a number of models for each visual element location, each of the maintained models representing a different mode of appearance at that location within the scene model. Each of the models maintained by a scene model may be known as a “mode model”. For example, there might be one mode model for a visual element in a scene with a light being on, and a second mode model for the same visual element at the same location in the scene with the light off.
The description of a mode model may be compared against the description of an incoming visual element at the corresponding location in an image of the scene. The description may include, for example, information relating to pixel values or DCT coefficients. If the description of the incoming visual element is similar to one of the mode models, then temporal information about the mode model, such as age of the mode model, helps to produce information about the scene. For example, if an incoming visual element has the same description as a very old visual element mode model, then the visual element location can be considered to be established background. If an incoming visual element has the same description as a young visual element mode model, then the visual element location might be considered to be background or foreground depending on a threshold value. If the description of the incoming visual element does not match any known mode model, then the visual information at the mode model location has changed and the location of the visual element can be considered to be foreground.
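The matching and age-based classification described above might be sketched as follows. The sketch assumes a single scalar description per mode model (for example, a mean intensity), a hypothetical matching tolerance `match_tol`, and a hypothetical `age_threshold` in frames; none of these names or values is prescribed by the description above.

```python
from dataclasses import dataclass


@dataclass
class ModeModel:
    description: float  # assumed scalar description, e.g. mean intensity
    age: int            # number of frames since the mode model was created


def classify(modes, incoming, match_tol=10.0, age_threshold=100):
    """Classify one visual element location as background or foreground."""
    matches = [m for m in modes if abs(m.description - incoming) <= match_tol]
    if not matches:
        # No known mode model matches: the content at this location has changed.
        return "foreground"
    best = min(matches, key=lambda m: abs(m.description - incoming))
    # An old matching mode is established background; a young one is
    # treated as foreground until its age reaches the threshold.
    return "background" if best.age >= age_threshold else "foreground"


# Hypothetical location with an old mode (light on) and a young mode.
modes = [ModeModel(description=100.0, age=500),
         ModeModel(description=200.0, age=5)]
```

With these assumed values, an incoming description of 103 would match the old mode model and be classified as background, whereas 198 would match only the young mode model and 50 would match no mode model at all.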
Depending on the scene modelling method, a visual element can refer to a single pixel, an M×N block of pixels or a group of connected pixels (also known as a superpixel). The visual element location can refer to the location of a single pixel, or the location of the top-left corner of each M×N block of pixels or the centroid location of the group of connected pixels. The description of the visual element may contain but not be limited to the average colour intensities observed at the visual element, and/or a set of texture measures around the visual element. In general, any set of features computed over the visual element can be used to describe the visual element.
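As one illustration of the block-based case, the following sketch computes an average-intensity description for each M×N block and keys it by the location of the block's top-left corner. It assumes a greyscale image stored as nested lists, takes M as the block height and N as the block width (the text above does not fix this convention), and ignores any partial blocks at the image border.

```python
def block_descriptions(image, m, n):
    """Map the top-left (row, col) of each M x N block to its mean intensity."""
    height, width = len(image), len(image[0])
    descriptions = {}
    for top in range(0, height - m + 1, m):
        for left in range(0, width - n + 1, n):
            values = [image[r][c]
                      for r in range(top, top + m)
                      for c in range(left, left + n)]
            descriptions[(top, left)] = sum(values) / len(values)
    return descriptions


# Hypothetical 2x4 image split into two 2x2 visual elements.
image = [[10, 20, 30, 40],
         [30, 40, 50, 60]]
descriptions = block_descriptions(image, 2, 2)
```

Other descriptions, such as texture measures, could be stored under the same location keys in place of, or alongside, the mean intensity.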
Scene modelling maintains a number of mode models per visual element, each corresponding to a description of the visual element. Some of these mode models describe the non-transient part of the scene, also known as the background. Other mode models describe the transient part of the scene, also known as the foreground. A dynamic scene modelling method also updates these mode models using the visual properties of incoming images. This updating step ensures the scene model is up to date with the dynamic changes happening in the scene, including but not limited to illumination changes and permanent changes to the background content such as the addition, removal or one-off movement of fixed objects.
In one scene modelling method, a mixture of Gaussian (MoG) modes is used to describe the intensity values at each pixel. Each Gaussian in the mixture is represented by an average μ, a standard deviation σ and a mixture weight ω. The sum of all mixture weights for each MoG equals one. At each pixel location, the incoming intensity is matched to all Gaussians in the mixture. If the distance between the incoming intensity I and the average of a Gaussian mode is within 2.5 standard deviations of that Gaussian distribution, i.e. |I − μ| ≤ 2.5σ, the incoming intensity is said to match the Gaussian mode. The incoming intensity I is then used to update all matched modes, where the amount of update is inversely proportional to how close I is to the mode average μ. This update scheme, which updates multiple modes at a time, can potentially bring two modes closer to each other to a point where the two modes have similar averages. Such converged modes waste memory through mode duplication. If the two modes are a foreground mode and a background mode, mode convergence causes the foreground colour to bleed into the background or vice versa. Bleeding of foreground colour into the dominant background mode is visually noticeable as ghosts in the background image.
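The matching test and the tendency of modes to converge can be illustrated with the following sketch for a single pixel. It is a simplification under stated assumptions: a fixed, hypothetical learning rate `alpha` is used for every matched mode rather than the distance-dependent amount of update described above, and the names `Gaussian`, `matches` and `update_all_matched` are chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mu: float      # mode average
    sigma: float   # standard deviation
    weight: float  # mixture weight


def matches(mode, intensity, k=2.5):
    """Match test: |I - mu| <= 2.5 * sigma."""
    return abs(intensity - mode.mu) <= k * mode.sigma


def update_all_matched(modes, intensity, alpha=0.1):
    """Update every matched mode, then renormalise weights to sum to one."""
    for mode in modes:
        hit = matches(mode, intensity)
        mode.weight = (1 - alpha) * mode.weight + alpha * (1.0 if hit else 0.0)
        if hit:
            diff = intensity - mode.mu
            mode.mu += alpha * diff  # assumed fixed-rate update
            variance = (1 - alpha) * mode.sigma ** 2 + alpha * diff ** 2
            mode.sigma = variance ** 0.5
    total = sum(m.weight for m in modes)
    for mode in modes:
        mode.weight /= total


# Two hypothetical modes whose match regions overlap around intensity 110.
modes = [Gaussian(mu=100.0, sigma=10.0, weight=0.7),
         Gaussian(mu=120.0, sigma=10.0, weight=0.3)]
for _ in range(20):
    update_all_matched(modes, 110.0)
gap = abs(modes[1].mu - modes[0].mu)
```

Because the intensity 110 matches both modes, each frame pulls both averages towards 110, and the initial gap of 20 intensity levels shrinks with every update. This is the kind of mode convergence the paragraph above describes.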
In another MoG scene modelling method, mode convergence is reduced by updating only the closest mode, and only if the incoming intensity is inside the 99% confidence interval of that mode. If none of the modes are updated, the mode with the largest variance (i.e. the least confident mode) is replaced with a new Gaussian mode centred at the incoming intensity. This closest mode update scheme results in multiple non-overlapping Gaussian distributions with narrow standard deviations. As a result, modes are less likely to merge, but at the cost of more memory to store spurious modes in between the dominant background and foreground intensities. In particular, three to five modes are used in normal cases, and more modes are required in more dynamic scenes. If each mode requires three images (μ, σ, ω) in double-precision format, memory to accommodate nine (9) to fifteen (15) floating-point images is required in total.
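A rough sketch of the closest mode update for a single pixel, together with the memory arithmetic, is given below. Several details are assumptions made for illustration: the 99% confidence interval is taken as |I − μ| ≤ 2.576σ, a fixed learning rate `alpha` and an initial standard deviation `init_sigma` are used, the mixture weight update is omitted for brevity, and the 1920×1080 resolution in the usage lines is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mu: float
    sigma: float
    weight: float


def update_closest(modes, intensity, alpha=0.1, k=2.576, init_sigma=15.0):
    """Update only the closest mode, or replace the widest mode if no match."""
    closest = min(modes, key=lambda m: abs(intensity - m.mu))
    diff = intensity - closest.mu
    if abs(diff) <= k * closest.sigma:
        closest.mu += alpha * diff
        variance = (1 - alpha) * closest.sigma ** 2 + alpha * diff ** 2
        closest.sigma = variance ** 0.5
        return closest
    # No mode was updated: replace the least confident (largest variance) mode.
    widest = max(modes, key=lambda m: m.sigma)
    widest.mu, widest.sigma = float(intensity), init_sigma
    return widest


def model_bytes(width, height, num_modes, params_per_mode=3, bytes_per_value=8):
    """Memory for num_modes modes of (mu, sigma, weight) in double precision."""
    return width * height * num_modes * params_per_mode * bytes_per_value


modes = [Gaussian(100.0, 5.0, 0.6), Gaussian(200.0, 20.0, 0.4)]
update_closest(modes, 102.0)  # inside the interval of the first mode
update_closest(modes, 150.0)  # matches nothing, so the widest mode is replaced
low = model_bytes(1920, 1080, 3)   # nine floating-point images
high = model_bytes(1920, 1080, 5)  # fifteen floating-point images
```

Under these assumptions, three to five modes at the assumed resolution correspond to roughly 149 MB to 249 MB of model storage, which illustrates the memory cost noted above.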
In yet another scene modelling method, only one mode is modelled. The single mode is updated using background-classified intensities only. This single mode can model the background image well if the single mode is initialised with a clean background and the background remains static the whole time. The single mode model is also memory efficient. However, if part of the background changes, for example a long-term stationary object like a car moves away after initialisation, the new background behind the change will forever be detected as foreground. This foreground detection is commonly known as the permanent ghost problem.
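The permanent ghost problem can be reproduced with a small sketch of a single-mode model for one pixel. The threshold, the learning rate `alpha` and the intensity values are hypothetical, chosen only to show the behaviour.

```python
def run_single_mode(background, intensities, threshold=20, alpha=0.05):
    """Track one pixel with a single mode updated from background pixels only."""
    decisions = []
    for intensity in intensities:
        is_foreground = abs(intensity - background) > threshold
        if not is_foreground:
            # Only background-classified intensities update the mode.
            background += alpha * (intensity - background)
        decisions.append(is_foreground)
    return background, decisions


# The model is initialised on a parked car (intensity 50). After five frames
# the car leaves, revealing road (intensity 120) for the next fifty frames.
intensities = [50] * 5 + [120] * 50
final_background, decisions = run_single_mode(50, intensities)
```

Since the revealed road is always classified as foreground, it never contributes to the update, so the modelled background stays at the car's intensity and the pixel continues to be reported as foreground for as long as the sequence runs.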
None of the above-mentioned scene modelling methods can be both memory efficient and produce an artefact-free background image. In addition to correct foreground segmentation, an artefact-free background image is important in many applications such as free-viewpoint video synthesis from multiple fixed-viewpoint video streams. Hence, there is a need for a scene modelling method which has relatively low storage and computation cost as well as accurate foreground segmentation and artefact-free background extraction.