A video is a sequence of images. The images may also be referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence, or a single frame of a video. An image is made up of pixels where each pixel is represented by one or more values representing the visual properties at that pixel. For example, in one scenario three (3) values are used to represent Red, Green and Blue colour intensity at the pixel.
Scene modelling, also known as background modelling, involves modelling visual content of a scene, based on an image sequence depicting the scene. One use of scene modelling is foreground segmentation by background subtraction. Foreground segmentation may also be described by its inverse (i.e., background segmentation). Examples of foreground segmentation applications include activity detection, unusual object or behaviour detection, and scene analysis.
Foreground segmentation allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through scene modelling of the non-transient background, and a differencing operation between that background and incoming frames of video. Foreground segmentation can be performed by using scene modelling and identifying portions of the modelled scene which are either moving, or recently changed/added, or both.
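The differencing operation described above may be illustrated with a minimal sketch. The sketch assumes single-channel (greyscale) frames stored as nested lists and a fixed difference threshold; the function name `foreground_mask` and the threshold value of 30 are hypothetical choices made for illustration only, not part of any particular system.

```python
def foreground_mask(background, frame, threshold=30):
    """Mark a pixel as foreground (1) when it differs from the modelled
    background by more than `threshold`; otherwise mark it background (0).

    `background` and `frame` are equally sized 2-D lists of intensities.
    The threshold value is an illustrative assumption.
    """
    mask = []
    for bg_row, frame_row in zip(background, frame):
        mask.append([
            1 if abs(pixel - bg_pixel) > threshold else 0
            for bg_pixel, pixel in zip(bg_row, frame_row)
        ])
    return mask


# A static background and an incoming frame with a bright object in column 1.
background = [[10, 10, 10],
              [10, 10, 10]]
frame = [[12, 200, 10],
         [10, 190, 9]]

mask = foreground_mask(background, frame)
print(mask)
```

In this sketch the small intensity changes in the outer columns fall under the threshold and remain background, while the large change in the middle column is marked as foreground.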
In one scene modelling method, the content of an image is divided into one or more visual elements, and a model of the appearance of each visual element is determined. A scene model may maintain a number of models for each visual element location, each of the maintained models representing a different mode of appearance at that location within the scene model. Each of the models maintained by a scene model is known as a “mode model” or “background mode”. For example, there might be one mode model for a visual element in a scene with a light being on, and a second mode model for the same visual element at the same location in the scene with the light off.
The description of a mode model may be compared against the description of an incoming visual element at the corresponding location in an image of the scene. The description may include, for example, information relating to pixel values or DCT coefficients. If the description of the incoming visual element is similar to one of the mode models, then temporal information about the mode model, such as age of the mode model, helps to produce information about the scene. For example, if an incoming visual element has the same description as a very old visual element mode model, then the visual element location can be considered to be established background. If an incoming visual element has the same description as a young visual element mode model, then the visual element location might be considered to be background or foreground depending on a threshold value. If the description of the incoming visual element does not match any known mode model, then the visual information at the mode model location has changed and the location of the visual element can be considered to be foreground.
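The matching and age-based classification described above may be sketched as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the `ModeModel` class, the `classify` function, the sum-of-absolute-differences similarity measure, and both threshold values are hypothetical choices introduced here.

```python
from dataclasses import dataclass


@dataclass
class ModeModel:
    description: tuple  # e.g. pixel values or DCT coefficients
    age: int = 0        # temporal information: frames since creation


def classify(modes, incoming, match_threshold=10.0, age_threshold=100):
    """Classify one visual element location as background or foreground.

    The incoming description is compared with every mode model at the
    location (sum of absolute differences, an assumed measure). A match
    with an old mode model gives background; a match with a young mode
    model, or no match at all, gives foreground.
    """
    best = None
    best_distance = None
    for mode in modes:
        distance = sum(abs(a - b) for a, b in zip(mode.description, incoming))
        if distance <= match_threshold and (best is None or distance < best_distance):
            best, best_distance = mode, distance

    if best is None:
        return "foreground"  # no known mode model matches
    return "background" if best.age >= age_threshold else "foreground"


# Two mode models for the same location: one long established, one recent.
modes = [ModeModel((100, 100, 100), age=500),
         ModeModel((20, 20, 20), age=5)]

old_match = classify(modes, (101, 99, 100))
young_match = classify(modes, (21, 20, 20))
no_match = classify(modes, (200, 0, 50))
print(old_match, young_match, no_match)
```

Here the age threshold plays the role of the threshold value mentioned above: raising it makes the sketch slower to accept a recently changed location as background.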
In one method, a visual element is a single pixel. Scene modelling using single pixel visual elements has the disadvantage of high storage and computation cost due to pixel level modelling and comparison.
In another method, a group of 8×8 pixels is used as a visual element, which is referred to as block based scene modelling. Block based scene modelling has lower storage and computation cost than the single pixel based method but suffers from blocky foreground segmentation. Additionally, the block based method has low robustness against shaky videos, caused for example by building tremors for a building mounted camera or by the environment for pole mounted cameras.
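The trade-off between the single pixel and block based methods may be illustrated with a short sketch. The sketch assumes a greyscale frame stored as nested lists and uses the mean intensity of each 8×8 block as its description, which is proportional to the DC coefficient of the block's DCT; the function name `block_means` and the choice of the mean as the sole descriptor are assumptions made for illustration.

```python
BLOCK = 8  # block size in pixels, as in the 8x8 method described above


def block_means(frame, block=BLOCK):
    """Divide `frame` into block x block visual elements and return the
    mean intensity of each one as a 2-D list of descriptors."""
    height, width = len(frame), len(frame[0])
    means = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            values = [
                frame[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            row.append(sum(values) / len(values))
        means.append(row)
    return means


# A 16x16 frame that is dark except for its top-right 8x8 quadrant.
frame = [[0] * 16 for _ in range(16)]
for y in range(8):
    for x in range(8, 16):
        frame[y][x] = 64

descriptors = block_means(frame)
pixel_elements = len(frame) * len(frame[0])
block_elements = len(descriptors) * len(descriptors[0])
print(descriptors, pixel_elements, block_elements)
```

In this sketch the block based representation needs 4 visual elements where the single pixel representation needs 256, which reflects the lower storage and computation cost. The same coarseness means that any foreground decision applies to a whole 8×8 block at once, which is the source of the blocky segmentation noted above.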
Hence, there is a need for a scene modelling method which has relatively lower storage and computation cost as well as higher robustness against shaky videos and more accurate object outlines.