A video is a sequence of images. The images may also be referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence, or a single frame of a video. An image is made up of pixels, where each pixel is represented by one or more values describing the visual properties at that pixel. For example, in one scenario three (3) values are used to represent the visual properties of a pixel, namely the Red, Green and Blue colour intensities of the pixel.
The terms foreground objects and foreground refer to transient objects that appear in a scene captured on video. Such transient objects may include, for example, moving humans. The remaining part of the scene is considered to be background, even where the remaining part includes minor movement, such as water ripples or grass moving in the wind.
Scene modelling, also known as background modelling, involves modelling the visual content of a scene, based on an image sequence depicting the scene. One use of scene modelling is foreground segmentation by means of background subtraction. Foreground segmentation is also known as foreground/background separation. Foreground segmentation may also be described by its inverse (i.e., background segmentation). Examples of foreground segmentation applications include activity detection, unusual object or behaviour detection, and scene analysis.
Foreground segmentation allows a video analysis system to distinguish between transient foreground objects and the non-transient background through scene modelling of the non-transient background, and a differencing operation between that background and incoming frames of video. Foreground segmentation can be performed by using scene modelling and identifying portions of the modelled scene which are either moving, or recently changed/added, or both.
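The differencing operation described above can be illustrated with a minimal sketch. It assumes a greyscale image stored as a list of rows, a single static background image, and a hypothetical fixed `threshold`; none of these choices come from this specification, and practical systems typically use a richer scene model.

```python
def foreground_mask(background, frame, threshold=30):
    """Mark a pixel as foreground (1) when it differs from the
    background by more than `threshold`, otherwise background (0).

    `threshold` is an illustrative value, not one taken from the text.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# A 2x3 background of uniform intensity and a frame in which the
# middle column has changed markedly.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [9, 190, 11]]
mask = foreground_mask(background, frame)
```

Small differences, such as sensor noise, fall under the threshold and remain background, while the changed column is marked as foreground.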
In one scene modelling method, the content of an image is divided into one or more visual elements, and a model of the appearance of each visual element is determined. Examples of possible visual elements include a pixel, or an 8×8 DCT block. Another representation of a visual element is a superpixel visual element. A scene model may maintain a number of models for each visual element location, with each of the maintained models representing different modes of appearance at each location within the scene model. The models maintained by a scene model are known as mode models, and mode models that correspond to background visual elements are known as background modes. For example, there might be one mode model for a visual element in a scene with a light being on, and a second mode model for the same visual element at the same location in the scene with the light off.
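As a rough illustration of dividing an image into visual elements, the sketch below splits a greyscale image into 8×8 blocks and summarises each block by its mean intensity. The mean is used here only as a stand-in for a block's appearance (it is proportional to the DC coefficient of the block's DCT); the function name `block_means` and the decision to ignore partial blocks at the image edges are assumptions made for the example.

```python
def block_means(image, block=8):
    """Split a greyscale image (list of rows) into block x block visual
    elements and return the mean intensity of each element.

    Partial blocks at the right and bottom edges are ignored, which is
    a simplification made for this sketch.
    """
    height, width = len(image), len(image[0])
    means = []
    for top in range(0, height - block + 1, block):
        row_means = []
        for left in range(0, width - block + 1, block):
            total = sum(
                image[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            )
            row_means.append(total / (block * block))
        means.append(row_means)
    return means


# An 8x16 image whose left half is dark and right half is brighter
# yields two visual elements side by side.
image = [[0 if x < 8 else 100 for x in range(16)] for _ in range(8)]
elements = block_means(image)
```

Each entry of `elements` corresponds to one visual element location, for which a scene model could then maintain one or more mode models.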
The visual elements in an input image are matched with mode models of the scene model. A “match mode” is identified for each visual element as the output of scene model matching, e.g. the mode having the best match to the input visual element. A visual element is classified as foreground or background depending on the age of the match mode, which is a temporal characteristic. The “age” of a mode refers to the duration since the mode was created. In one method, age is represented as the number of frames that have passed since the mode was created. The age of the match mode is compared to a pre-defined age threshold to classify a visual element as foreground or background.
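The matching and age-based classification described above might be sketched as follows. The sketch assumes that each mode model stores a single appearance value, that the best match is the mode with the smallest absolute difference, and that a new mode is created when no mode is close enough. The names `ModeModel`, `classify`, `match_threshold` and `age_threshold`, and their default values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModeModel:
    value: float        # appearance of the mode (a single intensity here)
    created_frame: int  # frame number at which the mode was created


def classify(element_value, modes, current_frame,
             match_threshold=20.0, age_threshold=100):
    """Find the match mode for one visual element and classify the
    element by the age (in frames) of that mode."""
    best = min(modes, key=lambda m: abs(m.value - element_value), default=None)
    if best is None or abs(best.value - element_value) > match_threshold:
        # Assumption: no sufficiently close mode exists, so a new mode
        # is created and becomes the match mode, with an age of zero.
        best = ModeModel(element_value, current_frame)
        modes.append(best)
    age = current_frame - best.created_frame
    return "background" if age > age_threshold else "foreground"


modes = [ModeModel(value=50.0, created_frame=0)]
old_mode_label = classify(52.0, modes, current_frame=500)   # matches the old mode
new_mode_label = classify(200.0, modes, current_frame=500)  # creates a new mode
```

In this sketch the first element matches a mode that is 500 frames old and is treated as background, whereas the second element creates a fresh mode and is treated as foreground until that mode's age exceeds the threshold.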
Scene Modelling Techniques are Adaptive
Generally, the background in a video scene is always changing. An example is an outdoor scene where the scene changes as time passes from morning to night, and changes as time passes from night to morning again. Many scene modelling techniques are adaptive and update the scene model to learn changes in the background. In one approach, a learning rate is used to learn changes in the scene. Another approach is age-based: a visual element is classified as background if the matched mode has an age that is greater than a predefined threshold. In general, a change in the scene (i.e., a region of the scene which changes) will be learned, and the change will be classified as being background if the change remains static for a pre-defined minimum amount of time.
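One common way to realise a learning-rate-based update is an exponential moving average, in which each background value moves a fixed fraction of the way towards the incoming frame. The text above does not specify the update rule, so the rule, the function name `update_background` and the value of `learning_rate` in the following sketch are assumptions.

```python
def update_background(background, frame, learning_rate=0.05):
    """Blend the incoming frame into the background estimate.

    Each background value moves a fraction `learning_rate` of the way
    towards the corresponding frame value (exponential moving average).
    """
    return [
        (1 - learning_rate) * b + learning_rate * f
        for b, f in zip(background, frame)
    ]


# A single-pixel scene whose true appearance changes from 100 to 200.
# After 50 frames the background estimate has moved most of the way.
background = [100.0]
for _ in range(50):
    background = update_background(background, [200.0])
```

A larger learning rate absorbs changes into the background more quickly, at the cost of also absorbing static foreground objects sooner.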
A particular challenge to adaptive scene modelling is how to handle the background merging/revealed background situation. The background merging situation refers to scenarios in which a foreground object is merged into the background due to being static for an extended amount of time. An example of a background merging situation is a bus stop scenario. In this example a bus comes to the stop and remains static at the stop for some time. If the bus stays static for a time period greater than an age threshold, then the scene model will be updated to learn and conclude that the bus is a background object. This process of learning and classifying the bus as background may result in a subsequent missed detection of the bus, when the bus comes back to the stop in the future.
The revealed background situation refers to a scenario in which a part of the background scene was initially occluded by an object. Later, however, the object has moved, thereby revealing background which was previously occluded. Existing adaptive scene modelling techniques will detect the revealed area initially as being foreground and will require some time to learn and classify the revealed area as being background. The time required depends on the value of the age threshold parameter.
An existing approach to dealing with the background merging and revealed background situations is to adapt the learning rate of the scene model depending on the colour gradient of the input frame and the background image along the contour of a foreground blob. Foreground blobs are generated by performing a connected component analysis of foreground visual elements. A complete group of connected foreground visual elements is referred to as a blob. However, such techniques do not operate well in scenarios where multiple background modes exist. Another challenge is the robustness of the existing techniques to spurious detections in some frames: because the learning rate depends on blob characteristics on a per-frame basis, such detections may result in spikes in the scene model learning rate.
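The connected component analysis used to generate foreground blobs can be sketched as a breadth-first flood fill over a binary foreground mask. The sketch assumes 4-connectivity (horizontal and vertical neighbours only) and a mask stored as a list of rows; the function name `find_blobs` is hypothetical, and the text does not state which connectivity the existing approach uses.

```python
from collections import deque


def find_blobs(mask):
    """Group connected foreground elements (value 1) into blobs.

    Returns a list of blobs, each a list of (row, column) positions.
    Uses 4-connectivity, which is an assumption of this sketch.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Start a new blob and flood-fill outwards from this element.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


mask = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
blobs = find_blobs(mask)
```

Here the mask contains two separate groups of foreground elements, so two blobs are returned; the contour of each blob is where a gradient-based technique would compare the input frame with the background image.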
Thus, a need exists to provide an improved approach for scene modelling that is both robust to background merging and revealed background situations and relatively computationally inexpensive.