Background modelling involves constructing a model of the visual content of a scene, based on an image sequence depicting the scene. A background model allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through a background-differencing operation.
In one example involving a camera with a fixed field of view, the background model may be a first frame of an image sequence, based on the assumption that the first frame is known to contain non-transient content only. Background-differencing is then performed by subtracting the background model from a later frame in the image sequence to classify portions of the scene as foreground or background. Regions of an image that are different from the background model are classified as foreground, and regions of an image that are similar to the background model are classified as background. The definitions of “different” and “similar” depend upon the background modelling method being used.
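The background-differencing operation described above may be sketched as follows. This is an illustrative sketch only: the per-pixel difference threshold and the representation of frames as two-dimensional lists of grey levels are assumptions, not part of any particular method.

```python
# Illustrative sketch of background-differencing against a first-frame
# background model. The threshold value is an assumed parameter.

DIFF_THRESHOLD = 25  # assumed grey-level difference defining "different"

def background_difference(background, frame, threshold=DIFF_THRESHOLD):
    """Classify each pixel as foreground (True) or background (False)."""
    mask = []
    for bg_row, fr_row in zip(background, frame):
        mask.append([abs(b - f) > threshold for b, f in zip(bg_row, fr_row)])
    return mask

# First frame, assumed to contain only non-transient content.
background = [[10, 10, 10],
              [10, 10, 10]]

# A later frame in which one region has changed substantially.
frame = [[12, 10, 200],
         [ 9, 11, 210]]

mask = background_difference(background, frame)
# Only pixels whose difference exceeds the threshold are foreground.
```

In this sketch, small variations (e.g. 10 versus 12) fall below the threshold and remain background, while the large change on the right-hand side is classified as foreground.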
In another example of a background model, video frames depicting a scene comprise 8×8 DCT (Discrete Cosine Transform) blocks associated with each position in the scene. For each position, several mode models are maintained. The mode models relate to different visual content of the 8×8 blocks encountered at the same scene position, but at different points in time in the image sequence. If a block in a new incoming frame is similar to an existing mode model, the existing mode model can be updated. If the block in a new incoming frame is different from all existing mode models for the scene position, a new mode model is created and initialised with the values of the block in the new incoming frame.
If the closest matching mode in the background model is sufficiently similar to the DCT block in the incoming image sequence, the closest matching mode in the background model is updated with the DCT block data in the incoming image sequence. Otherwise, a new mode in the background model is created.
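The match-or-create behaviour described in the two preceding paragraphs may be sketched as follows. All names, the reduced coefficient representation (a short list standing in for an 8×8 DCT block), the similarity threshold, and the blending update are illustrative assumptions.

```python
# Illustrative sketch of per-position mode models: match an incoming
# block against existing modes, update the closest sufficiently similar
# mode, or create a new mode. Threshold and learning rate are assumed.

SIMILARITY_THRESHOLD = 30

def block_distance(coeffs_a, coeffs_b):
    """Sum of absolute coefficient differences (an assumed measure)."""
    return sum(abs(a - b) for a, b in zip(coeffs_a, coeffs_b))

def match_or_create(mode_models, block_coeffs, learning_rate=0.1):
    """Update the closest sufficiently similar mode, else create one."""
    best, best_dist = None, None
    for mode in mode_models:
        dist = block_distance(mode["coeffs"], block_coeffs)
        if best_dist is None or dist < best_dist:
            best, best_dist = mode, dist
    if best is not None and best_dist <= SIMILARITY_THRESHOLD:
        # Blend the incoming block's coefficients into the matched mode.
        best["coeffs"] = [(1 - learning_rate) * m + learning_rate * b
                          for m, b in zip(best["coeffs"], block_coeffs)]
        best["hits"] += 1
        return best
    new_mode = {"coeffs": list(block_coeffs), "hits": 1}
    mode_models.append(new_mode)
    return new_mode

models = []
match_or_create(models, [100, 20, 5])   # no modes yet: creates one
match_or_create(models, [102, 22, 4])   # similar: updates the first mode
match_or_create(models, [10, 90, 40])   # different: creates a second mode
```

After the three calls above, the model for this scene position holds two modes: one updated twice by the two similar blocks, and a new one initialised from the dissimilar block.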
By keeping track of the temporal characteristics of a mode model and a count of the update frequency of the mode model, a decision can be made as to whether the mode model represents foreground or background. For example, if the time that has passed since the mode was created is longer than a predetermined threshold value, the mode may be classified as background. Otherwise, the mode is classified as foreground. In another example, a mode is classified as background if the mode has been updated more times than a threshold value; otherwise, the mode is classified as foreground.
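The two example classification rules above can be sketched directly; the threshold values are assumptions chosen purely for illustration.

```python
# Illustrative sketch of the two mode-classification rules: by mode age
# and by update (hit) count. Both thresholds are assumed values.

AGE_THRESHOLD = 120   # assumed: frames elapsed since mode creation
HIT_THRESHOLD = 100   # assumed: number of times the mode was updated

def classify_by_age(current_frame, creation_frame):
    """A mode older than the age threshold is considered background."""
    age = current_frame - creation_frame
    return "background" if age > AGE_THRESHOLD else "foreground"

def classify_by_hits(hit_count):
    """A mode updated more often than the threshold is background."""
    return "background" if hit_count > HIT_THRESHOLD else "foreground"
```

For instance, a mode created 200 frames ago would be classified as background under the age rule, while a mode updated only 10 times would remain foreground under the hit-count rule.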
The comparison between a block and a mode model is based on a similarity measure, rather than an exact match. The reason is that the captured representation of the real world varies even when the real world is constant. In addition, there can be small variations in the visual appearance of the real world while there is no semantic change in the scene. For example, a change in lighting changes the visual appearance of objects captured by a sensor. An example of background-differencing using a block/mode model similarity comparison method is the calculation of the weighted sum of the differences between modelled DCT coefficients and the DCT coefficients of a block of a frame undergoing analysis. The calculated weighted sum is compared to a predefined threshold to determine whether the modelled DCT coefficients are sufficiently similar to the DCT coefficients of the block.
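The weighted-sum similarity measure described above may be sketched as follows. The particular weights (here weighting the low-frequency coefficients more heavily, which is one plausible choice) and the threshold value are illustrative assumptions.

```python
# Illustrative sketch of a weighted sum of differences between modelled
# DCT coefficients and the DCT coefficients of an incoming block.
# Weights and threshold are assumed; heavier weight on the DC term is
# one plausible choice, not a prescribed one.

WEIGHTS = [4.0, 2.0, 2.0, 1.0]      # assumed per-coefficient weights
SIMILARITY_THRESHOLD = 50.0         # assumed decision threshold

def weighted_difference(model_coeffs, block_coeffs, weights=WEIGHTS):
    """Weighted sum of absolute coefficient differences."""
    return sum(w * abs(m - b)
               for w, m, b in zip(weights, model_coeffs, block_coeffs))

def is_similar(model_coeffs, block_coeffs,
               threshold=SIMILARITY_THRESHOLD):
    """True if the weighted difference falls within the threshold."""
    return weighted_difference(model_coeffs, block_coeffs) <= threshold
```

Under these assumed values, small coefficient perturbations yield a weighted sum below the threshold (similar), while a large change in a heavily weighted coefficient pushes the sum above it (different).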
Background modelling systems may incorporate techniques to update a scene model (also referred to, throughout this specification, as a “background model”) based on a latest observed image from a frame that is being processed. The update techniques allow background modelling systems to cope with structural changes in the scene. For example, a new painting in a museum will initially be detected as a foreground object, but after some time the painting becomes part of the background model.
The assumption that a background modelling system is initialised with an empty scene is reasonable for security systems with fixed view cameras. A setup time to create a background model from one or more initial frames is insignificant compared to the months or years that a camera will observe the same scene. However, for other scenarios, such as a camera tour (where a pan-tilt camera alternates between various preset views) or consumers shooting short home video movies, or video snippets, the required setup time is too onerous. In addition, in the home video scenario, any setup time would reduce the spontaneity of the recording. For home videos, the value of the recording is often known only after the recording has finished, well after the time for any initialisation of a background model.
In one approach, a recording of video frames is made without initialising a background model based on an empty scene. When the recording is finished, the photographer waits until the scene is empty, and then captures a background photo that is subtracted from frames of the earlier recording to identify foreground objects. This approach is very sensitive to impulse noise, and thus results in low-quality foreground separation. This approach is also sensitive to scene changes, such as lighting changes, that happen during the time that the photographer is waiting for the scene to become empty. Finally, this approach requires the photographer to keep the camera still for a long time, as movement of the camera during the recording results in misalignment between the recorded video and the background photo. The misalignment between the recorded video and the background photo in turn results in poor foreground separation.
Therefore, there is a need to provide a method that reduces the time that a scene is captured for the purpose of enabling foreground/background separation.