A video is a sequence of images. The images may also be referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence, or a single frame of a video.
An image is made up of pixels, where each pixel is represented by one or more values representing the visual properties at that pixel. For example, in one scenario three (3) values are used to represent Red, Green and Blue colour intensity at the pixel. In another scenario, YCbCr values are used to represent the luma component and the chroma components at the pixel.
Scene modelling, which covers both background modelling and foreground modelling, involves modelling visual content of a scene, based on an image sequence depicting the scene. A usage of scene modelling is foreground segmentation by background subtraction. Foreground segmentation may also be described by its inverse (i.e., background segmentation). Examples of foreground segmentation applications include activity detection, unusual object or behaviour detection, and scene analysis.
Foreground segmentation allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through scene modelling of the non-transient background, and a differencing operation between that background and incoming frames of video. Foreground segmentation can be performed by using scene modelling and identifying portions of the modelled scene which are either moving, or recently changed/added, or both.
To model a scene captured by a video camera, for example, the content of a captured image is often divided into one or more visual elements, and a model of the appearance of each visual element is determined. A scene model may maintain a number of models for each visual element location, each of the maintained models representing a different mode of appearance at that location within the scene model. Each of the models maintained by a scene model is known as a “mode model”. For example, there might be one mode model for a visual element in a scene with a light being on, and a second mode model for the same visual element at the same location in the scene with the light off.
The description of a mode model may be compared against the description of an incoming visual element at the corresponding location in an image of the scene. The description may include, for example, information (e.g., average intensity value, variance value, appearance count of the average intensity value, etc.) relating to pixel values or DCT coefficients. If the description of the incoming visual element is similar to one of the mode models, then temporal information about the mode model, such as age of the mode model, helps to produce information about the scene. For example, if an incoming visual element has the same description as a very old visual element mode model, then the visual element location can be considered to be established background. If an incoming visual element has the same description as a young/recent visual element mode model, then the visual element location might be considered to be background or foreground depending on a threshold value. If the description of the incoming visual element does not match any known mode model, then the visual information at the mode model location has changed and the location of the visual element can be considered to be foreground.
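The match-then-classify logic described above can be illustrated with a minimal sketch. All names and thresholds here (such as match_tol and age_threshold) are illustrative assumptions, not part of any particular method:

```python
# Hypothetical sketch: compare an incoming visual element description
# (here, just a mean intensity) against stored mode models, and use the
# age of the matched mode to decide background vs. foreground.
FOREGROUND, BACKGROUND = "foreground", "background"

def classify(incoming_mean, modes, match_tol=10.0, age_threshold=300):
    """modes: list of dicts with 'mean' and 'age' (frames since creation)."""
    for mode in modes:
        if abs(incoming_mean - mode["mean"]) <= match_tol:
            mode["age"] += 1
            # Old matched mode -> established background; young -> foreground.
            return BACKGROUND if mode["age"] >= age_threshold else FOREGROUND
    # No match: visual content at this location changed, so treat it as
    # foreground and start a new mode model for the new appearance.
    modes.append({"mean": incoming_mean, "age": 1})
    return FOREGROUND
```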
Depending on the scene modelling method, a visual element can refer to a single pixel, an M×N block of pixels or a group of connected pixels (also known as a superpixel). The visual element location can refer to the location of a single pixel, or the location of the top-left corner of each M×N block of pixels or the centroid location of the group of connected pixels. The description of the visual element may contain but not be limited to the average colour intensities observed at the visual element, and/or a set of texture measures around the visual element. In general, any set of features computed over the visual element can be used to describe the visual element.
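As an illustration of describing M×N block visual elements by their average colour, the following sketch (the function name and default block size are assumptions) computes one mean-colour description per block:

```python
import numpy as np

# Illustrative feature extraction for M x N block visual elements:
# the description of each block is simply its average colour.
def block_descriptions(frame, M=8, N=8):
    """frame: H x W x 3 array; returns an (H//M) x (W//N) x 3 array of
    per-block mean colours (trailing partial blocks are dropped)."""
    H, W, C = frame.shape
    blocks = frame[:H - H % M, :W - W % N].reshape(H // M, M, W // N, N, C)
    return blocks.mean(axis=(1, 3))
```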
Scene modelling maintains a number of mode models per visual element; each corresponding to a description of the visual element. Some of these mode models describe the non-transient part of the scene, also known as the background. Other mode models describe the transient part of the scene, also known as the foreground. A dynamic scene modelling method also updates these mode models using the visual properties of incoming images. This updating step ensures the scene model is up to date with the dynamic changes happening in the scene including but not limited to illumination changes, or permanent changes to the background content such as addition, removal or one-off movement of fixed objects.
In one scene modelling method, a mixture of Gaussians (MoG) is used to describe the intensity values at each pixel. Each Gaussian in the mixture is represented by an average μ, a standard deviation σ and a mixture weight ω. The mixture weight ω is proportional to the frequency of appearance of the corresponding intensity mode, and the mixture weights of each MoG sum to one. At each pixel location, the incoming intensity is matched against all Gaussians in the mixture. If the incoming intensity I is within 2.5 standard deviations of a Gaussian mode, that is |I − μ| ≤ 2.5σ, the incoming intensity is said to match that mode. The incoming intensity I is then used to update all matched modes, where the amount of update is inversely proportional to how close I is to the mode average μ. This update scheme, which updates multiple modes at a time, is inefficient and can bring two modes so close together that they have similar averages. Such converged modes waste memory through mode duplication. In general, three to five Gaussian modes are used to model a scene, depending on scene dynamics. If each mode requires the three parameters (μ, σ, ω) in double-precision format, 9 to 15 floating-point values are required by the MoG in total for the respective 3 to 5 Gaussian modes. Traditional MoG background modelling methods also do not distinguish moving cast shadow from foreground pixels. Moving cast shadow from players in a sporting scene, for example, has a similar frequency of appearance to the foreground players. As a result, the moving cast shadow is often classified as foreground, which may not be desirable in applications such as player segmentation and tracking.
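A minimal per-pixel MoG sketch under the matching rule |I − μ| ≤ 2.5σ might look as follows. The learning rate alpha and the fixed update amount rho are simplifying assumptions for illustration, not the exact update scheme described above:

```python
import numpy as np

# Minimal per-pixel mixture-of-Gaussians sketch (hypothetical names).
class GaussianMode:
    def __init__(self, mu, sigma=10.0, weight=0.1):
        self.mu = mu          # average intensity
        self.sigma = sigma    # standard deviation
        self.weight = weight  # mixture weight, proportional to frequency

def match_and_update(modes, intensity, alpha=0.05):
    """Match an incoming intensity against all modes; update matched ones."""
    matched = [m for m in modes if abs(intensity - m.mu) <= 2.5 * m.sigma]
    for m in modes:
        hit = m in matched
        # Weight update: matched modes grow, unmatched modes decay.
        m.weight = (1 - alpha) * m.weight + (alpha if hit else 0.0)
        if hit:
            rho = alpha  # simplified, fixed learning rate for mean/variance
            m.mu = (1 - rho) * m.mu + rho * intensity
            m.sigma = np.sqrt((1 - rho) * m.sigma ** 2
                              + rho * (intensity - m.mu) ** 2)
    # Renormalise so the mixture weights sum to one.
    total = sum(m.weight for m in modes)
    for m in modes:
        m.weight /= total
    return matched
```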
To handle shadow in background subtraction, some prior art methods use a weak shadow model to further distinguish shadow from foreground pixels. The scene is still modelled using MoG scene modes, as in the traditional MoG methods. However, when an input colour I (colour intensities of R, G and B) satisfies the weak shadow model criteria with respect to an expected background colour BG at the same pixel:

angle(I, BG) < Tangle  (1)

Tratio1 < |I|/|BG| < Tratio2  (2)

the input colour I is re-classified as shadow, not foreground. By enforcing a small angle in the RGB colour space between the two colours I and BG, the weak shadow model defines a conic region around the line connecting the origin (R,G,B) = (0,0,0) and the expected background colour (R,G,B) = (RB,GB,BB), with a conic angle of Tangle (e.g., Tangle = 0.1 radian). The ratio of magnitudes |I|/|BG| is limited between two thresholds Tratio1 and Tratio2 (e.g., Tratio1 = 0.4 and Tratio2 = 0.95) to prevent colours that are too dark or too bright with respect to the expected background colour from being classified as shadow. The magnitude ratio |I|/|BG| is also referred to as the luminance ratio or luminance distortion, and the colour angle angle(I, BG) is related to the chrominance distortion used by other similar prior art for shadow detection. The weak shadow model is based on the observation that a shadow colour is a darker version of the lit colour. While it is correct that the luminance of shadow is lower than that of the lit colour, there may also be a chromatic shift between the two colours. A common example is outdoor shadow on a clear sunny day, where the lit colours appear warm from the red hot sun and the shadow colours appear cool from the blue sky. This colour shift from warm to cool as a surface goes from fully lit to fully shaded is referred to as the chroma shift. A strong chroma shift can bring a shadow colour outside the conic region defined by the weak shadow model.
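The weak shadow test of equations (1) and (2) can be sketched directly. The default thresholds follow the example values in the text (Tangle = 0.1 radian, Tratio1 = 0.4, Tratio2 = 0.95); the function name is an assumption:

```python
import numpy as np

# Sketch of the weak shadow model test: a small colour angle (chrominance
# distortion) and a bounded luminance ratio classify a pixel as shadow.
def is_weak_shadow(I, BG, t_angle=0.1, t_ratio1=0.4, t_ratio2=0.95):
    """Return True if input colour I is a plausible shadow of background BG."""
    I = np.asarray(I, dtype=float)
    BG = np.asarray(BG, dtype=float)
    # Angle between the two RGB vectors, i.e. the conic-region test (1).
    cos_a = np.dot(I, BG) / (np.linalg.norm(I) * np.linalg.norm(BG))
    angle = np.arccos(np.clip(cos_a, -1.0, 1.0))
    # Luminance ratio |I|/|BG|, i.e. the brightness test (2).
    ratio = np.linalg.norm(I) / np.linalg.norm(BG)
    return angle < t_angle and t_ratio1 < ratio < t_ratio2
```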
To better handle shadow in background modelling, one prior art method models shadow explicitly using extra MoG shadow modes. The scene is still modelled using MoG scene modes, as in the traditional MoG methods. However, when an input colour I satisfies the weak shadow model criteria with respect to an expected background colour BG at the same pixel, the input colour I is used to update the MoG shadow modes. Once both the scene MoG and shadow MoG are established at every pixel, a shadow flow is computed for each background colour as the difference between the background colour and its corresponding shadow colour. A shadow flow lookup table is then constructed and maintained for every observed RGB colour. This shadow flow can model the chromatic shift due to differently coloured light sources. However, building this shadow flow lookup table takes a long time and doubles the memory requirement compared to traditional MoG background modelling methods. The shadow flow lookup table cannot model background colours that were not previously seen by the MoG scene and shadow modes. As the illumination condition changes, as happens with sun movement during the day, a previously computed shadow flow may no longer be correct.
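A shadow flow lookup table of the kind described might be sketched as follows. The colour quantisation step and the class interface are assumptions for illustration; note how the table returns nothing for background colours it has never observed:

```python
import numpy as np

# Hypothetical sketch of a shadow-flow lookup table: for each (quantised)
# background colour, store the mean offset to its observed shadow colour.
class ShadowFlowTable:
    def __init__(self, step=8):
        self.step = step  # colour quantisation step (assumption)
        self.flow = {}    # key: quantised BG colour -> (mean offset, count)

    def _key(self, colour):
        return tuple(int(c) // self.step for c in colour)

    def update(self, bg_colour, shadow_colour):
        """Shadow flow = shadow colour minus background colour."""
        key = self._key(bg_colour)
        offset = np.asarray(shadow_colour, float) - np.asarray(bg_colour, float)
        old, n = self.flow.get(key, (np.zeros(3), 0))
        self.flow[key] = ((old * n + offset) / (n + 1), n + 1)

    def expected_shadow(self, bg_colour):
        """Return the predicted shadow colour, or None for unseen colours."""
        entry = self.flow.get(self._key(bg_colour))
        if entry is None:
            return None  # the table cannot model unseen background colours
        return np.asarray(bg_colour, float) + entry[0]
```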
None of the above-mentioned MoG scene modelling methods can handle moving cast shadow with a small memory and computational footprint, because they all aim to model a general scene. To improve the performance of background modelling for sport videos, domain knowledge, such as similarly sized players on a large, homogeneously coloured playing field, should be used. Hence, there is a need for a scene modelling method specifically designed for sport videos that has relatively low storage and computation cost but high foreground and shadow segmentation accuracy.