A video is a sequence of images. The images may also be referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence, or a single frame of a video.
An image is made up of pixels where each pixel is represented by one or more values representing the visual properties at that pixel. For example, in one scenario three (3) values are used to represent Red, Green and Blue colour intensity at the pixel. In another scenario, YCbCr values are used to represent the luma component and the chroma components at the pixel.
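As a minimal sketch of how the two representations above relate, the following converts one 8-bit RGB pixel to YCbCr. The full-range BT.601 weights (as used in JFIF) are an assumption made for illustration; this specification does not state which YCbCr variant is intended, and the function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr.

    Assumes BT.601 luma weights with chroma centred on 128 (the JFIF
    convention). Other YCbCr variants use different constants.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b                # luma component
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b   # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b   # red-difference chroma
    return y, cb, cr
```

For a neutral grey pixel the luma equals the grey level and both chroma components sit at the centre value of 128.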
Scene modelling, which covers both background modelling and foreground modelling, involves modelling visual content of a scene, based on an image sequence depicting the scene. One use of scene modelling is foreground segmentation by background subtraction. Foreground segmentation may also be described by its inverse (i.e., background segmentation). Examples of foreground segmentation applications include activity detection, unusual object or behaviour detection, and scene analysis.
Foreground segmentation allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through scene modelling of the non-transient background, and a differencing operation between that background and incoming frames of video. Foreground segmentation can be performed by using scene modelling and identifying portions of the modelled scene which are either moving, or recently changed/added, or both.
To model a scene captured by a video camera, for example, the content of a captured image is often divided into one or more visual elements, and a model of the appearance of each visual element is determined. A scene model may maintain a number of models for each visual element location, each of the maintained models representing different modes of appearance at each location within the scene model. Each of the models maintained by a scene model is known as “mode model” or “scene model”. For example, there might be one mode model for a visual element in a scene with a light being on, and a second mode model for the same visual element at the same location in the scene with the light off.
The description of a mode model may be compared against the description of an incoming visual element at the corresponding location in an image of the scene. The description may include, for example, information (e.g., average intensity value, variance value, appearance count of the average intensity value, etc.) relating to pixel values or DCT coefficients. If the description of the incoming visual element is similar to one of the mode models, then temporal information about the mode model, such as age of the mode model, helps to produce information about the scene. For example, if an incoming visual element has the same description as a very old visual element mode model, then the visual element location can be considered to be established background. If an incoming visual element has the same description as a young/recent visual element mode model, then the visual element location might be considered to be background or foreground depending on a threshold value. If the description of the incoming visual element does not match any known mode model, then the visual information at the mode model location has changed and the location of the visual element can be considered to be foreground.
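The matching logic described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the description is reduced to a single average intensity, and the names `ModeModel`, `classify_element`, `match_tolerance` and `age_threshold`, together with their default values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModeModel:
    mean_intensity: float  # average intensity described by this mode model
    age: int               # number of frames since the mode model was created


def classify_element(intensity, modes, match_tolerance=10.0, age_threshold=100):
    """Classify one visual element location as 'background' or 'foreground'.

    match_tolerance and age_threshold are illustrative values only.
    """
    best = None
    for mode in modes:
        distance = abs(intensity - mode.mean_intensity)
        # Keep the closest mode model that lies within the matching tolerance.
        if distance <= match_tolerance and (best is None or distance < best[0]):
            best = (distance, mode)
    if best is None:
        # No known mode model matches: the visual information has changed.
        return "foreground"
    # An old matching mode is established background; a young one is not yet.
    return "background" if best[1].age >= age_threshold else "foreground"
```

With one old mode model and one recent mode model at the same location, an incoming element matching the old model is labelled background, while an element matching only the recent model, or no model at all, is labelled foreground.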
Depending on the scene modelling method, a visual element can refer to a single pixel, an M×N block of pixels or a group of connected pixels (also known as a superpixel). The visual element location can refer to the location of a single pixel, or the location of the top-left corner of each M×N block of pixels or the centroid location of the group of connected pixels. The description of the visual element may contain but not be limited to the average colour intensities observed at the visual element, and/or a set of texture measures around the visual element. In general, any set of features computed over the visual element can be used to describe the visual element.
Scene modelling maintains a number of mode models per visual element, each corresponding to a description of the visual element. Some of these mode models describe the non-transient part of the scene, also known as the background. Other mode models describe the transient part of the scene, also known as the foreground. A dynamic scene modelling method also updates these mode models using the visual properties of incoming images. This updating step ensures the scene model is up to date with the dynamic changes happening in the scene including, but not limited to, illumination changes and permanent changes to the background content such as addition, removal or one-off movement of fixed objects.
In one scene modelling method, a mixture of Gaussian (MoG) modes is used to describe the intensity values at each pixel. Each Gaussian in the mixture is represented by an average μ, a standard deviation σ and a mixture weight ω. The mixture weight ω is proportional to the frequency of appearance of the corresponding intensity mode. The sum of all mixture weights for each MoG equals one. At each pixel location, the incoming intensity is matched to all Gaussians in the mixture. If the distance between the incoming intensity I and the average μ of a Gaussian mode is within 2.5 standard deviations σ of that mode, that is, |I−μ|≤2.5σ, the incoming intensity is said to match the Gaussian mode. The incoming intensity I is then used to update all matched modes, where the amount of update is inversely proportional to how close I is to the mode average μ. This update scheme, which updates multiple modes at a time, is inefficient and can potentially bring two modes closer to each other to a point where the two modes have similar averages. Such converged modes waste memory through mode duplication. In general, three to five Gaussian modes are used to model a scene depending on scene dynamics. If each mode requires the three parameters (μ, σ, ω) in double-precision format, 9 to 15 floating-point values are required by the MoG in total for the respective 3 to 5 Gaussian modes.
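A minimal sketch of the matching test and the per-pixel storage cost described above is given below. It covers only the 2.5σ match and the parameter count, not the update step. The names `GaussianMode`, `matching_modes` and `mog_storage_bytes` are hypothetical, and 8 bytes per value is assumed for double precision.

```python
from dataclasses import dataclass


@dataclass
class GaussianMode:
    mean: float    # average μ of the intensity mode
    std: float     # standard deviation σ of the intensity mode
    weight: float  # mixture weight ω; weights of one mixture sum to one


def matching_modes(intensity, mixture, k=2.5):
    """Return every mode whose average lies within k standard deviations of I."""
    return [m for m in mixture if abs(intensity - m.mean) <= k * m.std]


def mog_storage_bytes(num_modes, bytes_per_value=8):
    """Per-pixel storage: three parameters (μ, σ, ω) per mode."""
    return num_modes * 3 * bytes_per_value
```

Two modes with nearby averages can both match one incoming intensity, which is the situation in which every matched mode is updated and the modes may converge. Under the 8-byte assumption, three modes need 72 bytes per pixel and five modes need 120 bytes per pixel.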
A second scene modelling method uses a convolutional neural network (CNN), for which many architectures exist. In one example, a CNN includes two parts, a convolution network and a deconvolution network. The convolution network has a sequence of convolution layers of various sizes that extract features from an input frame and transform the frame into a multi-dimensional feature representation. In the deconvolution network, a sequence of convolution layers of various sizes produces a probability map from this multi-dimensional feature representation. The probability map has the same dimensions as the input frame and indicates the probability of each pixel in the input frame being part of the foreground. Multiple filters perform convolution in each of the convolution layers. As a result, the amount of processing and memory required grows proportionally to the complexity of the CNN architecture.
To reduce the area to check for foreground objects, one method utilises a user-defined region of interest. When a moving object in the image overlaps with the user-defined region, the scene modelling method designates the object as foreground. Generally, the region of interest occupies only a portion of the input frame, so the amount of processing is reduced. However, this requires a user to input the region of interest, which is time-consuming and limits the flexibility in handling input frames from non-stationary cameras.
To define the region of interest automatically, another method uses a histogram technique to classify pixels in an input frame. The method uses two histograms, one for pitch colours and another for non-pitch colours. A number of training frames are used to populate the histograms. The pixels in the training frames are labelled either as pitch pixels or non-pitch pixels. The labelling process is performed manually or by using a semi-supervised method. Pixels labelled as pitch in the training frames are added to the pitch colour histogram, whereas pixels labelled as non-pitch are added to the non-pitch colour histogram. After the training, the probability function of a colour being part of a pitch becomes:
$$P(c \mid \text{pitch}) = \frac{H_{\text{pitch}}(c)}{\operatorname{sum}(H_{\text{pitch}})}$$
where $H_{\text{pitch}}(c)$ is the number of pixels with the colour $c$ in the pitch histogram, and $\operatorname{sum}(H_{\text{pitch}})$ is the total number of pixels in the pitch histogram.
Similarly, the probability function of a colour being non-pitch is:
$$P(c \mid \text{non\_pitch}) = \frac{H_{\text{non\_pitch}}(c)}{\operatorname{sum}(H_{\text{non\_pitch}})}$$
Thus, a pitch pixel classifier to determine if a colour c is to be labelled as the pitch area becomes:
$$\frac{P(c \mid \text{pitch})}{P(c \mid \text{non\_pitch})} \geq \text{threshold}$$
where the threshold is user-defined.
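The histogram-based classifier above can be sketched as follows. This is a minimal illustration, assuming colours are hashable values such as quantised (R, G, B) tuples. The function names and the default threshold are hypothetical. Two further assumptions are made: the ratio test is evaluated in multiplied-out form to avoid dividing by zero, and a colour never seen in the pitch histogram is treated as non-pitch.

```python
from collections import Counter


def build_histogram(labelled_pixels):
    """Count how often each colour occurs among the pixels of one label."""
    return Counter(labelled_pixels)


def colour_probability(histogram, colour):
    """H(c) / sum(H): fraction of the histogram's pixels having this colour."""
    total = sum(histogram.values())
    return histogram[colour] / total if total else 0.0


def is_pitch(colour, pitch_hist, non_pitch_hist, threshold=1.0):
    """Apply the test P(c|pitch) / P(c|non_pitch) >= threshold."""
    p_pitch = colour_probability(pitch_hist, colour)
    p_non_pitch = colour_probability(non_pitch_hist, colour)
    if p_pitch == 0.0:
        # Assumption: a colour absent from the pitch histogram is non-pitch.
        return False
    # Multiplied-out form of the ratio test, safe when p_non_pitch is zero.
    return p_pitch >= threshold * p_non_pitch
```

A colour that appears only in the pitch histogram always passes the test, while a colour shared by both histograms passes only when its relative frequency on the pitch is high enough compared with the threshold.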
The pitch pixel classifier is used to determine the preliminary pitch area in an input frame. The method further refines the preliminary pitch area by applying a morphological opening operation to remove small false positive noise. Enclosed within the refined pitch area are black areas left by potential foreground objects. These black areas become the regions of interest that are further processed to extract foreground objects.
The pitch pixel classifier provides background culling to reduce the processing of the scene modelling method. However, as foreground objects have to be completely within the pitch area to form holes, foreground objects that intersect with the pitch area's boundary cannot be detected by the method.
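The refinement step and its limitation at the pitch boundary can be sketched as follows, with the pitch area held as a binary mask (1 for pitch, 0 otherwise) in nested lists. The 3×3 square structuring element, the treatment of pixels outside the frame as non-pitch, the use of 4-connectivity for holes and all function names are assumptions made for illustration.

```python
def _neighbourhood(mask, y, x):
    """Yield the 3x3 neighbourhood of (y, x); pixels outside the frame count as 0."""
    h, w = len(mask), len(mask[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            yield mask[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is pitch."""
    return [[int(all(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is pitch."""
    return [[int(any(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def morphological_opening(mask):
    """Erosion followed by dilation; removes small isolated pitch responses."""
    return dilate(erode(mask))


def enclosed_holes(mask):
    """Return (row, col) of non-pitch pixels not connected to the frame border."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    # Seed a flood fill with every non-pitch pixel on the frame border.
    stack = [(y, x) for y in range(h) for x in range(w)
             if (y in (0, h - 1) or x in (0, w - 1)) and not mask[y][x]]
    while stack:
        y, x = stack.pop()
        if outside[y][x]:
            continue
        outside[y][x] = True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx] and not outside[ny][nx]:
                stack.append((ny, nx))
    # Non-pitch pixels that the fill never reached are enclosed by pitch.
    return [(y, x) for y in range(h) for x in range(w)
            if not mask[y][x] and not outside[y][x]]
```

In this sketch, a non-pitch region fully surrounded by pitch pixels is returned as a hole, whereas a non-pitch region that touches the non-pitch area outside the pitch is absorbed into the outside and is not reported.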
None of the above-mentioned methods can determine, fully automatically, regions of interest that also include all foreground objects on the pitch. To overcome these deficiencies, there is a need for a fast background culling method that determines regions of interest including all foreground objects on a sporting pitch while using fewer resources than a scene modelling method such as MoG or CNN.