A video is a sequence of images. The images are also referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout this specification to describe a single image in an image sequence. An image is made up of visual elements. Visual elements may be, for example, pixels or 8×8 DCT (Discrete Cosine Transform) blocks, as used in JPEG images.
Scene modelling, also known as background modelling, involves modelling the visual content of a scene based on an image sequence depicting the scene. The content typically includes foreground content and background content, and a distinction or separation of the two is often desired.
A common approach to foreground/background segmentation is background subtraction. Background subtraction allows a video analysis system to distinguish between transient foreground objects and the non-transient background, through a differencing operation. For example, a scene model may maintain a number of mode models for each block location of an input video frame, where each block location corresponds to a visual element. The description of a mode model is compared against the description of the current visual element at the corresponding location in the frame. The description may include, for example, information relating to pixel values or DCT coefficients. If the current visual element is similar to at least one of the mode models, the visual element is considered to belong to background. Otherwise, the visual element is considered to belong to a foreground object.
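The comparison described above can be sketched as follows. This is a minimal illustration only: the sum-of-absolute-differences measure, the threshold value, the function names and the sample descriptions are assumptions made for the example and are not taken from any particular system.

```python
def distance(a, b):
    """Sum of absolute differences between two descriptions (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(element, mode_models, threshold=10.0):
    """Return 'background' if the element is similar to at least one mode model."""
    for mode in mode_models:
        if distance(element, mode) < threshold:
            return "background"
    return "foreground"


# Mode models stored for one block location (illustrative values, e.g. two
# DCT coefficients per description).
modes = [[98.0, 4.0], [30.0, 2.0]]

print(classify([100.0, 5.0], modes))   # close to the first mode model
print(classify([200.0, 50.0], modes))  # similar to no mode model
```

In this sketch a visual element is treated as background as soon as one mode model matches, which mirrors the "similar to at least one of the mode models" rule stated above.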
One method uses an initial classification, in the manner described above, to determine whether a visual element belongs to the background or to a foreground object, and then uses this determination as input to a final classification step. In the final classification step, this approach computes a final classification score for each visual element by taking into account the initial classifications of neighbouring visual elements that are correlated with the visual element to be classified. The challenge is to obtain the correlation between visual elements.
Another method of scene modelling uses full-frame cross-correlation between visual elements. In this method, the scene model contains a correlation of the visual elements that represent dynamic texture areas in the frames. The correlation model uses the difference between a frame and a prediction of that frame obtained from a preceding frame. This correlation model is learned from the frame transitions in a training sequence. Two areas of the scene transitioning at the same time will have the same correlation coefficient, regardless of other parameters associated with those two areas of the scene. This method requires training data, which presents a difficulty because appropriate training data is not always available.
In yet another method, a correlation is computed by comparing temporal characteristics. The temporal characteristics record information at each visual element. Examples of temporal characteristics are activity count (number of times the background model was considered representative for the input visual element), creation time (time stamp or frame number corresponding to the creation of the background model), or reappearance time (last time the background model was found not to be representative for the input visual element).
In particular, one method uses an activity count to assist in the classification. Two background mode models are correlated if the difference between the respective activity counts is smaller than a predetermined threshold. When a foreground object remains in the scene depicted in a video sequence and only a part of the object is moving, then new mode models will be created for the moving parts, whereas the same mode models will match the non-moving parts. In this situation, the activity count difference between the mode models representing the moving part and the mode models representing the non-moving part will be large. Consequently, these mode models will not correlate as desired, even though these mode models represent the same real world object.
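The failure case described above may be illustrated with a short sketch. The class name, field names, threshold value and activity counts below are hypothetical and are chosen only to show the effect.

```python
from dataclasses import dataclass


@dataclass
class ModeModel:
    activity_count: int  # number of frames in which this mode matched the input
    creation_time: int   # frame number at which the mode was created


def correlated(a, b, threshold=50):
    """Two mode models correlate if their activity counts differ by less than the threshold."""
    return abs(a.activity_count - b.activity_count) < threshold


# A stationary object with one moving part, observed for about 1000 frames.
non_moving_part = ModeModel(activity_count=900, creation_time=100)
moving_part = ModeModel(activity_count=3, creation_time=997)  # recently re-created

print(correlated(non_moving_part, moving_part))
```

Under these assumed values the two mode models are not correlated, although both belong to the same real world object.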
Thus, a need exists for an improved method for video object detection in video image processing.
When a set of foreground visual elements have been identified, it is valuable to know what is “behind” that foreground. This information can be used to assist in foreground matting, to assist in frame matching for image alignment, or simply as a visual aid to a user.
One method of identifying the background is to take the oldest background mode models from each visual element model. In cases where the foreground simply occludes an established long-term or initialised background, this method will work. However, in a case where the region behind the current foreground has changed, the background returned will be incorrect. For example, consider a scene that is initialised as empty (at t=0 s), and a car enters and parks at some later time, say t=30 s. At a still later time, say t=60 s, the car is considered to be part of the background. At a still later time, say t=90 s, a person walks past, occluding the car. The oldest background will be from t=0 s, and the background returned will not include the car, even though the car is, at t=90 s, part of the background.
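The car example can be expressed as a small sketch for a single visual element located where the person occludes the car. The tuple layout and labels are assumptions made for illustration.

```python
# (label, creation time in seconds) of the mode models held for one
# visual element at t=90 s. Labels and layout are illustrative only.
modes = [("empty road", 0), ("parked car", 30), ("person", 90)]


def oldest_background(modes):
    """Select the mode model with the earliest creation time."""
    return min(modes, key=lambda m: m[1])


print(oldest_background(modes)[0])
```

The selection yields the empty road rather than the parked car, which is the incorrect result described above.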
Another method is to use a threshold time. For the previous example, if the threshold used is time t=60 s, then the correct background will be returned when the person walks past the car. The difficulty with such a method is the selection of an appropriate threshold, which depends on the situation and the task. If an incorrect threshold time is used, say t=30 s, an incorrect background will be returned. The problem thus becomes one of finding an appropriate threshold time, and there are situations in which no such threshold time exists, for example, if the person had been at a different location since t=0 s and only began moving to the new location after the car arrived.
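One possible reading of the threshold-time method is to return the content that was regarded as background at the threshold time. The sketch below follows that reading, which is an assumption, as are the tuple layout and the acceptance times taken from the car example.

```python
# (label, time in seconds at which the mode was accepted as background,
# or None if it never was). Values follow the car example.
modes = [("empty road", 0), ("parked car", 60), ("person", None)]


def background_at(modes, threshold_time):
    """Return the most recent mode accepted as background by the threshold time."""
    candidates = [m for m in modes if m[1] is not None and m[1] <= threshold_time]
    return max(candidates, key=lambda m: m[1])


print(background_at(modes, 60)[0])  # threshold of 60 s
print(background_at(modes, 30)[0])  # threshold of 30 s
```

With a threshold of 60 s the parked car is returned, whereas a threshold of 30 s returns the empty road, so the result depends entirely on the threshold chosen.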
Yet another method is to select a most-recently-seen background. This will work for the previous example, but if the car drives away while the person is still there, the returned background will be incorrect: it will show the car, whereas the desired background is the original, empty scene.
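The most-recently-seen selection, and the case in which it fails, can be sketched as follows. The dictionary keys and the last-seen times are hypothetical values consistent with the car example.

```python
def most_recently_seen_background(modes):
    """Among non-foreground modes, pick the one observed most recently."""
    background = [m for m in modes if not m["foreground"]]
    return max(background, key=lambda m: m["last_seen"])


# At t=90 s the person occludes the parked car at this visual element.
at_90 = [
    {"label": "empty road", "last_seen": 29, "foreground": False},
    {"label": "parked car", "last_seen": 89, "foreground": False},
    {"label": "person", "last_seen": 90, "foreground": True},
]

# At t=120 s the car has driven away, but the person still occludes this
# visual element, so neither background mode has been observed since.
at_120 = [
    {"label": "empty road", "last_seen": 29, "foreground": False},
    {"label": "parked car", "last_seen": 89, "foreground": False},
    {"label": "person", "last_seen": 120, "foreground": True},
]

print(most_recently_seen_background(at_90)["label"])
print(most_recently_seen_background(at_120)["label"])
```

Both calls select the parked car. That is the desired result at t=90 s, but not at t=120 s, by which time the car has left the scene.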
Thus, a need exists for an improved method of identifying unseen background in a frame of a video.
When foreground elements cease to move, the distinction between them and the background becomes difficult to define. Particular difficulties are that different applications place different requirements on this distinction, and that a sensible semantic segmentation must be achieved. As mentioned above, estimation of a background image with the foreground removed is valuable to assist with matting, frame matching, or as a visual aid, and the same uses exist for the removal of some background elements. Furthermore, semantic grouping of different regions enables applications which trace the history of a region of the scene. In another application, the grouping of foreground and background regions by the arrival time of objects allows per-object statistics to be gathered, and allows transitions between foreground and background to be made depending on the application.
One method of separating foreground from background adapts the background continuously to the current frame content. A changed region becomes part of the background when the adapted background becomes sufficiently similar to the region. A particular difficulty with this approach is that different regions, and different parts of the same region, will merge with the background at different times depending on their appearance.
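The dependence of the merge time on appearance can be seen in a simple sketch. A running-average update is assumed here as the adaptation rule; the learning rate, similarity threshold and intensity values are likewise illustrative assumptions.

```python
def frames_until_absorbed(background, region, alpha=0.1, threshold=10.0):
    """Count the frames needed for an adapting background to absorb a static region.

    The background value is updated as a running average (assumed rule) until
    it lies within `threshold` of the region value.
    """
    frames = 0
    while abs(region - background) >= threshold:
        background = (1 - alpha) * background + alpha * region
        frames += 1
    return frames


# Two parts of the same stationary object over a background of intensity 100.
print(frames_until_absorbed(100.0, 120.0))  # part similar to the background
print(frames_until_absorbed(100.0, 200.0))  # part very different from the background
```

Under these assumptions the part that resembles the background is absorbed after 7 frames and the contrasting part after 22 frames, so the object is reported in fragments during the interval between the two.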
Another method establishes an age threshold for each visual element, allowing a foreground region to merge with the background over time. A particular difficulty is the choice of threshold. Another particular difficulty is partial absorption of the region into the background, leading to partial, fragmented, or moving regions being reported to later processing stages.
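Partial absorption under a per-element age threshold can be illustrated as below. The ages, measured in frames, and the threshold value are hypothetical.

```python
def absorbed(ages, age_threshold=100):
    """Flag each visual element whose foreground age has reached the threshold."""
    return [age >= age_threshold for age in ages]


# Foreground ages, in frames, of four visual elements of one object.
# The last two elements were re-created more recently, for example because
# part of the object moved.
region_ages = [120, 120, 95, 40]

print(absorbed(region_ages))
```

Only the first two elements merge with the background, so later processing stages receive a fragment of the original region.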
Yet another method groups all foreground visual elements at a given frame, and makes a decision on merging them into the background based on the averaged statistics of the entire region. A particular difficulty with such methods is that the region may actually be composed of multiple sub-regions, which would result in inaccurate averaged statistics.
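The effect of averaging over a region composed of multiple sub-regions can be sketched as follows. Age is assumed as the statistic being averaged, and the element counts, ages and threshold are illustrative values only.

```python
def merge_region(ages, age_threshold=200):
    """Decide whether to merge a whole region based on its mean element age."""
    mean_age = sum(ages) / len(ages)
    return mean_age >= age_threshold, mean_age


# One connected foreground region that is actually two objects:
# a long-parked car (40 elements, age 300) and a passer-by (10 elements, age 5).
region = [300] * 40 + [5] * 10

decision, mean_age = merge_region(region)
print(decision, mean_age)
```

The mean age of 241 frames exceeds the assumed threshold, so the whole region, including the passer-by, would be merged into the background.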
Thus, a need exists for an improved method of grouping related visible visual elements with known but currently unseen ones.