A video is a sequence of images. The images are often referred to as frames. The terms ‘frame’ and ‘image’ are used interchangeably throughout the art and this specification to describe a single image in an image sequence, or a single frame of a video. If a video is captured by a static camera, the background content of the video is said to be stable. Even if the camera is mounted on a fixed platform, the camera can undergo some motion if the platform is shaken or is unstable. The camera motion affects the whole frame, which appears shifted from a previous frame. Shake or instability can therefore be detected by motion estimation between different, often adjacent, frames from the video.
If the shake is of small magnitude compared to the frame size, or of low frequency compared to the frame rate, motion between shaky frames can be approximated by a global translation. Some prior art methods rely on two-dimensional (2D) motion estimation between adjacent video frames to detect shake. However, there are several problems with this global motion estimation approach. First, moving objects in the scene can lead to inaccurate global motion estimation. The larger the moving object, the more bias that object adds to the estimated global motion. Second, dynamic background can also disturb the estimated global motion. Dynamic background refers to the repetitive motion of some background elements, such as outdoor tree movement or indoor rolling escalators. Prior art methods that estimate global motion from the bottom up (i.e. from block-based motion estimation) may be susceptible to this cyclic motion of the dynamic background. Third, a low signal-to-noise ratio (SNR) may make accurate motion estimation difficult. Low SNR cases include, but are not limited to, high noise in low-lit scenes, saturated highlights in night scenes, and low-texture scenes. Low SNR is a problem common to all scene-based shake detection methods.
The detection of shake is useful for many applications, one of which is video change detection using background modelling. Because camera shake shifts image content, the whole shaky frame can be detected as change and the background model is not updated accordingly. This not only causes false detection in the current frame but also leads to a degraded performance of the system over subsequent frames due to the non-updated background model.
As mentioned above, most prior art methods rely on 2D motion estimation for shake detection. 2D shift estimation is costly for large image sizes (e.g. VGA resolution at 640×480 pixels, or full HD resolution at 1920×1080 pixels). Although the frame can be subsampled to a smaller size, the estimated motion on a subsampled image is less accurate. Another way to avoid motion estimation on the large video frames directly is to partition the frames into small image blocks, and to estimate the motion of each block separately. For videos encoded with temporal information (e.g. using the MPEG-2 or H.264 codec), the quantised motions of macro blocks are used by some prior art methods for background stabilisation or foreground tracking. The 16×16-pixel macro blocks are, however, too small for reliable motion estimation. As a result, one prior art method groups several adjacent macro blocks together to form a large region. The macro-block motions are then combined for a more reliable motion estimate of the region. One limitation of using motion vectors from the compression stream is that these macro-block motion vectors may not be very accurate. Macro-block motion estimation is known to optimise for compression ratio rather than accuracy.
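The grouping of macro-block motions into a region motion can be illustrated with a short Python sketch. The text above does not state how the prior art method combines the vectors, so the component-wise median used here is purely an assumption chosen for illustration, as is the function name `region_motion`.

```python
from statistics import median


def region_motion(block_vectors):
    """Combine the (dx, dy) vectors of the macro blocks in one region.

    Assumption: a component-wise median is used as the combiner; the
    prior art method may combine the vectors differently.
    """
    dxs = [dx for dx, _ in block_vectors]
    dys = [dy for _, dy in block_vectors]
    return median(dxs), median(dys)


# Hypothetical region of nine macro blocks: seven follow the camera
# motion and two lie on a moving foreground object.
vectors = [(2, -1)] * 7 + [(9, 4)] * 2
combined = region_motion(vectors)
```

With a robust combiner of this kind, a minority of macro blocks covering a moving object has limited influence on the region estimate, whereas a plain average would be pulled towards the object's motion.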
If the aggregation of macro-block motions is seen as a bottom-up approach to shake detection, a top-down approach would successively sub-divide the image for the purpose of motion detection. Quad-tree motion estimation is an example of this top-down approach, in which a motion vector is estimated for each of the four image quadrants. The quadrant division is accepted if the total sum of squared errors (SSE) of the quadrant pairs after motion compensation is lower than the SSE of the image pair before division. Each quadrant is sub-divided further if the SSE further reduces. This selective quad-tree subdivision results in an unbalanced quad-tree partitioning of the input video frame into rectangular regions, each of which undergoes a homogeneous translational motion. This sparse set of translational motion vectors can be interpolated to a full warp map. The technique can be adapted for shake detection and global background motion estimation. One limitation of this quad-tree motion estimation technique is that the motion of each image patch is estimated using 2D phase correlation, which requires transforming the input images to the Fourier domain and back to the spatial domain.
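The split test described above can be sketched in Python for a single level of subdivision. For brevity, the sketch replaces 2D phase correlation with an exhaustive SSE search over a small shift range, which is an assumption and not the technique used by the quad-tree method itself. The function names, the `max_shift` parameter, and the clamping of coordinates at the frame border are likewise illustrative choices.

```python
def block_sse(ref, cur, top, left, height, width, dx, dy):
    """SSE between a block of ref and the same block of cur displaced by (dx, dy)."""
    rows, cols = len(cur), len(cur[0])
    total = 0
    for y in range(top, top + height):
        for x in range(left, left + width):
            yy = min(max(y + dy, 0), rows - 1)  # clamp at the frame border
            xx = min(max(x + dx, 0), cols - 1)
            total += (ref[y][x] - cur[yy][xx]) ** 2
    return total


def best_motion(ref, cur, top, left, height, width, max_shift=2):
    """Exhaustive search; returns (lowest SSE, (dx, dy)) for the block."""
    return min(
        (block_sse(ref, cur, top, left, height, width, dx, dy), (dx, dy))
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    )


def try_split(ref, cur, top, left, height, width, max_shift=2):
    """Accept the quadrant division only if it lowers the total SSE."""
    parent_sse, parent_mv = best_motion(ref, cur, top, left, height, width, max_shift)
    h2, w2 = height // 2, width // 2
    quadrants = [
        (top, left, h2, w2),
        (top, left + w2, h2, width - w2),
        (top + h2, left, height - h2, w2),
        (top + h2, left + w2, height - h2, width - w2),
    ]
    children = [best_motion(ref, cur, *q, max_shift) for q in quadrants]
    child_sse = sum(sse for sse, _ in children)
    accepted = child_sse < parent_sse
    return accepted, parent_mv, [mv for _, mv in children]
```

Applying `try_split` recursively to each accepted quadrant would yield the unbalanced partitioning described above; the recursion is omitted here to keep the sketch short.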
To avoid the computational complexity and high memory requirement of 2D motion estimation, some prior art methods perform independent shift estimation along the X- and Y-dimensions using image projections. The 2D intensity or gradient energy image is accumulated along the X- and Y-dimensions to provide 1D profiles along the Y- and X-dimensions, respectively. The 1D profiles of two images are then aligned to find the translations in X and Y. Image registration from axis-aligned projections works because natural scenes usually contain horizontal and vertical structures, which show up as alignable peaks in the gradient projection profiles. Additionally, the projections can be accumulated along the 45 and 135 degree orientations to improve the alignment accuracy. Projection-based motion estimation can be used for shake detection; however, the projection profiles are easily corrupted by moving foreground objects or dynamic background, making the estimated motion unreliable.
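As a minimal sketch of the projection-based approach described above, the following Python code accumulates raw intensities along each axis and aligns the resulting 1D profiles by exhaustive search over a small shift range. The function names, the mean-squared-difference cost, and the `max_shift` parameter are illustrative assumptions and are not taken from any particular prior art method; gradient energy could be substituted for raw intensity.

```python
def projections(image):
    """Return (column_profile, row_profile) of a 2D list of intensities."""
    col_profile = [sum(col) for col in zip(*image)]  # accumulate along Y, profile along X
    row_profile = [sum(row) for row in image]        # accumulate along X, profile along Y
    return col_profile, row_profile


def align_1d(ref, cur, max_shift):
    """Return the shift s that best satisfies ref[i] ~ cur[i + s]."""
    n = len(ref)
    best_shift, best_cost = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(n, n - s)  # overlapping index range
        cost = sum((ref[i] - cur[i + s]) ** 2 for i in range(lo, hi)) / (hi - lo)
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift


def estimate_shift(ref_image, cur_image, max_shift=4):
    """Estimate the (dx, dy) translation of cur_image relative to ref_image."""
    ref_cols, ref_rows = projections(ref_image)
    cur_cols, cur_rows = projections(cur_image)
    dx = align_1d(ref_cols, cur_cols, max_shift)
    dy = align_1d(ref_rows, cur_rows, max_shift)
    return dx, dy
```

Each frame is reduced to two short 1D profiles before any alignment takes place, which is the source of the savings in computation and memory over a 2D search. `max_shift` must be smaller than the profile length so that the overlap is never empty.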
To reduce the influence of moving foreground objects in global motion estimation, some prior art methods keep an updated background model for motion estimation against an incoming video frame. The background is updated with an aligned incoming frame using block-based motion estimation. These methods also detect the foreground and mask the detected foreground out when updating the background to avoid corrupting the background model with foreground intensities. By aligning the current frame to the background model instead of a previous frame, the foreground objects in the previous frame do not corrupt the motion estimation. However, the foreground objects in the current frame can still corrupt the estimated motion.
While camera shake detection can be used to improve change detection in video, there are prior art change detection methods that avoid shake detection altogether. Robustness to small shake is built into the change detection mechanism by matching an incoming pixel with a set of neighbouring pixels in the background model. As long as the local motion caused by shake is smaller than the size of the matching neighbourhood, the change detection method can avoid false detection due to global motion. One drawback of this neighbourhood matching method is that slowly moving foreground objects may become blended into the background model and thus would not be detected as change.
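The neighbourhood matching idea can be sketched as follows, assuming a single-image background model and a fixed intensity threshold. The function name `detect_change` and the `radius` and `threshold` parameters are hypothetical and serve only to illustrate the principle.

```python
def detect_change(frame, background, radius=1, threshold=10):
    """Return a binary mask with 1 where a pixel matches no background
    pixel within `radius` of its position (i.e. is flagged as change)."""
    rows, cols = len(frame), len(frame[0])
    mask = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            matched = any(
                abs(frame[y][x] - background[ny][nx]) <= threshold
                for ny in range(max(0, y - radius), min(rows, y + radius + 1))
                for nx in range(max(0, x - radius), min(cols, x + radius + 1))
            )
            mask[y][x] = 0 if matched else 1
    return mask
```

With `radius=0` the test reduces to plain per-pixel differencing, so a one-pixel shift of a strong edge is flagged as change. With `radius=1` the same shift is absorbed by the neighbourhood, while a pixel whose intensity matches no nearby background pixel is still detected.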
None of the above-mentioned shake detection methods can achieve real-time, accurate shake detection in the presence of large moving foreground objects. Hence, there is a need for an improved shake detection method, particularly within a system that is efficient and hardware-friendly, and where the detection is robust to dynamic content in the scene.