Video cameras are omnipresent nowadays, mostly for surveillance purposes. The cameras capture more data (video content) than human viewers can process in a timely manner. Therefore, there is a need for automatic analysis of video content. One step in a method for processing video content is the segmentation of video data into foreground objects and a background scene, or “background”. Such segmentation allows for further analysis of the video content, such as detection of specific foreground objects, or tracking of moving objects.
One type of video camera is a Pan-Tilt-Zoom (PTZ) camera, which has the capability to change its field of view via panning, tilting, zooming, or any combination thereof. PTZ cameras may change their field of view without human intervention, based on preset orientations, or even based on observed video content. For example, when a PTZ camera is tracking a walking person, the camera may pan to keep the person within the field of view of that camera. Therefore, automatic analysis is highly relevant to PTZ cameras.
Video object detection (VOD) approaches typically process one or more video frames from a video sequence and often create a background model based on video frames that have previously been processed, pre-populated training data, or a combination thereof. VOD approaches process content from an input video frame to classify blocks in the input video frame as belonging to either foreground or background of a scene captured by the input video frame, based on an associated background model. VOD approaches usually assume that an input frame from a video sequence is aligned with previous frames in the video sequence. A misalignment between the input frame and the background model can affect the matching of the input frame with the background model and consequently may result in the false classification of blocks in the input frame. Thus, misalignment estimation between a background model and an input frame is required to avoid falsely classifying blocks as non-background due to misalignment.
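The effect of misalignment on block classification can be illustrated with a minimal sketch. It assumes a deliberately simplified background model that stores one mean intensity per block; the function names, the 4-pixel block size, and the threshold of 30 are hypothetical choices made for illustration and are not taken from any particular VOD approach.

```python
def block_means(frame, block):
    """Mean intensity of each block x block tile of a 2-D list of pixel values."""
    rows, cols = len(frame), len(frame[0])
    means = []
    for top in range(0, rows, block):
        row_means = []
        for left in range(0, cols, block):
            tile = [frame[y][x]
                    for y in range(top, min(top + block, rows))
                    for x in range(left, min(left + block, cols))]
            row_means.append(sum(tile) / len(tile))
        means.append(row_means)
    return means


def classify_blocks(frame, background_means, block, threshold):
    """True marks a block as foreground: its mean differs from the model by more than threshold."""
    means = block_means(frame, block)
    return [[abs(m - b) > threshold for m, b in zip(mean_row, bg_row)]
            for mean_row, bg_row in zip(means, background_means)]


def shift_right(frame, dx):
    """Simulate a small camera movement: content moves dx pixels right, left edge repeated."""
    return [[row[max(x - dx, 0)] for x in range(len(row))] for row in frame]


# An empty scene with a vertical edge (for example, a dark wall next to a bright wall).
background = [[0, 0, 0, 0, 200, 200, 200, 200] for _ in range(4)]
model = block_means(background, 4)

aligned = classify_blocks(background, model, 4, 30)
misaligned = classify_blocks(shift_right(background, 2), model, 4, 30)

print(aligned)     # [[False, False]]
print(misaligned)  # [[False, True]]
```

No object has entered the scene in either case, yet the two-pixel shift moves the edge inside the second block and that block is reported as foreground. This is the false classification that misalignment estimation is intended to prevent.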
One existing misalignment estimation approach estimates the misalignment between an input frame from a video sequence and previous frames in the video sequence by using the input frame and a reference frame representing the background. A disadvantage of this approach is that a reference frame representing the background is required. Such reference frames are not available in VOD algorithms that use a background modelling approach based on multiple models.
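As a rough illustration of estimating misalignment from an input frame and a reference frame, the following sketch performs an exhaustive search over small integer translations and keeps the one with the lowest mean absolute difference. This search technique is an assumption chosen for brevity, not the method of the existing approach described above; the function names and the `max_shift` parameter are likewise hypothetical.

```python
def translate(frame, dy, dx, fill=0):
    """Move frame content by (dy, dx) pixels; uncovered pixels take the fill value."""
    rows, cols = len(frame), len(frame[0])
    return [[frame[y - dy][x - dx]
             if 0 <= y - dy < rows and 0 <= x - dx < cols else fill
             for x in range(cols)]
            for y in range(rows)]


def estimate_shift(frame, reference, max_shift):
    """Return the integer (dy, dx) such that frame[y][x] best matches reference[y - dy][x - dx].

    Every translation in [-max_shift, max_shift] is tried on both axes, and
    the cost is the mean absolute difference over the overlapping region.
    """
    rows, cols = len(frame), len(frame[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0, 0
            for y in range(max(0, dy), min(rows, rows + dy)):
                for x in range(max(0, dx), min(cols, cols + dx)):
                    total += abs(frame[y][x] - reference[y - dy][x - dx])
                    count += 1
            if count and total / count < best_cost:
                best, best_cost = (dy, dx), total / count
    return best


# Hypothetical reference frame: a simple intensity ramp, moved down 1 and right 2.
reference = [[10 * y + x for x in range(6)] for y in range(6)]
input_frame = translate(reference, 1, 2)

print(estimate_shift(input_frame, reference, 2))  # (1, 2)
```

The sketch makes the dependency explicit: `estimate_shift` cannot be called without a single `reference` frame, which is exactly what is unavailable when the background is represented by multiple models.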
Another existing misalignment estimation approach uses the two most recently processed frames for misalignment estimation. A disadvantage of this approach is its reliance on the assumption that the two most recently processed frames are a good representation of a current input frame. However, this assumption is not always valid. For example, if a video camera is pointing at a lift door, the two previously processed frames captured the door closed, and the current input frame shows the door open, then the two previously processed frames are not a good representation of the current input frame for the purposes of misalignment estimation.
Thus, a need exists for an improved method for video object detection in the presence of misalignment.