Detecting and tracking objects of interest in a video sequence, such as the main characters in a movie or the most important actions in a broadcast football match, makes it possible to determine the positions and trajectories of these objects within the video. This knowledge is essential for the automatic summarization of videos. Video summarization serves several purposes, for example in video surveillance applications, video indexing, or other interactive multimedia applications requiring the management of video content.
When dealing with videos captured with a stationary camera, the objects of interest can be detected using background subtraction techniques. An example of such a technique is disclosed in S. Conseil et al., “Suivi Tridimensionnel en Stéréovision”, GRETSI, 2005, wherein the background is taken as a reference image that is subtracted from every frame in order to detect a human hand.
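By way of illustration only, background subtraction with a fixed reference image may be sketched as follows. The function name `subtract_background`, the threshold value, and the toy grayscale images are assumptions made for this example and are not taken from the cited reference.

```python
def subtract_background(frame, background, threshold=30):
    """Return a binary mask that is 1 where the frame differs from the reference.

    Both images are lists of rows of grayscale intensities (0-255).
    The threshold is a hypothetical value chosen for this toy example.
    """
    return [
        [1 if abs(pixel - ref) > threshold else 0
         for pixel, ref in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, background)
    ]


# Reference image of the empty scene (assumed to be captured once, camera fixed).
background = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

# A later frame in which a bright object covers two pixels of the middle row.
frame = [
    [10, 12, 10, 10],
    [10, 200, 210, 10],
    [10, 10, 10, 10],
]

mask = subtract_background(frame, background)
```

In this sketch, small intensity changes such as sensor noise fall below the threshold and are ignored, while pixels covered by the object are marked in the mask. The approach presupposes that the reference image remains valid for every frame.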
However, background subtraction methods are not suitable for video sequences captured with a moving camera, because the background is liable to change from one scene or frame to the next and therefore cannot be used as a reference. Several alternative methods and devices exist for tracking objects in videos captured with a moving camera. Some examples are described below.
According to a first technique, the user tags the object of interest, i.e., the user manually selects a target object of interest in a frame via a graphical user interface (GUI). Motion and appearance models are then used to follow the selected object across the video sequence in order to determine its trajectory.
A second approach enabling the automatic tracking of objects in a video sequence captured with a moving camera, disclosed in U.S. Pat. No. 5,867,584, requires the user to specify a window that includes the object. This window is then compared with test windows in the subsequent frames in order to find the best-matching window, i.e., the test window most similar to the specified window containing the object.
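The window comparison described above may be illustrated, purely as a sketch, by an exhaustive search that scores every candidate window against the user-specified window. The sum of squared differences used here is an assumed similarity measure chosen for simplicity; it is not asserted to be the measure used in the cited patent, and the helper names `crop`, `ssd`, and `best_match` are hypothetical.

```python
def crop(frame, top, left, height, width):
    """Extract a height x width window whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def ssd(window_a, window_b):
    """Sum of squared differences between two equally sized windows."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(window_a, window_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(frame, template):
    """Return (top, left) of the test window most similar to the template."""
    t_height, t_width = len(template), len(template[0])
    f_height, f_width = len(frame), len(frame[0])
    best = None  # (score, top, left) of the best window seen so far
    for top in range(f_height - t_height + 1):
        for left in range(f_width - t_width + 1):
            score = ssd(crop(frame, top, left, t_height, t_width), template)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best[1], best[2]


# Window specified by the user in the first frame (toy intensities).
template = [
    [9, 8],
    [7, 6],
]

# Subsequent frame in which the same pattern has moved to row 1, column 2.
next_frame = [
    [0, 0, 0, 0],
    [0, 0, 9, 8],
    [0, 0, 7, 6],
]

position = best_match(next_frame, template)
```

Even in this simplified form, the search cannot start until a user has supplied the initial window.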
In a third approach, a priori knowledge of the object to be tracked is required, for example, in the form of databases containing features associated with the object. The system learns a model of the objects of interest in advance, thereby enabling it to detect similar objects within the frames of the video. The database of the system comprises a number of training samples, for example, different kinds of human faces, in order to find faces in the video. A data association step is subsequently performed to link detections of the same object across the video frames into trajectories or tracks. An example of this third approach may be found in T. Ma, L. J. Latecki, “Maximum Weight Cliques with mutex Constraints for Object Segmentation”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012.
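The data association step mentioned above may be sketched as follows. This example uses a simple greedy linking based on bounding-box overlap (intersection over union), which is a stand-in chosen for brevity and is not the method of the cited reference. The function names, the box format `(x1, y1, x2, y2)`, and the overlap threshold are assumptions, and the per-frame detections are taken as given, as if produced by a previously trained detector.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def link_detections(frames, min_iou=0.3):
    """Greedily link per-frame detections into tracks.

    frames is a list (one entry per frame) of lists of boxes.
    Each returned track is a list of (frame_index, box) pairs.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            # Only extend tracks that were seen in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda box: iou(last_box, box))
            if iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
        # Every detection left unmatched starts a new track.
        tracks.extend([(t, box)] for box in unmatched)
    return tracks


# Toy detections: two objects in frames 0 and 1, only the first in frame 2.
detections = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (51, 50, 61, 60)],
    [(2, 2, 12, 12)],
]

tracks = link_detections(detections)
```

In this sketch the linking operates only on detections that a trained detector has already produced, so the resulting tracks remain limited to the object categories for which training samples exist.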
With the techniques described above, either user interaction or prior knowledge of the objects of interest (or both) is required, or the types of objects that can be detected are limited, for example to the category of object for which the system has been trained.
In view of the foregoing, there exists a need for improved automatic detection and tracking of objects of interest in videos captured with a moving camera, without the input of a priori knowledge and independently of databases required for learning models.