Advances in technology have allowed cellular telephones, or smartphones, to include high-quality cameras, making it possible to record at any moment of a user's day. It has become increasingly popular to record video using cellular telephones at public events, such as concerts, theater performances and/or sporting events, and then to store the captured media content, such as an image, a video, an audio recording and/or the like. With the increased popularity of such behavior, the need for analysis of stored media content has also grown, for purposes such as automatic organization of personal media collections, automatic summarization of individual videos, and human-computer visual interaction.
One of the first steps in analyzing videos consists of extracting features from the raw data. These features need to capture the salient information about the video content in a compact form. Many efforts have been made to design features which extract the salient information from each frame of a video, such as color features, texture features, local interest points, etc. Moreover, features which specifically target videos (rather than still images), i.e. features incorporating motion information, have also been developed, such as spatio-temporal interest points. Such temporal features have been shown to perform relatively well on standard datasets of videos captured by professional content producers (e.g. TV producers), in which camera motion is either rare or absent, or is well controlled by the camera operators. However, user generated videos (such as those recorded by ordinary users with their camera-enabled mobile phones) are characterized by a large amount of both intentional and unintentional camera motion, due to the uncontrolled settings and context in which the video recording happens. Temporal features extracted from such user generated videos are likely to perform very poorly, because the motion of the content is confused with the motion of the camera. Methods which are able to cope with these problems are thus essential for the success of any analysis of motion in the content of mobile phone videos.
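The confusion between content motion and camera motion can be illustrated with a small numerical sketch. It is an illustration only and is not the method of the present invention: all function names are hypothetical, point displacements between two frames are assumed to be already available, and the camera is assumed to add the same image shift to every point. The median-based estimate of that shift is likewise an assumption, valid only when most tracked points belong to a static background.

```python
from statistics import median

def observed_displacements(object_motion, camera_shift):
    """Apparent per-point motion in the image: the true scene motion
    plus the image shift induced by the camera (assumed uniform)."""
    cx, cy = camera_shift
    return [(dx + cx, dy + cy) for dx, dy in object_motion]

def estimate_global_motion(displacements):
    """Estimate the camera-induced shift as the per-axis median.
    Assumption: most points lie on a static background."""
    xs = [dx for dx, _ in displacements]
    ys = [dy for _, dy in displacements]
    return (median(xs), median(ys))

def compensate(displacements):
    """Subtract the estimated global shift from every displacement."""
    gx, gy = estimate_global_motion(displacements)
    return [(dx - gx, dy - gy) for dx, dy in displacements]

def motion_energy(displacements):
    """A toy temporal feature: mean squared displacement magnitude."""
    return sum(dx * dx + dy * dy for dx, dy in displacements) / len(displacements)

# Eight static background points and two points on a moving object.
object_motion = [(0.0, 0.0)] * 8 + [(2.0, 0.0)] * 2
# Hypothetical image shift caused by the user moving the phone.
camera_shift = (5.0, -1.0)

observed = observed_displacements(object_motion, camera_shift)
compensated = compensate(observed)

print("true energy:       ", motion_energy(object_motion))
print("observed energy:   ", motion_energy(observed))
print("compensated energy:", motion_energy(compensated))
```

In this toy setting the feature computed on the observed displacements is dominated by the camera shift, while the feature computed after subtracting the estimated global shift matches the one computed on the true object motion. Real handheld footage also involves rotation, zoom and parallax, which a single uniform shift does not model.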
The present invention provides a method for overcoming the limitations of video temporal features which are corrupted by camera motion, or by any other factor, such as a zooming operation, which alters the apparent motion of the recorded objects or scene.