1. Technical Field
The invention is related to interactive browsing of video recordings, and in particular, to a system and method for providing interactive browsing of a video recording via user selection of points within panoramic images that are automatically generated from clusters of one or more frames of the video recording.
2. Related Art
Many video recordings typically record a variety of “scenes” throughout the duration of the video recording. Further, with many such recordings the camera often moves about or points to different areas of particular scenes while the video is being recorded, thereby capturing an even larger number of scenes. Therefore, the amount of video data available to an average consumer can quickly become overwhelming as the number and length of those video recordings increase. Once the video has been recorded, a manual review of the video to determine whether anything of interest has occurred during the period of the recording is typically a time-consuming process, especially for long video recordings. Similarly, where the user is looking for a particular scene, object, or event within the video, it can also be a time-consuming process to locate that information within the video recording. Consequently, there is a need for efficient general-purpose tools to provide the user with a capability to quickly browse and navigate such video information, without requiring the user to actually watch the entire length of the video recording in reviewing the content of that video recording.
A number of conventional schemes have been implemented in an attempt to address this problem. For example, a number of partially or fully automated schemes based on the concept of video “key frames” have been advanced for retrieving and browsing video. In general, such schemes operate by identifying and extracting particular representative frames (i.e., key frames) in a video sequence which meet some predefined criteria, such as motion detection, target object detection, color detection, change detection, scene-cut or scene-change information, etc.
For example, conventional change detection methods, including, for example, pixel-based and region-based methods, are a common way to detect events of interest in video recordings. Typically, such schemes use conventional frame-to-frame differencing methods in various algorithms for identifying the video key frames. In general, such change detection methods often use a threshold value to determine whether a region of an image has changed to a sufficient degree with respect to the background. Such change detection techniques have been further improved by applying “classical” morphological filters or statistically based morphological filters to “clean up” initial pixel level change detection, making detection thresholds more robust.
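By way of illustration only, the following minimal sketch shows one way that pixel-based frame-to-frame differencing, thresholding, and a “classical” morphological clean-up might be combined. It is not taken from any particular conventional scheme. Frames are assumed to be small grayscale images stored as nested lists, and the function names, the threshold of 30, and the `min_changed` count are hypothetical values chosen for the example.

```python
def changed_mask(prev, curr, threshold):
    """Pixel-level change mask: 1 where |curr - prev| exceeds the threshold."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def erode(mask):
    """3x3 binary erosion; pixels on the image border are always cleared."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def dilate(mask):
    """3x3 binary dilation: each set pixel sets its in-bounds neighbours."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out


def open_mask(mask):
    """Morphological opening (erosion, then dilation) removes isolated noise."""
    return dilate(erode(mask))


def is_key_frame(prev, curr, threshold=30, min_changed=4):
    """Flag curr as a candidate key frame if enough pixels survive clean-up.

    threshold and min_changed are illustrative assumptions, not tuned values.
    """
    mask = open_mask(changed_mask(prev, curr, threshold))
    return sum(map(sum, mask)) >= min_changed
```

In this sketch, a single changed pixel is discarded by the opening step, while a compact changed region of at least 3x3 pixels survives it, so the decision depends less on the exact threshold than it would with raw differencing alone.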
Regardless of what methods are used to identify the key frames, once they have been identified, there are a number of schemes that have been adapted to organize the key frames into user selectable indexes back into the original video. For example, one conventional key frame based scheme organizes the key frames into interactive “comic books” comprised of a single frame-sized thumbnail for each scene. A similar scheme organizes the key frames into “video posters.” In general, both of these schemes use different key frame layout schemes to provide the user with a number of user-selectable key frames that are indexed to the original video. In other words, the extracted key frames are typically presented as a series of individual thumbnail images to the user. Related schemes have implemented the concept of hierarchical key-frame clustering to group or cluster particular key-frames based on some similarity metric, thereby creating a cluster or group of theoretically similar video sequences. In any case, whether using individual or clustered key-frames, the user will then select a particular key frame (or cluster) as an entry point into the video so as to play back a portion of the video beginning at or near the time index associated with the selected key frame.
Unfortunately, one problem with such schemes is that conventional key-frame schemes tend to be profuse in unimportant details and insufficient for overall video understanding (only a frame-sized thumbnail per shot). Consequently, long lists of shots or scenes result in an information flood rather than an abstraction of the video recording. In other words, as the length of the video increases, the number of key frames also typically increases. As a result, typical key frame indices can be difficult or time consuming for a user to quickly review. Further, individual key-frame thumbnails are not typically sufficient for a user to judge the relevance of the content represented by that key-frame. In addition, user selection of a particular key frame will play back the entire video sequence associated with that key-frame, even where the user may only be interested in a particular element within the key-frame or video sequence.
Another approach uses the concept of image mosaics to provide video browsing and navigation capabilities similar to those of the video key-frame schemes discussed above. In general, these mosaicing schemes typically generate static mosaic images from particular “scenes” within an overall video recording. In other words, image mosaics are generated from the set of sequential images represented by each scene. That mosaic is then presented to the user in the same manner as the video key-frame schemes described above. In fact, such mosaicing schemes are often simply extensions of conventional key-frame schemes.
For example, as noted above, each key frame provides a representative image as an index to a set of sequential image frames. The mosaicing schemes typically use that same set of image frames to create the mosaic image, which then provides a visual index back to the same set of sequential image frames. In other words, the primary difference is that the same set of image frames is represented either by a representative thumbnail image (key-frame), or by a thumbnail image mosaic. Unfortunately, in addition to having problems similar to those of conventional key-frame schemes, these conventional mosaicing schemes have additional problems, including, for example, constraints on camera motions and requirements that target scenes should be approximately planar and should not contain moving objects.
Still other video indexing schemes have attempted to summarize longer videos by generating a shorter video that preserves the frame rate of key elements of certain portions of the original video, while greatly accelerating portions of the video in which nothing of interest is occurring. These schemes are sometimes referred to as “video skimming” techniques. Such schemes often focus on extracting the most “important” aspects of a video into summary clips that are then concatenated to form the video summary or “skim.” However, even such video skimming techniques can result in lengthy representations of an overall video recording, especially where the length of the video increases and the number of events of interest within the video increases.
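By way of illustration only, the variable-rate idea behind such skimming can be sketched as a frame selection step. The sketch assumes that a per-frame interest flag has already been computed by some other means; the function name `skim_frame_indices` and the `speedup` factor are hypothetical and do not correspond to any particular conventional scheme.

```python
def skim_frame_indices(interesting, speedup=8):
    """Select the frame indices to keep in a skim.

    Every frame flagged as interesting is kept, which preserves the original
    frame rate there. Within each run of uninteresting frames only every
    `speedup`-th frame is kept, which accelerates that portion on playback.
    """
    kept = []
    run = 0  # position within the current run of uninteresting frames
    for i, flag in enumerate(interesting):
        if flag:
            kept.append(i)
            run = 0
        else:
            if run % speedup == 0:
                kept.append(i)
            run += 1
    return kept
```

Because every interesting frame is retained, the length of the resulting skim grows with the number of events of interest, which is consistent with the limitation noted above.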
Therefore, what is needed is a system and method which models the sources of information content in a video recording to provide the user with an overview or summary of the contents of the video. In addition, such a system and method should implement an intuitive video navigation interface that provides the user with a direct entry into particular points or segments within the overall video recording. These particular points or segments should correspond to particular elements or features of video scenes, for targeting playback of only those frames of interest, rather than merely providing a playback of an entire scene corresponding to a selected key-frame or mosaic as with conventional browsing and navigation schemes.