High-quality digital camcorders and cameras capable of capturing video footage and storing the footage directly to a mass-storage device are widely available. The storage usually involves encoding the raw video into a compressed format suitable for efficient storage and also for communication, for example, via streaming. A number of formats exist, including those in the MPEG family, such as MPEG-2. Users, and particularly unskilled or “domestic” users, tend to shoot a large amount of video content and then dump the content to a hard disk, or other storage device such as an optical CD, with the expectation that they will subsequently be able to locate desired or interesting video segments, given the random-access nature of the storage medium and the video formats. Nevertheless, the reality is that to locate a video segment of interest, the user has to manually browse through the content in a more or less linear fashion. This is particularly the case in respect of a single clip of video capture, where metadata (other than the format, file size or actual time of capture) is not generally available to permit the user to distinguish between portions of the clip. A clip is a contiguous sequence of image frames starting with an “in” frame and ending with an “out” frame. The “in” and “out” frames are usually indicative of the commencement and cessation of recording in raw video, or corresponding edit points in edited video.
To reduce the amount of effort required to locate desired or interesting video segments, some systems perform automatic segmentation of the video content. Once a video is segmented, a user can search some of the segments instead of the whole video for the interesting passages, or simply use the segments automatically created for them. Unfortunately, existing systems are far from perfect. Most perform pixel-based segmentation that detects segment boundaries using a sum of pixel differences, a count of changed pixels, motion-compensated pixel differences, global/local color histograms, edge pixels, and other approaches. These methods require the video to be fully decompressed and are only possible on a more capable device such as a personal computer, as compared to a less capable device, such as the video camera by which the video was captured, or a simpler reproduction device such as a set-top box (STB). Even with a personal computer, the known automatic segmentation methods demand considerable processing time. Other systems perform video segmentation in the compressed domain using features such as DCT (Discrete Cosine Transform) coefficients, especially the DC component of an encoded frame, macroblock coding modes, motion vector estimates, and bitrate assessment, amongst other approaches. These methods are faster, but still require the video to be partially decompressed to extract the required features.
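By way of illustration only, the simplest of the pixel-based approaches mentioned above, a sum of pixel differences compared against a threshold, might be sketched as follows. The function names, the greyscale list-of-rows frame representation and the threshold value are assumptions made for this sketch and are not taken from any particular existing system. The sketch operates on fully decoded frames, which is precisely why such approaches are costly on less capable devices.

```python
from typing import List, Sequence

# Assumed representation: a decoded greyscale frame as rows of pixel intensities.
Frame = Sequence[Sequence[int]]


def sum_abs_diff(a: Frame, b: Frame) -> int:
    """Sum of absolute pixel differences between two equally sized frames."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def pixel_boundaries(frames: List[Frame], threshold: int) -> List[int]:
    """Return each index i where frame i differs from frame i-1 by more than threshold."""
    return [
        i
        for i in range(1, len(frames))
        if sum_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


# Tiny 2x2 example: two dark frames followed by two bright frames.
dark = [[10, 10], [10, 10]]
bright = [[200, 200], [200, 200]]
print(pixel_boundaries([dark, dark, bright, bright], threshold=100))  # prints [2]
```

Every pixel of every frame is visited, so the work grows with both the resolution and the duration of the video.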
There is one particular method that is capable of using only encoding cost (that is, the number of bits used for encoding a video frame) for video segmentation and, hence, requires no decompression of the video. A cost function based on the encoding cost has to be defined for each type of image frame, such as P-type frames and I-type frames in the MPEG-2 standard. The cost for a sequence of video frames is computed using the cost function. Local maxima of the cost sequence are then compared to a pre-determined threshold to determine the existence of a segment boundary. However, when applied to so-called “home” videos using the recommended threshold value, the method often results in a large number of segments with very short durations.
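The general shape of such an encoding-cost approach might be sketched as follows. The actual cost function of that method is not reproduced here; as a stand-in, this sketch assumes a hypothetical cost equal to a frame's bit count divided by the mean bit count of frames of the same type, and assumes positive bit counts. The names `frame_costs` and `cost_boundaries` and the threshold values are likewise illustrative only.

```python
from typing import Dict, List, Tuple


def frame_costs(frames: List[Tuple[str, int]]) -> List[float]:
    """Hypothetical cost: bits used by a frame / mean bits of frames of its type.

    Each input item is (frame_type, bits), e.g. ("I", 4000) or ("P", 1000).
    Bit counts are assumed to be positive.
    """
    bits_by_type: Dict[str, List[int]] = {}
    for ftype, bits in frames:
        bits_by_type.setdefault(ftype, []).append(bits)
    means = {t: sum(v) / len(v) for t, v in bits_by_type.items()}
    return [bits / means[ftype] for ftype, bits in frames]


def cost_boundaries(costs: List[float], threshold: float) -> List[int]:
    """Indices of local maxima of the cost sequence that exceed the threshold."""
    return [
        i
        for i in range(1, len(costs) - 1)
        if costs[i] > costs[i - 1]
        and costs[i] >= costs[i + 1]
        and costs[i] > threshold
    ]


# Frame sizes only; no pixel data or decompression is involved.
frames = [("I", 4000), ("P", 1000), ("P", 1000), ("P", 5000),
          ("P", 1000), ("I", 4000), ("P", 1000)]
costs = frame_costs(frames)
print(cost_boundaries(costs, threshold=2.0))  # prints [3]
print(cost_boundaries(costs, threshold=0.9))  # prints [3, 5]
```

In this toy data a lower threshold admits an additional, weaker local maximum as a boundary, which loosely mirrors how an ill-suited threshold can yield many short segments.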
Indeed, existing methods do not perform well on “home” videos. This is because the existing methods were mostly designed for segmenting edited video to recover the multiple shots that were originally captured and from which the edited video was formed. The existing methods look for sharp cuts and transitional effects that arise from the editing process and are not present in a typical unedited (raw) home video. In addition, existing methods segment a video into a single level of non-overlapping segments and do not allow a user to explore, annotate and utilise the video at different levels of granularity.