Video key-frame extraction algorithms select a subset of the most representative frames from an original video. Key-frame extraction finds applications in several broad areas of video processing research such as video summarization, creating “chapter titles” in DVDs, and producing “video action prints.”
Video key-frame extraction is an active research area, and many approaches for extracting key frames from an original video have been proposed. Conventional key-frame extraction approaches can be loosely divided into two groups: (i) shot-based, and (ii) segment-based. In shot-based video key-frame extraction, the shots of the original video are first detected, and then one or more key frames are extracted for each shot. For example, Uchihashi et al., in the article “Summarizing video using a shot importance measure and a frame-packing algorithm” (IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3041-3044, 1999), teach segmenting a video into its component shots. Unimportant shots are then discarded using a measure of shot importance, and key frames are generated for each of the remaining important shots.
Another method taught by Zhang et al. in the article “An integrated system for content-based video retrieval and browsing” (Pattern Recognition, pp. 643-658, 1997) segments a video into shots and determines key frames for each shot based on feature and content information.
Arman et al., in the article “Content-based browsing of video sequences” (Proc. 2nd ACM International Conference on Multimedia, pp. 97-103, 1994) teach using video shots as the basic building blocks. After shot detection, the tenth frame of each shot is selected as the key frame.
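By way of illustration only, the fixed-position selection rule described above can be sketched as follows. The function name, the representation of shots as a list of starting frame indices, and the fallback for shots shorter than ten frames are assumptions made for this sketch and are not taken from the cited article.

```python
def tenth_frame_key_frames(shot_starts, num_frames, offset=9):
    """Return one key-frame index per shot: the tenth frame (0-based offset 9).

    shot_starts: sorted frame indices at which each detected shot begins.
    num_frames:  total number of frames in the video.
    """
    # Each shot ends where the next one starts (or at the end of the video).
    shot_ends = list(shot_starts[1:]) + [num_frames]
    key_frames = []
    for start, end in zip(shot_starts, shot_ends):
        # Assumption: a shot shorter than ten frames falls back to its last frame.
        key_frames.append(min(start + offset, end - 1))
    return key_frames


print(tenth_frame_key_frames([0, 120, 125, 300], 400))
# [9, 124, 134, 309]
```

The rule ignores frame content entirely, so the quality of the result depends wholly on the preceding shot-detection step.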
Another method taught by Wang et al., in the article “Video summarization by redundancy removing and content ranking” (Proc. 15th International Conference on Multimedia, pp. 577-580, 2007), detects shot boundaries using color-histogram and optical-flow motion features, and extracts key frames from each shot using a leader-follower clustering algorithm. A video summary is then generated by key-frame clustering and repetitive segment detection.
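For readers unfamiliar with leader-follower clustering, a minimal generic sketch is given below. It is not the implementation of the cited article: the per-frame feature vectors, the distance threshold, the update rate, and the choice of the first member of each cluster as its key frame are all assumptions made for illustration.

```python
import math


def leader_follower(features, threshold, rate=0.1):
    """Cluster feature vectors in a single pass, in frame order."""
    centers = []  # running cluster centers
    labels = []   # cluster index assigned to each frame
    for x in features:
        # Find the nearest existing center, if there is one.
        best, best_dist = -1, math.inf
        for k, c in enumerate(centers):
            d = math.dist(x, c)
            if d < best_dist:
                best, best_dist = k, d
        if best >= 0 and best_dist <= threshold:
            # Follower: pull the winning center toward the sample.
            centers[best] = [ci + rate * (xi - ci)
                             for ci, xi in zip(centers[best], x)]
            labels.append(best)
        else:
            # Leader: the sample starts a new cluster.
            centers.append(list(x))
            labels.append(len(centers) - 1)
    return labels, centers


def first_member_per_cluster(labels):
    """Hypothetical key-frame rule: the first frame assigned to each cluster."""
    first = {}
    for frame, k in enumerate(labels):
        first.setdefault(k, frame)
    return [first[k] for k in sorted(first)]


frames = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.05, 0.0]]
labels, _ = leader_follower(frames, threshold=0.5)
print(labels)                            # [0, 0, 1, 1, 0]
print(first_member_per_cluster(labels))  # [0, 2]
```

The number of clusters is not fixed in advance; it follows from the threshold, so the outcome is sensitive to both the threshold and the chosen features.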
In segment-based video key-frame extraction approaches, a video is segmented into higher-level video components, where each segment or component could be a scene, an event, a set of one or more shots, or even the entire video sequence. Representative frame(s) from each segment are then selected as the key frames.
In U.S. Pat. No. 7,110,458, entitled “Method for summarizing a video using motion descriptors”, Divakaran et al. teach a method for forming a video summary that measures an intensity of motion activity in a compressed video and uses the intensity information to partition the video into segments. Key frames are then selected from each segment. The selected key frames are concatenated in temporal order to form a summary of the video.
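One simple way to partition a video by motion activity is to cut it so that each segment carries roughly the same cumulative activity, as sketched below. This is an assumed partitioning rule chosen for illustration; the description above does not state which rule the cited patent uses. The per-frame `activity` values and the function name are likewise hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_by_activity(activity, num_segments):
    """Split frames into segments of roughly equal cumulative motion activity.

    activity: non-empty list of non-negative per-frame activity values.
    Returns a list of (start, end) frame ranges, end exclusive.
    """
    cumulative = list(accumulate(activity))
    total = cumulative[-1]
    boundaries = [0]
    for k in range(1, num_segments):
        target = total * k / num_segments
        # Cut just after the first frame at which the running sum reaches the target.
        boundaries.append(bisect_left(cumulative, target) + 1)
    boundaries.append(len(activity))
    # Note: a single very active frame can produce an empty segment.
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_segments)]


segments = partition_by_activity([1, 1, 1, 1, 2, 2, 4, 4, 4, 4], 2)
print(segments)  # [(0, 7), (7, 10)]
```

With this rule, high-activity passages are divided into shorter segments than low-activity passages, so they contribute more key frames per unit of time.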
Uchihashi et al., in the article “Video manga: generating semantically meaningful video summaries” (Proc. 7th ACM International Conference on Multimedia, pp. 383-392, 1999) use a tree-structured representation to cluster all the frames of the video into a predefined number of clusters. This information is then exploited to segment the video. The relevant key frames for each segment are selected based on the relative importance of video segments.
Rasheed et al., in the article “Detection and representation of scenes in videos” (IEEE Multimedia, pp. 1097-1105, 2005) construct a weighted undirected graph called a “shot similarity graph” (SSG) for clustering shots into scenes. The content of each scene is described by selecting one representative frame from the corresponding scene as a scene key-frame.
Girgensohn et al., in the article “Time-constrained keyframe selection technique” (IEEE International Conference on Multimedia Computing Systems, pp. 756-761, 1999) use a hierarchical clustering algorithm to cluster similar frames. Key frames are extracted by selecting one frame from each cluster.
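The general idea of clustering frames hierarchically and taking one frame per cluster can be sketched as follows. The sketch omits the time constraints named in the article's title, and the complete-linkage criterion, the target cluster count, and the nearest-to-centroid selection rule are assumptions made for illustration.

```python
import math


def agglomerative_key_frames(features, num_clusters):
    """Merge frames bottom-up, then pick the frame nearest each cluster centroid."""
    clusters = [[i] for i in range(len(features))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(features[i], features[j]) for i in a for j in b)

    while len(clusters) > max(num_clusters, 1):
        # Find the closest pair of clusters (p < q) and merge q into p.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters.pop(q)
        clusters[p].extend(merged)

    key_frames = []
    for members in clusters:
        dim = len(features[members[0]])
        centroid = [sum(features[i][d] for i in members) / len(members)
                    for d in range(dim)]
        key_frames.append(
            min(members, key=lambda i: math.dist(features[i], centroid)))
    return sorted(key_frames)


frames = [[0.0], [1.0], [2.0], [10.0], [11.0], [13.0], [30.0]]
print(agglomerative_key_frames(frames, 3))  # [1, 4, 6]
```

This brute-force version recomputes every linkage on each pass and is suitable only for short frame sequences.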
Another method taught by Doulamis et al., in the article “A fuzzy video content representation for video summarization and content-based retrieval” (Signal Processing, pp. 1049-1067, 2000), extracts key frames by using a genetic algorithm to minimize a cross-correlation criterion among the video frames. The correlation is computed from several features that are extracted by color and motion segmentation and represented as fuzzy feature vectors.
All of the above methods depend on the accuracy of the feature selection and clustering algorithms used for shot detection and video segmentation. Furthermore, these approaches are vulnerable to noise and adapt poorly to the data. Thus, there exists a need for a video key-frame extraction framework that is data adaptive, robust to noise, and less sensitive to feature selection.