Extracting key frames (KF) from video is of great interest in many application areas. The main usage scenarios include printing from video (selecting or suggesting the best frames to be printed), video summary (e.g., watching a wedding movie in seconds), video compression (optimizing key-frame quality when encoding), video indexing, video retrieval, and video organization. In general, key frames should show good quality and high semantic interest. However, what exactly constitutes a key frame sometimes depends on the application, and the level of requirement can also differ. For printing still pictures from video, one needs to put a strong focus on image quality. For rapid browsing, one needs to increase semantic representativeness. Key frame extraction can be a feature offered in a camera (including a digital camera, camcorder, and camera phone), in desktop image/video editing/management software, and in online image/video services.
Key frame extraction is not a new problem. However, prior art has focused on sports or news video with constrained structures. Such video conforms to well-defined common structures and characteristics. For instance, in field sports (including soccer, football, baseball, rugby, and cricket), there are two opposing teams and referees in distinct colorful uniforms, an enclosed playing area on grass or artificial turf, field lines and goals, commentator voice and spectator cheering, and finally, on-screen graphics (scoreboard). There is often a small number of canonical “views”: field view, zoom-in, and close-up. Other types of sports, such as racquet sports and basketball, as well as news videos, share a different set of structured characteristics. More importantly, there is unambiguous ground truth as to which frames are the key frames within the given context. In contrast, even themed consumer videos (e.g., wedding, birthday party) do not have the same level of common structures and characteristics, and key frame selection is open to a high level of subjectivity because of observer association, sentimental values, and other factors.
In addition, image quality (contrast, exposure, camera shake) is rarely a concern for sports and news video because of superior imaging equipment and well-controlled imaging conditions. Example systems for extracting key frames from sports and news videos include Avrithis, Y. S., Doulamis, A. D., Doulamis, N. D., and Kollias, S. D., “A Stochastic Framework for Optimal Key Frame Extraction from MPEG Video Databases,” Computer Vision and Image Understanding, 75(1/2), 1999, pp. 3-24; Liu, T., Zhang, H. J., and Qi, F., “A novel video key-frame-extraction algorithm based on perceived motion energy model,” IEEE Trans. Cir. Sys. Video Technol., 13(10), 2003, pp. 1006-1013; Y. Rui, A. Gupta, and A. Acero, “Automatically extracting highlights for TV Baseball programs,” ACM Multimedia 2000, pp. 105-115; and B. Li and M. I. Sezan, “Event Detection and Summarization in Sports Video,” IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL), 2001, pp. 132-140.
Short movie clips captured by a digital camera with video capabilities (a recent product feature) are different. The variety of occasions and situations for consumer videos is unconstrained. Unlike professional videos, there are no special effects, no tightly pre-defined structure, and no professional editing, and a video clip represents only one shot. In that sense, summarizing a short clip is potentially easier than summarizing video recorded by a camcorder because one does not need to perform video shot segmentation. Camera shake is often present, and exposure is often problematic compared to professional videos. Above all, the biggest challenge with consumer video is its unconstrained content and lack of structure. Tong Zhang, in US patent application publication US 2005/0228849, “Intelligent key-frame extraction from a video”, described a method for intelligent key frame extraction for consumer video printing based on a collage of features including accumulative color histogram, color layout differences, camera motion estimation, moving object tracking, face detection, and audio event detection. Specifically, Zhang disclosed a method for extracting a set of key-frames from a video, comprising the steps of: selecting a set of candidate key-frames from among a series of video frames in the video by performing a set of analyses on each video frame, each analysis selected to detect a meaningful content in the video; arranging the candidate key-frames into a set of clusters; and selecting one of the candidate key-frames from each cluster in response to a relative importance of each candidate key-frame.
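The three steps attributed to Zhang above (candidate selection, clustering, and per-cluster selection by relative importance) can be illustrated with a minimal sketch. It is not the implementation disclosed in US 2005/0228849: the `Frame` type, the per-frame `feature` descriptor, the precomputed `importance` score, both thresholds, and the greedy sequential clustering are all hypothetical choices made only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    index: int
    feature: Sequence[float]  # hypothetical descriptor, e.g. a coarse color histogram
    importance: float         # hypothetical score combining the per-frame analyses


def select_candidates(frames: List[Frame], min_importance: float) -> List[Frame]:
    """Step 1 (assumed form): keep frames whose analyses flag meaningful content."""
    return [f for f in frames if f.importance >= min_importance]


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two descriptors (an illustrative choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_candidates(candidates: List[Frame], max_distance: float) -> List[List[Frame]]:
    """Step 2 (assumed form): greedy sequential clustering in temporal order.

    A candidate joins the current cluster if its descriptor is close to that
    cluster's first member; otherwise it starts a new cluster.
    """
    clusters: List[List[Frame]] = []
    for frame in candidates:
        if clusters and _distance(clusters[-1][0].feature, frame.feature) <= max_distance:
            clusters[-1].append(frame)
        else:
            clusters.append([frame])
    return clusters


def extract_key_frames(frames: List[Frame],
                       min_importance: float = 0.5,
                       max_distance: float = 0.3) -> List[Frame]:
    """Step 3: pick the most important candidate from each cluster."""
    candidates = select_candidates(frames, min_importance)
    clusters = cluster_candidates(candidates, max_distance)
    return [max(cluster, key=lambda f: f.importance) for cluster in clusters]
```

In this sketch the number of returned key frames equals the number of clusters, so it is controlled indirectly through the two hypothetical thresholds rather than set directly.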
Because the application of key frame extraction can vary significantly, for example, in terms of the desired number of key frames, it is often desirable to implement a flexible framework capable of producing a scalable video representation. The optimal number of relevant key frames is highly dependent on the video complexity. Complexity is a function of many features: camera motion, scene content, action and interaction between moving objects, image quality (IQ) due to lighting and camera settings, and so on. The video duration is also a parameter that could drive the video complexity: a longer movie clip is likely to contain more events and therefore demands more key frames.
One also needs to define the best criteria of representativeness, and then determine what features can be used to obtain the ‘best’ key frames given the input data. Different features, such as those used in US 2005/0228849, vary significantly in terms of their effectiveness and computational cost. It is desirable to use as few features as possible to achieve reasonable performance at reasonable speed.
Furthermore, because video clips taken by consumers are unstructured, one should rely only on cues related to the cameraman's general intents, i.e., camera and object motion descriptors. Rules applicable only to specific content have limited use and require advance information about the video content.
Consequently, it would be desirable to design a system that is reliable and efficient regardless of the image content.