The present invention relates to a method and a system for video summarization, and in particular to a method and a system for key frame extraction and shot boundary detection.
Recent developments in personal computing and communications have created new classes of devices such as hand-held computers, personal digital assistants (PDAs), smart phones, automotive computing devices, and computers that allow users more access to information.
Many device manufacturers, including cell phone, PDA, and hand-held computer manufacturers, are working to expand the functionality of their devices. The devices are being given the capability to serve as calendar tools, address books, paging devices, global positioning devices, travel and mapping tools, email clients, and Web browsers. As a result, many new businesses are forming around applications that bring all kinds of information to these devices. However, because many of these devices have limited display size, storage, processing power, and network access, designing applications that allow them to access, store and process information poses new challenges.
Concurrent with these developments, recent advances in storage, acquisition, and networking technologies have resulted in large amounts of rich multimedia content. As a result, there is a growing mismatch between the rich content that is available and the capabilities of the client devices to access and process it.
In this respect, so-called key-frame based video summarization is an efficient way to manage and transmit video information. This representation can be used within the MPEG-7 application Universal Multimedia Access, as described in C. Christopoulos et al., “MPEG-7 application: Universal access through content repurposing and media conversion”, Seoul, Korea, March 1999, ISO/IEC/JTC1/SC29/WG11 M4433, in order to adapt video data to the client devices.
For audio-visual (AV) material, key frame extraction could be used in order to adapt to the bandwidth and computational capabilities of the clients. For example, low-bandwidth or low-capability clients might request only the audio information to be delivered, or only the audio combined with some key frames. High-bandwidth, computationally capable clients can request the whole AV material. Another application is fast browsing of digital video. Skipping video frames at fixed intervals reduces the video viewing time. However, this merely gives a random sample of the overall video.
Below the following definitions will be used:
Shot
A shot is defined as a sequence of frames captured by one camera in a single continuous action in time and space, see also J. Monaco, “How to read a film,” Oxford press, 1981.
Shot Boundary
There are a number of different types of boundaries between shots. A cut is an abrupt shot change that occurs in a single frame. A fade is a gradual change in brightness either ending in a black frame (fade-out) or starting from a black frame (fade-in). A dissolve occurs when the images of the first shot become dimmer and the images of the second shot become brighter, with frames within the transition showing one image superimposed on the other. A wipe occurs when pixels from the second shot replace those of the first shot in a regular pattern, such as a line moving from the left edge of the frames.
Key Frame
Key frames are defined inside every shot. They represent, with a small number of frames, the most relevant information content of the shot according to some subjective or objective measurement.
Conventional video summarization consists of two steps:
1. Shot boundary detection.
2. Key-frame extraction.
Many attributes of the frames, such as colour, motion and shape, have been used for video summarization. The standard algorithm for shot boundary detection in video summarization is based on histograms. Histogram-based techniques are shown to be robust and effective, as described in A. Smeulders and R. Jain, “Image databases and Multi-Media search”, Singapore, 1988, and in J. S. Boreczky and L. A. Rowe, “Comparison of Video Shot Boundary Detection Techniques”, Storage and Retrieval for Image and Video Databases IV, Proc. of IS&T/SPIE 1996 Int'l Symp. on Elec. Imaging: Science and Technology, San Jose, Calif., February 1996.
Thus, the colour histograms of two images are computed. If the Euclidean distance between the two histograms is above a certain threshold, a shot boundary is assumed. However, no information about motion is used. Therefore, this technique has drawbacks in scenes with camera and object motion.
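The histogram comparison described above can be illustrated with a minimal sketch. It is not taken from the cited references: the function names, the bin count, the threshold value, and the representation of a frame as a non-empty flat list of 8-bit intensity values (a single channel rather than a full colour histogram) are all assumptions made for brevity.

```python
import math

def intensity_histogram(frame, bins=8, max_value=256):
    """Normalized histogram of a frame given as a non-empty flat list of pixel values."""
    hist = [0] * bins
    for pixel in frame:
        hist[pixel * bins // max_value] += 1
    return [count / len(frame) for count in hist]

def histogram_distance(h1, h2):
    """Euclidean distance between two histograms with the same number of bins."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def detect_shot_boundaries(frames, threshold=0.5, bins=8):
    """Return the indices i at which frame i is assumed to start a new shot.

    A boundary is assumed when the distance between the histograms of
    frames i-1 and i exceeds the (hypothetical) threshold.
    """
    hists = [intensity_histogram(f, bins) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]
```

Because only the distribution of pixel values is compared, two frames with the same content displaced by camera or object motion can produce nearly identical histograms, while strong motion within one shot may still change the histogram enough to exceed the threshold. This is the limitation noted above.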
Furthermore, key frames must be extracted from the different shots in order to provide a video summary. Conventional key frame extraction algorithms are described, for example, in Wayne Wolf, “Key frame selection by motion analysis”, in Proceedings, ICASSP 96, wherein the optical flow is used in order to identify local minima of motion in a shot. These local minima of motion are then determined to correspond to key frames. In W. Xiong, J. C. M. Lee, and R. H. Ma, “Automatic video data structuring through shot partitioning and key-frame selection”, Machine vision and Applications, vol. 10, no. 2, pp. 51-65, 1997, a seek-and-spread algorithm is used in which the previous key-frame serves as a reference for the extraction of the next key-frame. Also, in R. L. Lagendijk, A. Hanjalic, M. Ceccarelli, M. Soletic, and E. Persoon, “Visual search in a SMASH system”, Proceedings of IEEE ICIP 97, pp. 671-674, 1997, a cumulative action measure of shots is used in order to compute the number and the position of key-frames allocated to each shot. The action between two frames is computed via a histogram difference. One advantage of this method is that the number of key-frames can be pre-specified.
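As a rough illustration of the first of these approaches, the selection of key frames at local minima of motion can be sketched as follows. This is not the cited algorithm itself: the per-frame motion values are assumed to have been computed beforehand (for example from optical flow magnitudes), and the function name and the handling of flat regions are hypothetical choices.

```python
def local_motion_minima(motion):
    """Return indices of local minima in a per-frame motion measure.

    Frame i is reported when its motion is strictly lower than that of
    frame i-1 and not higher than that of frame i+1, so a flat valley is
    reported once, at its first frame. The first and last frames are
    never reported because they lack a neighbour on one side.
    """
    return [i for i in range(1, len(motion) - 1)
            if motion[i] < motion[i - 1] and motion[i] <= motion[i + 1]]
```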
It is an object of the present invention to provide a method and a system for shot boundary detection and key frame extraction, which can be used for video summarization and which is robust against camera and object motion.
This object and others are obtained by a method and a system for key frame extraction, where a list of feature points is created. The list keeps track of individual feature points between consecutive frames of a video sequence.
When many new feature points are entered on the list, or many feature points are removed from the list, between two consecutive frames, a shot boundary is determined to have occurred. A key frame is then selected between two shot boundaries as a frame at which no or few feature points are entered on or lost from the list.
By using such a method for extracting key frames from a video sequence, motion in the picture and/or camera motion can be taken into account. The key frame extraction algorithm will therefore be more robust against camera motion.
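The selection rule described above can be sketched under some simplifying assumptions. Each frame is represented here by the set of identifiers of the feature points currently on the list. The feature detection and tracking that would produce these sets is not shown. The function names, the `boundary_ratio` threshold, and the choice of the single frame with the fewest list changes as the key frame of each shot are hypothetical illustrations, not details given in the text.

```python
def list_changes(tracks):
    """Count, for each frame, the feature points entered or lost since the previous frame."""
    changes = [0]  # the first frame has no predecessor
    for prev, cur in zip(tracks, tracks[1:]):
        changes.append(len(cur - prev) + len(prev - cur))
    return changes

def summarize(tracks, boundary_ratio=0.5):
    """Return (shot_boundaries, key_frames) for a sequence of feature-point ID sets.

    A boundary is declared at frame i when the number of entered plus lost
    points, relative to the larger of the two list sizes, exceeds
    boundary_ratio (an assumed threshold).
    """
    if not tracks:
        return [], []
    changes = list_changes(tracks)
    boundaries = []
    for i in range(1, len(tracks)):
        size = max(len(tracks[i - 1]), len(tracks[i]), 1)
        if changes[i] / size > boundary_ratio:
            boundaries.append(i)
    key_frames = []
    starts = [0] + boundaries
    ends = boundaries + [len(tracks)]
    for start, end in zip(starts, ends):
        # Skip the first frame of a multi-frame shot: its change count
        # reflects the boundary itself rather than activity within the shot.
        first = start + 1 if end - start > 1 else start
        key_frames.append(min(range(first, end), key=lambda i: changes[i]))
    return boundaries, key_frames
```

For example, with six frames whose lists are `{1,2,3,4}`, `{1,2,3,4}`, `{1,2,3,5}`, `{10,11,12,13}`, `{10,11,12,13}`, `{10,11,12,14}`, the sketch reports a boundary at frame 3 and selects frames 1 and 4 as key frames, since the list is unchanged at those frames.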