The World Wide Web ("WWW") comprises millions of documents (web pages) formatted in Hypertext Markup Language ("HTML"), which can be accessed by thousands of users through the Internet. To access a web page, its Uniform Resource Locator ("URL") must be known. Search engines index web pages and make those URLs available to users of the WWW. To generate an index, a search engine, such as Compaq Computer Corporation's ALTAVISTA search engine, may search the WWW for new web pages using a web crawler. The search engine selects relevant information from a web page after analyzing the content of the web page and saves the relevant information and the web page's URL in the index.
Web pages also contain links to other documents on the WWW, for example, text documents and image files. By searching web pages for links to image files, a search engine connected to the WWW, such as Compaq Computer Corporation's ALTAVISTA Photo Finder, provides an index of image files located on the WWW. The index contains a URL and a representative image from the image file.
Web pages also contain links to multimedia files, such as video and audio files. By searching web pages for links to multimedia files, a multimedia search engine connected to the WWW, such as Scour Inc.'s SCOUR.NET, provides an index of multimedia files located on the WWW. SCOUR.NET's index for video files provides text describing the contents of the video file and the URL for the multimedia file. Another multimedia search engine, WebSEEK, summarizes a video file by generating a highly compressed version of the video file. The video file is summarized by selecting a series of frames from shots in the video file and repackaging the frames as an animated GIF file. WebSEEK also generates a color histogram from each shot in the video to automatically classify the video file and allow content-based visual queries. It is described in John R. Smith et al., "An Image and Video Search Engine for the World-Wide Web", Symposium on Electronic Imaging: Science and Technology - Storage and Retrieval for Image and Video Databases V, San Jose, Calif., February 1997, IS&T/SPIE.
Analyzing the contents of digital video files linked to web pages is difficult because of the low quality and low resolution of the highly compressed digital video files.
The present invention provides a mechanism for efficiently indexing video files and has particular application to indexing video files located by a search engine web crawler. A key frame, that is, one frame representative of a video file, is extracted from the sequence of frames in the video file. The sequence of frames may include multiple scenes or shots, for example, continuous motions relative to a camera separated by transitions, cuts, fades and dissolves. To extract a key frame, shot boundaries are detected in the sequence of frames, a key shot is selected from the shots within the detected shot boundaries, and the key frame is determined in the selected key shot.
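The three steps above (detect shot boundaries, select a key shot, determine the key frame) can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the representation of a shot as a half-open `(start, end)` range of frame indices, and the caller-supplied `shot_score` and `frame_score` functions are assumptions made for the example, not details taken from the text.

```python
def split_into_shots(num_frames, boundaries):
    """Turn boundary frame indices into half-open (start, end) shot ranges.

    Assumption: a boundary index k means that a new shot starts at frame k.
    """
    starts = [0] + sorted(b for b in set(boundaries) if 0 < b < num_frames)
    ends = starts[1:] + [num_frames]
    return list(zip(starts, ends))


def extract_key_frame(num_frames, boundaries, shot_score, frame_score):
    """Pick the best-scoring shot, then the best-scoring frame inside it.

    shot_score(start, end) and frame_score(index) are hypothetical,
    caller-supplied scoring functions; higher means "more representative".
    """
    if num_frames <= 0:
        raise ValueError("the video must contain at least one frame")
    shots = split_into_shots(num_frames, boundaries)
    start, end = max(shots, key=lambda shot: shot_score(*shot))
    return max(range(start, end), key=frame_score)
```

For instance, with ten frames and boundaries at frames 3 and 7, scoring shots by length and frames by closeness to frame 5 selects the shot `(3, 7)` and then frame 5 within it.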
In preferred embodiments, the shot boundaries are detected based on the result of a first test or a second test, both of which depend on the backward and forward discontinuities of pixel-wise differences between successive frames and on the distribution of a luminance histogram for each frame. The first test determines whether the greater of the forward discontinuity and the backward discontinuity of a frame, in relation to the luminance histogram of the frame, is greater than a first threshold value. The second test determines whether the greater of the forward discontinuity and the backward discontinuity in pixel-wise frame difference, in relation to the luminance histogram of the frame, is greater than a second, lesser threshold value, and whether the lesser of the two discontinuities, in relation to the luminance histogram, is less than a third threshold value that is less than the second threshold value.
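The two tests can be illustrated with a short sketch. Several details here are assumptions made for the example rather than definitions from the text: the backward and forward discontinuities are taken as the change in mean absolute pixel difference relative to the previous and next frame pair, "in relation to the luminance histogram" is read as division by the histogram entropy, and the threshold values `t1 > t2 > t3` are placeholders.

```python
import math


def mean_abs_diff(a, b):
    """Pixel-wise difference between two equally sized luminance frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram_spread(frame, bins=16):
    """Entropy (in bits) of the luminance histogram of a frame.

    Assumption: entropy stands in for the "distribution" of the histogram.
    The result is clamped to 1 bit so nearly flat frames do not inflate ratios.
    """
    counts = [0] * bins
    for v in frame:
        counts[min(v * bins // 256, bins - 1)] += 1
    n = len(frame)
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c)
    return max(entropy, 1.0)


def is_shot_boundary(d_back, d_forw, spread, t1, t2, t3):
    """Apply the first and second tests; either one signals a boundary."""
    hi = max(d_back, d_forw) / spread
    lo = min(d_back, d_forw) / spread
    test1 = hi > t1                # large discontinuity on at least one side
    test2 = hi > t2 and lo < t3    # moderate on one side, small on the other
    return test1 or test2


def detect_boundaries(frames, t1=40.0, t2=20.0, t3=5.0):
    """Return indices k at which frame k starts a new shot (placeholder thresholds)."""
    n = len(frames)
    # diffs[k] is the difference between frame k and frame k - 1.
    diffs = [0.0] + [mean_abs_diff(frames[k], frames[k - 1]) for k in range(1, n)]
    boundaries = []
    for k in range(1, n):
        d_back = diffs[k] - diffs[k - 1]
        d_forw = diffs[k] - (diffs[k + 1] if k + 1 < n else 0.0)
        if is_shot_boundary(d_back, d_forw, histogram_spread(frames[k]), t1, t2, t3):
            boundaries.append(k)
    return boundaries
```

Frames are flat lists of 8-bit luminance values in this sketch. A hard cut between three dark frames and three bright frames produces one large difference at the cut, which the first test flags.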
The key shot is selected based on the level of skin color pixels, motion between frames, spatial activity between frames and shot length. The key frame is selected based on frame activity: spatial activity within a frame and motion between frames.
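One way to combine these measures is a weighted sum, sketched below. The text names the measures but not how they are combined, so the weighted sum, the weight values, the `ShotStats` container and the function names are all assumptions for illustration. The sign of each weight decides whether a measure counts for or against a candidate.

```python
from dataclasses import dataclass


@dataclass
class ShotStats:
    """Hypothetical per-shot measurements, computed elsewhere."""
    skin_ratio: float        # fraction of skin-colored pixels in the shot
    motion: float            # average motion between frames
    spatial_activity: float  # average spatial activity
    length: int              # number of frames in the shot


def shot_score(shot, weights):
    """Weighted sum of the four shot measures (weights are assumptions)."""
    return (weights["skin"] * shot.skin_ratio
            + weights["motion"] * shot.motion
            + weights["spatial"] * shot.spatial_activity
            + weights["length"] * shot.length)


def select_key_shot(shots, weights):
    """Return the index of the highest-scoring shot."""
    return max(range(len(shots)), key=lambda i: shot_score(shots[i], weights))


def select_key_frame(frame_stats, w_spatial=1.0, w_motion=1.0):
    """Return the index of the frame with the highest weighted frame activity.

    frame_stats is a list of (spatial_activity, motion) pairs, one per frame.
    """
    return max(range(len(frame_stats)),
               key=lambda i: w_spatial * frame_stats[i][0] + w_motion * frame_stats[i][1])
```

A negative weight, for example `w_motion=-1.0`, makes the sketch prefer still frames over frames with high motion; which direction is appropriate depends on the embodiment.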