It has become increasingly important for users to be able to find digital video items, such as video clips and movies, in which they are interested. However, due to the visual nature of digital video items, searching for digital video items of interest tends to be more difficult than searching for textual items of interest, such as documents.
Many popular video search sites on the Internet present search interfaces that allow a user to obtain a set of potentially-interesting video items based on a keyword search against metadata, such as video titles. Once the set of potentially-interesting video items has been obtained, the user is presented with an interface designed to help the user decide which of the potentially-interesting videos will actually be most interesting to the user. An example of such an interface is depicted in FIG. 1, where, for each video search result, part of its metadata (title, description, duration) and a representative image are shown.
Referring to FIG. 1, it depicts three images that correspond to the top three videos for the query “new york”. Adjacent to each image is metadata relating to the corresponding video, including the title of the video. Below the listing of the top three search results are two additional images. These two additional images are hand-picked frames from the video listed in the top-ranked search result.
While the metadata information about the videos may help the user to select the result in which the user is interested, research on saliency has shown that images are important eye-catchers. The image by which a video is represented within the search results is intended to give a direct preview of what the video is about. Consequently, selecting the most representative and engaging image is essential to driving user video consumption. The images depicted in FIG. 1 are examples of how badly-selected images may reduce video consumption. Specifically, the images shown for the top three search results are identical, and convey nothing about the specific content of the videos.
Typical videos are encoded with frame rates equal to or higher than 15 frames per second, which means that for a relatively short video of 30 seconds, at least 450 images are candidates to represent it. Publishers currently use a variety of techniques to select these images. For example, in some cases, the frame that is used as an image to represent the video (the “representative image”) is manually selected by an editor. Unfortunately, while editors may be able to select the frames that would appeal to the greatest percentage of users, editor-executed selection is time-consuming and cannot easily be scaled to large-volume situations.
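The candidate-pool arithmetic above can be sketched as follows; this is an illustrative calculation only, and the function name is hypothetical rather than part of any described system:

```python
def candidate_frame_count(duration_s: float, fps: float = 15.0) -> int:
    """Number of frames that could serve as the representative image.

    With fps as a lower bound (videos are encoded at 15 fps or higher),
    this is the minimum size of the candidate pool.
    """
    return int(duration_s * fps)

# A 30-second video at 15 frames per second yields 450 candidate frames.
print(candidate_frame_count(30))  # 450
```

At typical encoding rates of 24 or 30 fps the pool is proportionally larger, which is why manual review of every candidate does not scale.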
To avoid the costs associated with manual selection, some publishers choose one frame at random from the whole duration of the video. However, a randomly-selected frame has a relatively low likelihood of being interesting to users.
Rather than select a frame randomly, the frame at a predetermined offset may be selected (e.g., the frame that is 15% into the video). To select a frame at a particular offset, the frame extraction process may be embedded in the video encoding process. Specifically, while encoding a video, the encoder determines the length of the video, and the image at the determined video offset is extracted and included as part of the asset metadata. Selecting frames at pre-defined offsets may yield better results on average than a totally random selection, but may still result in the selection of non-representative, uninformative and/or uninteresting frames.
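The offset-based selection described above reduces to computing a frame index from the video's duration and frame rate. A minimal sketch, assuming a fixed 15% offset as in the example (the fraction is a publisher choice, and the function name is hypothetical):

```python
def representative_frame_index(duration_s: float, fps: float,
                               offset_fraction: float = 0.15) -> int:
    """Index of the frame at a fixed fractional offset into the video.

    During encoding, the encoder knows duration_s and fps, so this index
    can be computed and the corresponding frame extracted into the
    asset metadata.
    """
    total_frames = int(duration_s * fps)
    index = int(total_frames * offset_fraction)
    # Clamp so very short or zero-length videos still yield a valid index.
    return min(max(index, 0), max(total_frames - 1, 0))

# A 30-second clip at 15 fps has 450 frames; 15% in is frame 67.
print(representative_frame_index(30, 15))  # 67
```

The same routine with a randomized index instead of a fixed fraction would implement the random-selection approach; the fixed offset simply skips past opening titles and fade-ins that often occupy the start of a video.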
Another technique is to use, as the representative image, an external image that identifies a generic concept related to the video (such as the provider logo). Such a technique may further other goals, such as increasing the market's awareness of a publisher, but is not likely to entice a user to watch the particular video with which the external image is used.
To avoid the cost of manual selection, and yet achieve better results than random selection or external images, some techniques have been developed for selecting a frame based on some characteristic of the image in the frame. For example, U.S. Pat. No. 7,826,657 describes a technique for selecting a representative image based on color characteristics of the images. Techniques that use a specific characteristic as the basis for selecting a representative frame for a video yield good results when the characteristic proves to be a good indicator of “quality” or “representativeness”. However, it is unlikely that any characteristic will prove to be a good indicator of which frames are best to represent all videos, to all users, in all situations.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.