1. Field of Art
The present invention generally relates to the field of digital video, and more specifically, to methods of accurately determining whether a video represents a particular concept.
2. Background of the Invention
Concept labeling of a digital video associates the video with an indicator of a concept that the video or a portion thereof represents, such as “cartoon” or “nature.” Automatic concept labeling of videos stored by video hosting services such as YOUTUBE is of great benefit to users of such systems, permitting the users to more effectively determine whether a given video would be of interest to them by examining or otherwise using the video's concept labels.
Generally, existing automatic concept labeling approaches require a predefined set of concepts, such as a hierarchical taxonomy, specified by a human expert. Under a supervised learning approach, the human expert labels selected videos with the concepts and provides those labeled videos to the system, which then learns the relationships between videos (e.g., video content or metadata) and the concepts. In large corpuses (e.g., tens of millions of videos), such a technique will likely not capture the full richness of the concepts illustrated by the videos. For example, a substantial corpus of user-contributed videos can represent a very large and diverse set of distinct concepts, and that set continues to change as new videos, reflective of new events in the real world, are introduced. Further, given the diversity of concepts in a large corpus, it is likely that some videos will represent concepts that simply do not appear in a manually-specified taxonomy of concepts.
Some conventional techniques for automatic labeling analyze the user-supplied metadata associated with the videos to perform the concept labeling, and therefore depend heavily on the accuracy of the metadata to properly label the videos. Unfortunately, the user-supplied metadata is in many cases incomplete or inaccurate. For example, a user submitting a video might make unintentional errors such as misspellings, or might provide little or no descriptive textual metadata. A user submitting a video might also intentionally provide false metadata, e.g., as “spam” to induce other users to view the video. Thus, labeling techniques that uncritically accept the user-provided metadata, without employing measures that take the potential inaccuracy into account, frequently produce poor-quality results.
Further, certain types of concepts tend to be more readily recognized through analysis of a particular type of video feature. For example, whether or not a given video is representative of a “widescreen” concept is more readily recognized by analyzing visual content features than by analyzing textual features, since whether a video is formatted for widescreen display is inherent in the visual content of the video itself, whereas it is unlikely to be specifically mentioned in the textual metadata. Thus, an analysis technique based on one particular type of feature may work well for recognition of many concepts but will likely fail to accurately recognize other types of concepts.