Technical Field
The present disclosure relates to the manipulation of electronic media content, including electronic media content available over the Internet. More particularly and without limitation, the present disclosure relates to systems and methods for the identification, ranking, and display of available or recommended electronic media content on the Internet, based on speech recognition.
Background Information
On the Internet, people usually discover and view multimedia and other electronic media content in one or more fundamentally different ways: keyword searching, browsing collections, selecting related content, and/or link sharing. One common way to browse a video collection is to display a list of images that the user can browse and click to watch the videos. A user interface may be provided to allow the user to narrow the displayed list by one or more criteria, such as by category, television show, tag, date produced, source, or popularity. User interfaces may also provide the ability for users to search for videos, or other electronic media.
The performance of video search engines can be evaluated in terms of precision, the fraction of retrieved videos that are relevant to a user query, and recall, the fraction of relevant videos that are actually retrieved. The traditional way of enabling searches for video content is based on metadata for a video, such as title, description, and tags. There are two drawbacks with this approach. First, the metadata is usually quite limited and provides only a very brief summary of a video. Second, the metadata of a video may not be reliable or complete, especially for videos from a user-generated video site, such as YouTube. For example, many videos from YouTube are in fact spam videos having metadata that has nothing to do with the content of the video.
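The evaluation described above is conventionally expressed as precision and recall. The following is a minimal illustrative sketch only, assuming that the retrieved results and the set of relevant videos are both available as lists of video identifiers. The function name precision_recall and the identifiers "v1" through "v7" are hypothetical and are not part of the disclosure.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the videos returned by the search engine
    relevant:  identifiers of the videos that actually satisfy the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant videos that were retrieved

    # Precision: share of retrieved videos that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant videos that were retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: four videos returned, three videos truly relevant.
retrieved = ["v1", "v2", "v3", "v4"]
relevant = ["v2", "v4", "v7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this hypothetical query, two of the four retrieved videos are relevant, and two of the three relevant videos are retrieved. Sparse or misleading metadata tends to lower one or both of these measures, which is the limitation discussed in this section.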
Speech-to-text techniques may be used to augment the metadata of a video and to improve recall from a collection of videos. Also, a popularity and/or collaborative filter may be used to improve precision. In addition, visual analysis to identify people or objects contained within a video can be used in some cases to improve both recall and precision. However, these techniques also have drawbacks. For example, analyzing the visual content of a video to identify people and objects is computationally intensive and often inaccurate. Also, using only visual analysis to identify people in a video can lead to unreliable or incomplete results because the video may contain still or moving images of a person with a voice-over by a narrator.
As a result, users of the Internet are often unable to find desired media content, and they often view content that they do not necessarily appreciate. Undesirable content can lead to users navigating away from content sites, which may result in an attendant decrease in advertising revenue. As a corollary, the successful display and recommendation of electronic media content can be useful in attracting and retaining Internet users, thereby increasing online advertising revenue.
Accordingly, there is a need for improved systems and methods for manipulating electronic media content, including available or recommended electronic media content on the Internet. Moreover, there is a need for improved systems and methods for the identification, ranking, and/or manipulation of available or recommended electronic media content on the Internet, based on speech recognition.