The growth of online video is remarkable. Comscore estimates that over 75% of US internet users view online video. These users spend an average of 235 minutes per month viewing video, accounting for a total of 5 billion videos watched.
The content type typically determines the viewing experience. For example, premium content offers a rich and interactive viewing experience to the user. Metadata that accompanies the content, such as story summary, cast and director profiles, ratings, user comments and chaptering, contributes to the overall experience. Premium content available on the web is usually purchased, and is typically 30 minutes or longer in duration.
In contrast, free content is mostly user generated and offers a “no frills” viewing experience. Text, occasional thumbnails, user ratings and links are part of this viewing experience. Viewing is typically restricted to “start-to-end” playback with “blind” seeking (no visual guidance to content). The average length of a free content stream is 1-5 minutes, with Comscore estimating the average duration of an online video to be 2.9 minutes.
Given that the vast majority of online content is free (and user generated), there is a growing need to improve the current “no frills” viewing experience of free content.
The enhancement of the online video experience is a goal shared by many. As a result, many solutions have been developed. The solution of choice for premium content is metadata. Metadata is information related to content that can appear as text, images, video or audio to provide story summary, actor and director profiles, deleted scenes and chaptering that allows customized playback. Additionally, metadata is complemented by related links, user comments and ratings. Metadata adds a descriptive and interactive layer to content playback. Content creators, distributors, and companies in between have recognized its value, and have made metadata an integral part of the premium content offering.
Unfortunately, the metadata creation process for premium content does not scale for free content, due to its dependency on manual creation. Manual processing of user generated free content is an economically unrealistic proposition, so automated methods are needed. These methods may act on audio and video aspects of the content to extract meaningful information. They can be thought of as producing machine-generated metadata.
The automated methods fall into three categories: audio/video analysis, codec technology and industry standards.
The category that has received the most attention from academia is audio/video analysis. These methods analyze the audio and video data of the content and attempt to extract key information that is meaningful to the user. Compressed domain video analysis, motion analysis, object segmentation, text detection, spectrum analysis and speech-to-text conversion are some of the techniques used to extract key information. Most methods provide good accuracy, but their complexity limits their use in real-time applications and on resource-constrained consumer devices. Therefore, most audio/video analysis is performed offline.
Codec technology offers an alternative automated metadata generation process for free content. In this case, key information regarding the content is encapsulated within the compressed stream during the encoding process. The playback process extracts this information and presents it alongside the content. Codec standards such as MPEG-2, MPEG-4 Part 2, AVC (H.264), VC-1 and other advanced codecs define special profiles to support this capability. Unfortunately, this method adds a high degree of complexity to the encoding and decoding process, which has restricted its wide-scale use.
The third method is the use of industry standards. Standards such as MPEG-7, MPEG-21 and HTML-5 attempt to enrich the online video experience by enabling search, sharing and enhanced display of key information in content. The popularity of MPEG-7 and MPEG-21 has been limited as they do not address the fundamental issue of key information extraction from content. Instead, these standards provide a mechanism to query and share information between devices. HTML-5 has gained noticeable attention in the press recently. It proposes a major revision to the video tag that enables dynamic and interactive access to playback content shown on a browser. Video window orientation, coloring, edge effects, and trick-mode controls are some of the effects proposed by the standard. HTML-5 may be exceptional as it holds promise for enhancing the online video experience through its rich graphics operations and audio/video effects.
To recap, free content requires the addition of metadata such as key frames, scene classification and summarization to mirror the rich video experience offered by premium content. However, unlike premium content, it is unrealistic to expect this data to be generated by the user and tagged onto the stream. To be a viable option, the data needs to be generated in real time, requiring only modest computing resources. The current approaches discussed above fail to meet this requirement for various reasons. Audio and video analysis techniques may have the power to create the metadata, but due to their complexity they require computing resources far exceeding the capabilities of consumer media devices. Additionally, these techniques do not run in real time and are therefore best suited for offline creation. Codec technologies have demonstrated their ability to embed key metadata into the stream during the encoding process. However, encoding complexity and the lack of supporting decoders/players limit their use. Industry standards including HTML-5 do not provide a comprehensive solution either, as they fail to address the core issue of metadata creation.
This patent application describes a solution to these challenges.