In a very short time, YouTube has become one of the biggest video databases in the world. It features millions of videos, each about 9 Mbytes in size and several minutes long, and thousands of new videos are uploaded each day. While YouTube user-generated videos are often short (minutes, not hours), iTunes, MSN, and Google Video offer short, episodic, and full-length content. Other types of media with a temporal dimension are also prevalent: for example, slide shows, music, annotated music, sequenced images, and so on. All these media are increasingly accessed via the mobile Web browser or via applications installed on the mobile device. Most mobile Web sites and applications, however, offer very poor and limited tools for content understanding, that is, tools to help customers quickly understand the gist or substance of the content, especially video, they are interested in.
“Content understanding” means the act of browsing through content in order to create a mental model of it to some sufficient degree. The user's sufficiency requirements may hinge on their ability to determine specific details such as: “Is a goal scored in the first 10 minutes of this football video?”, “Does the video have a scene in which two men fight onboard a helicopter?”, “Does the video have a scene in which a cat falls off a ledge after a baby scares it?”. Questions of this type are almost impossible to resolve on today's Web-centric media sharing sites such as Yahoo!®, Google™ and YouTube. Thus the benefits of content-based browsing, especially with respect to video, are clear whenever the media content is more complex than “trivial”.
Few effective tools exist for non-linear browsing and understanding of video content on mobile devices. For example, FIG. 1 depicts YouTube on a mobile handset. YouTube.com does not provide informative “preview” information for videos apart from a few video keyframes. Content understanding comes only from the keyframe, the video duration (e.g., 03:52 min), and the aggregated “user tags” created by the community. Complex content cannot be inferred (e.g., “is this the one where the guy on the sled hits the other guy after going over a ramp?”).
YouTube's Warp tool shows the relationships between videos in a graphical way, but not the fine-grained details of the content within a given video. YouTube's Java application for smartphones previews content only from a single keyframe. MotionBox.com and other similar sites use the prevalent technique of showing a static keyframe strip below the movie. Guba.com employs a 4×4 matrix of keyframes for any given video, but the representation is non-navigable. The Internet Archive Website lays out one keyframe for each minute of the video in question, which gives only a weak, diluted view of the video content. Finally, the current art also enables limited video understanding through “tags”, but the tag paradigm (also known as “folksonomy”) has several drawbacks, including weak semantics, low scalability, and lack of hierarchy. These drawbacks make it unsuitable for deep video content understanding.