In a very short time, YouTube has become one of the biggest video databases in the world. It features millions of videos, each about 9 Mbytes in size and several minutes long, and thousands of new videos are uploaded each day. While YouTube user-generated videos are often short (minutes, not hours), iTunes, MSN, and Google Video offer short, episodic, and full-length content. Other types of media with a temporal dimension are also prevalent: for example, slide shows, music, annotated music, sequenced images, and so on. All these media are increasingly accessed via the mobile Web browser or via applications installed on the mobile device. Most mobile Web sites and applications, however, offer very poor and limited tools for content-understanding, that is, tools to help customers quickly understand the gist or substance of the content, especially video, that they are interested in. At the same time, the mobile device market is thriving, with scores of manufacturers making high-function mobile devices such as media players, cellular phones, PDAs, and so on. Many of these devices now employ touch or multi-touch screen technology.
Media summarization is the process of compacting, laying out, or otherwise making more accessible the complex contents of media to enable media content understanding. Gaining content understanding, then, is the act of browsing through content in order to create a mental model of it to some sufficient degree. The user's sufficiency requirements may hinge on their ability to determine specific media content, such as: “Is a goal scored in the first 10 minutes of this football video?”, “Does the video have a scene in which two men fight onboard a helicopter?”, “Does the video have a scene in which a cat falls off a ledge after a baby scares it?”, “Does the music score have a large crescendo around the midpoint of the score?”. Questions of this type are almost impossible to answer on today's Web-centric media sharing sites such as Yahoo!®, Google™, and YouTube. In addition, to support these determinations, users require visual tools that support multiple intuitive ways to change their “field of view” into the media. Thus the benefits of content-based browsing, especially with respect to video, are clear.
There are few effective tools for non-linear browsing and understanding of video content on high-functionality mobile devices, and even fewer that exploit multi-touch technology. For example, FIG. 1 depicts YouTube on a mobile device. YouTube.com does not provide informative “preview” information for videos apart from a few video keyframes. Content-understanding comes only from the keyframe, the video duration (e.g., 03:52 min), and the aggregated “user tags” created by the community. Complex content cannot be easily inferred.
YouTube's Warp tool shows the relationships between videos in a graphical way, but not fine-grain details of the content within a given video. YouTube's Java application for smartphones only previews content from a single keyframe. MotionBox.com and other similar sites use the prevalent technique of showing a static keyframe strip below the movie. Guba.com employs a 4×4 matrix of keyframes for any given video, but the representation is non-navigable. The Internet Archive Website lays out one keyframe for each minute of the video in question, which allows only a somewhat weak or diluted view of the video content. Finally, note that the current art also enables a partial and limited video understanding through the use of textual “tags”, but the tag paradigm has several drawbacks that make it unsuitable as a generic media indexing paradigm, including weak semantics, low scalability, and lack of hierarchy. These drawbacks make that paradigm particularly unsuitable for video content understanding, at least as the sole method of indexing.
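As a rough illustration of why the fixed keyframe layouts described above give only a coarse view of content, the following minimal sketch computes the timestamps at which keyframes would be sampled for a one-keyframe-per-minute layout and for a 4×4 matrix. The function names `per_minute_timestamps` and `grid_timestamps` are hypothetical, and the midpoint sampling rule for the matrix is an assumption; the sites named above do not document how they select keyframes.

```python
from typing import List


def per_minute_timestamps(duration_s: float) -> List[float]:
    """Return one timestamp (seconds) at the start of each minute of video."""
    stamps: List[float] = []
    t = 0.0
    while t < duration_s:
        stamps.append(t)
        t += 60.0
    return stamps


def grid_timestamps(duration_s: float, rows: int = 4, cols: int = 4) -> List[List[float]]:
    """Return evenly spaced timestamps laid out as a rows x cols matrix.

    Assumption: the video is split into rows*cols equal segments and the
    midpoint of each segment is used as its keyframe timestamp.
    """
    n = rows * cols
    segment = duration_s / n
    flat = [(i + 0.5) * segment for i in range(n)]
    # Slice the flat list into rows, reading left to right, top to bottom.
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


if __name__ == "__main__":
    # A 03:52 video (232 seconds) yields only four per-minute keyframes.
    print(per_minute_timestamps(232))
    # A 160-second video sampled into a 4x4 matrix: one keyframe every 10 s.
    for row in grid_timestamps(160):
        print(row)
```

Under these assumptions, a 03:52 video is represented by only four per-minute keyframes, so an event lasting a few seconds between two samples is invisible to the viewer regardless of how the keyframes are arranged on screen.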
Multi-touch refers to a human-computer interaction technique, and to the hardware devices that implement it, that allows users to compute without conventional input devices (e.g., mouse, keyboard). A multi-touch device, or “multi-touch screen”, consists of a touch screen (screen, table, wall, etc.) or touchpad, as well as software that recognizes multiple simultaneous touch points, as opposed to the standard (single) touchscreen (e.g., computer touchpad, ATM), which recognizes only one touch point. This effect is achieved through a variety of means, including but not limited to: heat, finger pressure, high-capture-rate cameras, infrared light, optic capture, and shadow capture. This definition of the term multi-touch applies throughout the present application.