As computer and video technology has advanced, there has been increasing interest in bringing the convenience of hypertext-like linking and annotation to video and multimedia applications. Users, accustomed to instantly clicking on links in web pages, increasingly want a similar ability to instantly find out more about the objects displayed in video content. This type of rich video content, in which objects serve as anchors associated with actions, is often referred to as “hypervideo”, by analogy to hypertext.
As a result, a number of multimedia methods have been developed that attempt to link hypertext and other information (metadata) to video images. Metadata is a generic term referring here to semantic information describing the video content. Such information can be a description of the entire video sequence, of temporal portions of it, or of spatio-temporal objects.
The creation of metadata associated with video objects is often referred to as “annotation”. This can be an automatic, semi-automatic or manual process. Typically, the user indicates an object (often an image of a real-world object such as a person, place, or thing). This can be done, for example, by defining the object's boundaries in a manual process, or by pointing to a pre-segmented object in a semi-automatic process. The user then creates a metadata entry associated with the indicated object.
Typically, the metadata includes at least two data structures, either linked to or separate from each other. The first structure is the representation of video objects, and the second structure is their annotation. Video objects can be represented in different ways, including, for example, bounding boxes, collections of primitive geometric figures (constructive geometry), octrees and binary space partition trees, volume maps, etc. Such a representation allows the user to determine a spatio-temporal coordinate in the video (e.g., the spatial and temporal “location” in the video that another user might later select by clicking a mouse or other pointing device), and to select the video object located at that location. Once the video object has been identified, its corresponding annotation can be retrieved from an annotation database.
Annotation, in turn, is a broad term referring to a description of video objects. Such a description may contain textual information, a multimedia description, or a description of actions associated with an object and performed when the object is clicked.
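The two structures described above, an object representation and a separate annotation store, can be illustrated with a minimal sketch. It assumes the simplest of the listed representations (bounding boxes, each valid over a time interval) and an in-memory dictionary standing in for the annotation database. All class names, function names, field names, and sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BoxSpan:
    """Axis-aligned bounding box valid over a time interval (hypothetical layout)."""
    t_start: float  # seconds
    t_end: float    # seconds
    x: float        # left edge, pixels
    y: float        # top edge, pixels
    width: float
    height: float

    def contains(self, t: float, px: float, py: float) -> bool:
        # True when the spatio-temporal point (t, px, py) lies inside the box.
        return (self.t_start <= t <= self.t_end
                and self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class VideoObject:
    """First structure: the representation of a video object."""
    object_id: str
    spans: Tuple[BoxSpan, ...]


@dataclass(frozen=True)
class Annotation:
    """Second structure: descriptive text plus an optional action on click."""
    text: str
    action_url: Optional[str] = None


def select_object(objects: Sequence[VideoObject],
                  t: float, px: float, py: float) -> Optional[str]:
    # Return the id of the first object whose box covers the clicked point.
    for obj in objects:
        if any(span.contains(t, px, py) for span in obj.spans):
            return obj.object_id
    return None


def lookup_annotation(objects: Sequence[VideoObject],
                      annotations: Dict[str, Annotation],
                      t: float, px: float, py: float) -> Optional[Annotation]:
    # Identify the object first, then retrieve its annotation from the store.
    object_id = select_object(objects, t, px, py)
    if object_id is None:
        return None
    return annotations.get(object_id)


# Illustrative data only: one object visible from 10 s to 15 s.
objects = [VideoObject("actor_1", (BoxSpan(10.0, 15.0, 100, 50, 80, 200),))]
annotations = {"actor_1": Annotation("Lead actor", "https://example.com/actor_1")}
```

Keeping the object representation and the annotation store separate, joined only by an identifier, mirrors the "linked or separate" arrangement described above. A richer representation, such as an octree or a volume map, could replace `BoxSpan` without changing the lookup step.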
As an example of an industry standard for metadata description, the MPEG-7 multimedia content description interface standard provides an extensible markup language (XML) format for the representation of objects and their annotation. MPEG-7 information includes information pertaining to the actual title of the video content, copyright ownership, storage format, spatial and time information such as scene cuts, information about colors, textures and sound, information about the objects in a video scene, summaries, and content index information. Some of the MPEG-7 image feature descriptors include silhouettes of objects or outlines of objects.
Although MPEG-7 contains only video metadata and not the actual video information, other emerging standards, such as MPEG-47, combine both compressed MPEG-4 video and the MPEG-7 multimedia content description interface into a single data storage format. Another example is the proprietary Adobe Flash technology, which combines both video and metadata.
One of the reasons for the slow penetration of interactive video applications is that linking metadata to the video content requires a different video distribution scheme. The video content must be packaged together with the metadata in a special format (such as one of those mentioned above), which implies changes on the content distributor side. The video client, in turn, is required to support this type of packaging.
As one consequence, legacy video content in many cases cannot be made interactive: if one has a collection of DVD movies, there is no way to incorporate metadata into the information stored on a DVD, since it is a read-only medium.
Another factor explaining the slow penetration of interactive video is that video viewing is traditionally considered a “lay-back” experience, in which the user has little or no interaction. That is, the user will “lay back” in his or her chair or couch, and will generally plan on watching the video with little or no interaction. The hypervideo experience, on the other hand, requires a certain amount of interaction from the user, and thus can be considered a “lean-forward” experience. Here the user will generally be sitting by a computer or handheld device, and will be “leaning forward” in order to interact with the device, that is, to use a mouse or other pointer to click various images and image hyperlinks. A “lean-forward” user expects to interact with the video, control the video, and obtain further information.
Thus, the main environment considered for hypervideo applications is a PC, which is “lean-forward” and naturally allows for video interaction using the PC pointing device (mouse). People using a PC as a TV replacement for watching video content are reported to perform additional tasks while watching: web surfing, searching for information about actors or events in the movie, etc.
A major challenge hampering the wide adoption of interactive video applications is porting the hypervideo experience to the TV “lay-back” environment. The multitasking nature of a PC environment is absent in a TV. A typical TV user has minimal interaction with the device, in the extreme case boiling down to starting the movie and watching it continuously without interruption.
Recently, it has been reported that some users try to combine the lean-forward and lay-back experiences by using a portable device while watching TV. The TV is used to watch the content without interruption, and the portable device is used to access information according to interests arising during content playback. Also, in many cases when the video content is watched by multiple persons, each of them has a separate mobile device and accesses different information independently and simultaneously.
There is also an important social aspect in the aforementioned environments. Typically, TV watching is a social experience, in which multiple persons (e.g. members of the same family) share the same experience. The PC environment has more of an individualistic character, where the user alone controls his experience.
Thus, there exists a need for improved systems, methods and devices that enhance a user's video experiences. As will be seen, the invention enables such systems, methods and devices in a novel and useful manner.
Previous efforts to improve the interactivity of television included various types of remote control devices and video on demand systems, such as the devices and systems taught in U.S. Pat. Nos. 6,678,740, 6,857,132, 6,889,385, and 6,970,127 by Rakib, U.S. Pat. No. 7,344,084 by DaCosta, and US patent applications 20020059637, 2002044225, and 20020031120 by Rakib.