Video information is being produced at an ever-increasing rate and video sequences, especially short sequences, are increasingly being used, for example, in websites and on CD-ROM, and being created, for example, by domestic use of camcorders. There is a growing need for tools enabling the indexing, handling and interaction with video data. It is particularly necessary for interfaces to be provided which enable a user to access video information selectively and to interact with that information, especially in a non-sequential way.
Conventionally, video information consists of a sequence of frames recorded at a fixed time interval. In the case of classic television signals, for example, the video information consists of 25 or 30 frames per second. Each frame is meaningful since it corresponds to an image which can be viewed. A frame may be made up of a number of interlaced fields, but this is not obligatory as is seen from more recently proposed video formats, such as those intended for high definition television. Frames describe the temporal decomposition of the video image information. Each frame contains image information structured in terms of lines and pixels, which represent the spatial decomposition of the video.
In the present document, the terms “video information” or “video sequences” refer to data representing a visual image recorded over a given time period, without reference to the length of that time period or the structure of the recorded information. Thus, the term “video sequence” will be used to refer to any series of video frames, regardless of whether this series corresponds to a single camera shot (recorded between two cuts) or to a plurality of shots or scenes.
Traditionally, a user who wished to know the content of a particular video sequence was obliged to watch as each frame, or a sub-sample of the frames, of the sequence was displayed successively in time. (For purposes of this document, the terms “he,” “him,” or “his” are used for convenience in place of she/he, her/him and hers/his, and are intended to be gender-neutral.) This approach is still widespread, and in applications where video data is accessed using a personal computer, the interface to the video often consists of a displayed window in which the video sequence is contained and a set of displayed controls similar to those found on a video tape recorder (allowing fast-forward, rewind, etc.).
Developments in the fields of video indexing and video editing have provided other forms of interface to video information.
In the field of video indexing, it is necessary to code information contained in a video sequence in order to enable subsequent retrieval of the sequence from a database by reference to keywords or concepts. The coded content may, for example, identify the types of objects present in the video sequence, their properties/motion, the type of camera movements involved in the video sequence (pan, tracking shot, zoom, etc.), and other properties. A “summary” of the coded document may be prepared, consisting of certain representative frames taken from the sequence, together with text information or icons indicating how the sequence has been coded. The interface for interacting with the video database typically includes a computer input device enabling the user to specify objects or properties of interest and, in response to the query, the computer determines which video sequences in the database correspond to the input search terms and displays the appropriate “summaries”. The user then indicates whether or not a particular video sequence should be reproduced. Examples of products using this approach are described in the article “Advanced Imaging Product Survey: Photo, Document and Video” from the journal “Advanced Imaging”, October 1994, which document is incorporated herein by this reference.
In some video indexing schemes, the video sequence is divided up into shorter series of frames based upon the scene changes or the semantic content of the video information. A hierarchical structure may be defined. Index “summaries” may be produced for the different series of frames corresponding to nodes in the hierarchical structure. In such a case, at the time when a search is made, the “summary” corresponding to a complete video sequence may be retrieved for display to the user who is then allowed to request display of “summaries” relating to sub-sections of the video sequence which are lower down in the hierarchical structure. If the user so wishes, a selected sequence or sub-section is reproduced on the display monitor. Such a scheme is described in EP-A-0 555 028 which is incorporated herein by this reference.
A disadvantage of such traditional, indexing/searching interfaces to video sequences is that the dynamic quality of the video information is lost.
Another approach, derived from the field of video editing, consists of the “digital storyboard”. The video sequence is segmented into scenes and one or more representative frames from each scene is selected and displayed, usually accompanied by text information, side-by-side with representative frames from other segments. The user now has both a visual overview of all the scenes and a direct visual access to individual scenes. Each representative frame of the storyboard can be considered to be an icon. Selection of the icon via a pointing device (typically a mouse-controlled cursor) causes the associated video sequence or subsequence to be reproduced. Typical layouts for the storyboards are two-dimensional arrays or long one-dimensional strips. In the first case, the user scans the icons from the left to the right, line by line, whereas in the second case the user needs to move the strip across the screen.
Digital storyboards are typically created by a video editor who views the video sequence, segments the data into individual scenes and places each scene, with a descriptive comment, onto the storyboard. As is well known from the technical literature, many steps of this process can be automated. For example, different techniques for automatic detection of scene changes are discussed in the following documents, each of which is incorporated herein by reference:
“A Real-time neural approach to scene cut detection” by Ardizzone et al, IS&T/SPIE Storage & Retrieval for Image and Video Databases IV, San Jose, Calif.;
“Digital Video Segmentation” by Hampapur et al, ACM Multimedia '94 Proceedings, ACM Press;
“Extraction of News Articles based on Scene Cut Detection using DCT Clustering” by Ariki et al, International Conference on Image Processing, September 1996, Lausanne, Switzerland;
“Automatic partitioning of full-motion video” by HongJiang Zhang et al, Multimedia Systems (Springer-Verlag, 1993), 1, pages 10-28; and
EP-A-0 590 759.
Various methods for automatically detecting and tracking persons and objects in video sequences are considered in the following documents, each of which is incorporated herein by reference:
“Modeling, Analysis and Visualization of Nonrigid Object Motion”, by T. S. Huang, Proc. of International Conf. on Pattern Recognition, Vol. 1, pp 361-364, Atlantic City, N.J., June 1990; and
“Segmentation of People in Motion” by Shio et al, Proc. IEEE, vol. 79, pp 325-332, 1991.
Techniques for automatically detecting different types of camera shot are described in:
“Global zoom/pan estimation and compensation for video compression” by Tse et al, Proc. ICASSP, Vol. 4, pp 2725-2728, May 1991; and
“Differential estimation of the global motion parameters zoom and pan” by M. Hoetter, Signal Processing, Vol. 16, pp 249-265, 1989.
In the case of digital storyboards too, the dynamic quality of the video sequence is often lost or obscured. Some impression of the movement inherent in the video sequence can be preserved by selecting several frames to represent each scene, preferably frames which demonstrate the movement occurring in that scene. However, storyboard-type interfaces to video information remain awkward to use because multiple actions on the user's part are necessary in order to view and access data.
Attempts have been made to create a single visual image which represents both the content of individual views making up a video sequence and preserves the context, that is, the time-varying nature of the video image information.
One such approach creates a “trace” consisting of a single frame having superimposed images taken from different frames of the video sequence, these images being offset one from the other due to motion occurring between the different frames from which the images were taken. Thus, for example, in the case of a video sequence representing a sprinter running, the corresponding “trace” will include multiple (probably overlapping) images of the sprinter, spaced in the direction in which the sprinter is running. Another approach of this kind generates a composite image, called a “salient still”, representative of the video sequence; see “Salient Video Stills: Content and Context Preserved” by Teodosio et al, Proc. ACM Multimedia 93, California, Aug. 1-6, 1993, pp 39-47, which article is incorporated herein by this reference in its entirety.
Still another approach of this general type consists in creation of a “video icon”, as described in the papers “Developing Power Tools for Video Indexing and Retrieval” by Zhang et al, SPIE, Vol. 2185, pp 140-149; and “Video Representation tools using a unified object and perspective based approach” by the present inventors, IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases, San Jose, Calif., February 1995, which are incorporated herein by reference.
In a “video icon”, as illustrated in FIG. 1A, the scene is represented by a number of frames selected from the sequence and which are displayed as if they were stacked up one behind the other in the z-direction and are viewed in perspective. In other words, each individual frame is represented by a plane and the planes lie one behind the other with a slight offset. Typically the first frame of the stack is displayed in its entirety whereas underlying frames are partially occluded by the frames in front. The envelope of the stack of frames has a parallelepiped shape. The use of a number of frames, even if they are partially occluded, gives the user a more complete view of the scene and, thus, a better visual understanding. Furthermore, with some such icons, the user can directly access any frame represented in the icon.
Two special types of video icon have been proposed, “object based” video icons and video icons containing a representation of camera movement. In an “object based” video icon, as illustrated in FIG. 1B, objects of interest are isolated in the individual frames and, for at least some of the stacked frames, the only image information included in the video icon is the image information corresponding to the selected object. In such a video icon, at least some of the individual frames are represented as if they were transparent except in the regions containing the selected object. Video icons containing an indication of camera movement may have, as illustrated in the example of FIG. 1C, a serpentine-shaped envelope corresponding to the case of side-to-side motion of the camera.
The video icons discussed above present the user with information concerning the content of the whole of a video sequence and serve as a selection tool allowing the user to access frames of the video sequence out of the usual order. In other words, these icons allow non-sequential access to the video sequence. Nevertheless, the ways in which the user can interact with the video sequence information are strictly limited. The user can select frames for playback in a non-sequential way but he has little or no means of obtaining a deeper level of information concerning the video sequence as a whole, short of watching a playback of the whole sequence.
The present invention provides a novel type of interface to video information which allows the user to access information concerning a video sequence in a highly versatile manner. In particular, interactive video interfaces of the present invention enable a user to obtain deeper levels of information concerning an associated video sequence at positions in the sequence which are designated by the user as being of interest.
The present invention provides an interface to information concerning an associated video sequence, one such interface comprising:
information defining a three-dimensional root image, the root image consisting of a plurality of basic frames selected from said video sequence, and/or a plurality of portions of video frames corresponding to selected objects represented in the video sequence, x and y directions in the root image corresponding to x and y directions in the video frames and the z direction in the root image corresponding to the time axis whereby the basic frames are spaced apart from one another in the z direction of the root image by distances corresponding to the time separation between the respective video frames;
means for displaying views of the root image;
means for designating a viewing position relative to said root image; and
means for calculating image data representing said three-dimensional root image viewed from the designated viewing position, and for outputting said calculated image data to the displaying means.
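The relationship between time separation and z spacing recited above can be illustrated with a minimal Python sketch. The names `BasicFrame`, `build_root_image` and `z_per_second` are hypothetical and chosen only for illustration; the sketch assumes a constant frame rate and is not presented as the implementation of the invention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BasicFrame:
    frame_index: int  # index of the frame within the source video sequence
    z: float          # position along the z (time) axis of the root image


def build_root_image(frame_indices: List[int], fps: float,
                     z_per_second: float = 1.0) -> List[BasicFrame]:
    """Place each selected frame at a z position proportional to its time offset.

    Assumes a constant frame rate; z_per_second is an illustrative scale factor.
    """
    first = frame_indices[0]
    return [BasicFrame(i, (i - first) / fps * z_per_second) for i in frame_indices]


# Frames 0, 25 and 75 of a 25 frames-per-second sequence lie 1 s and 2 s apart.
root = build_root_image([0, 25, 75], fps=25.0)
print([(f.frame_index, f.z) for f in root])  # [(0, 0.0), (25, 1.0), (75, 3.0)]
```

In this sketch, frames separated by a longer time interval end up further apart in the z direction, so the spacing of the stack itself conveys timing.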
According to the present invention, customized user interfaces may be created for video sequences. These interfaces comprise a displayable “root” image which directly represents the content and context of the image information in the video sequence and can be manipulated, either automatically or by the user, in order to display further image information, by designation of a viewing position with respect to the root image, the representation of the displayed image being changed in response to changes in the designated viewing position. In a preferred embodiment of the present invention, the representation of the displayed image changes dependent upon the designated viewing position as if the root image were a three-dimensional object. In such preferred embodiments, as the designated viewing position changes, the data necessary to form the displayed representation of the root image is calculated so as to provide the correct perspective view given the viewing angle, the distance separating the viewing position from the displayed quasi-object and whether the viewing position is above or below the displayed quasi-object.
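By way of illustration only, the perspective calculation described above might be sketched with a simple pinhole projection. The function `project_point`, its parameters `eye`, `yaw_deg` and `focal`, and the choice of a rotation about the vertical axis are assumptions made for this example; the text does not prescribe a particular projection model.

```python
import math


def project_point(point, eye, yaw_deg=0.0, focal=1.0):
    """Project a root-image point (x, y, z) onto a 2-D view plane.

    eye is the designated viewing position; yaw_deg rotates the view about
    the vertical (y) axis. A simple pinhole model is assumed for illustration.
    """
    x, y, z = (p - e for p, e in zip(point, eye))
    a = math.radians(yaw_deg)
    # Rotate the point into the viewer's coordinate frame.
    xc = x * math.cos(a) - z * math.sin(a)
    zc = x * math.sin(a) + z * math.cos(a)
    if zc <= 0:
        raise ValueError("point lies behind the viewing position")
    # Perspective division: more distant points are drawn closer to the centre.
    return (focal * xc / zc, focal * y / zc)


# The same frame corner on two frames, one 2 units further along the time axis.
print(project_point((1.0, 1.0, 0.0), eye=(0.0, 0.0, -2.0)))  # (0.5, 0.5)
print(project_point((1.0, 1.0, 2.0), eye=(0.0, 0.0, -2.0)))  # (0.25, 0.25)
```

Moving `eye` closer or further, above or below, or changing `yaw_deg` changes the projected corners of every frame, which is the sense in which the displayed image behaves as a view of a three-dimensional object.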
In a reduced form, the present invention can provide non-interactive interfaces to video sequences, in which the root image information is packaged with an associated script defining a routine for automatically displaying a sequence of different views of the root image and performing a set of manipulations on the displayed image, no user manipulation being permitted. However, the full benefits of the invention are best seen in interactive interfaces where the viewing position of the root image is designated by the user, as follows. When the user first accesses the interface he is presented with a displayed image which represents the root image seen from a particular viewpoint (which may be a predetermined reference viewpoint). As he designates different viewing angles, the displayed image represents the root image seen from different perspectives. When the user designates viewing positions at greater or lesser distances from the root image, the displayed image increases or reduces the size and, preferably, resolution of the displayed information, accessing image data from additional video frames, if need be.
The customized, interactive interfaces provided by the present invention involve displayed images, representing the respective associated video sequences, which, in some ways, could be considered to be a navigable environment or a manipulable object. This environment or object is a quasi-three-dimensional entity. The x and y dimensions of the environment/object correspond to true spatial dimensions (corresponding to the x and y directions in the associated video frames) whereas the z dimension of the environment/object corresponds to the time axis. These interfaces could be considered to constitute a development of the “video icons” discussed above, now rendered interactive and manipulable by the user.
With the interfaces provided by the present invention, the user can select spatial and temporal information from a video sequence for access by designating a viewing position with respect to a video icon representing the video sequence. Arbitrarily chosen oblique “viewing directions” are possible whereby the user simultaneously accesses image information corresponding to portions of a number of different frames in the video sequence. As the user's viewing position relative to the video icon changes, the amount of a given frame which is visible to him, and the number and selection of frames which he can see, changes correspondingly.
As mentioned above, the interactive video interfaces of the present invention make use of a “root” image comprising a plurality of basic frames arranged to form a quasi-three-dimensional object. It is preferred that the relative placement positions of the basic frames be arranged so as to indicate visually some underlying motion in the video sequence. Thus, for example, if the video sequence corresponds to a travelling shot moving down a hallway and turning a corner, the envelope of the set of basic frames preferably does not have a parallelepiped shape but, instead, forms a “pipe” of rectangular section, bending in a way corresponding to the camera travel during filming of the video sequence.
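One way such placement positions could be derived is sketched below in Python, assuming that an estimate of horizontal camera displacement between consecutive frames is already available (for instance from one of the global-motion techniques cited earlier). The function name `placement_offsets` and its inputs are hypothetical.

```python
from itertools import accumulate


def placement_offsets(pan_per_frame, basic_indices):
    """Return the sideways offset of each basic frame in the root image.

    pan_per_frame[k] is the assumed camera displacement between frame k and
    frame k + 1; the offset of a basic frame is the displacement accumulated
    from the start of the sequence up to that frame.
    """
    cumulative = [0.0] + list(accumulate(pan_per_frame))
    return [cumulative[i] for i in basic_indices]


# The camera is static for two frames, then pans steadily to one side.
offsets = placement_offsets([0.0, 0.0, 2.0, 2.0, 2.0], basic_indices=[0, 3, 5])
print(offsets)  # [0.0, 2.0, 6.0]
```

Drawing each basic frame at its computed offset makes the envelope of the stack bend in the direction of the camera travel rather than forming a straight parallelepiped.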
In preferred embodiments of video interfaces according to the present invention, the basic video frames making up the root image are chosen as a function of the amount of motion or change in the sequence. For example, in the case of a video sequence corresponding to a travelling shot, in which the background information changes, it is preferable that successive basic frames should include background information overlapping by, say, 50%.
In certain embodiments of the present invention, the root image corresponds to an “object-based video icon.” In other words, certain of the basic frames included in the root image are not included therein in full; only those portions corresponding to selected objects are included. Alternatively, or additionally, certain basic frames may be included in full in the root image but may include “hot objects,” that is, representations of objects selectable by the user. In response to selection of such “hot objects” by the user, the corresponding basic frames (and, if necessary, additional frames) are then displayed as if they had become transparent at all portions thereof except the portion(s) where the selected object or objects are displayed. The presence of such selectable objects in the root image allows the user to selectively isolate objects of interest in the video sequence and obtain at a glance a visual impression of the appearance and movement of the objects during the video sequence.
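The effect of making a frame transparent everywhere except at a selected object can be sketched as follows. This Python example assumes, purely for illustration, that the object region is known as a rectangular bounding box and that a frame is a list of rows of pixel values; the function `isolate_object` is hypothetical.

```python
def isolate_object(frame, box):
    """Attach an alpha value to every pixel, opaque only inside the object box.

    frame: list of rows of pixel values.
    box: (left, top, right, bottom), with right and bottom exclusive. A
    rectangular region is an assumption; a real object mask may be any shape.
    """
    left, top, right, bottom = box
    return [
        [(value, 255 if left <= x < right and top <= y < bottom else 0)
         for x, value in enumerate(row)]
        for y, row in enumerate(frame)
    ]


frame = [[1, 2],
         [3, 4]]
# Only the top-right pixel belongs to the selected object.
print(isolate_object(frame, box=(1, 0, 2, 1)))
# [[(1, 0), (2, 255)], [(3, 0), (4, 0)]]
```

When every frame of the stack is treated this way, frames further back remain visible through the transparent regions, so the object's successive positions can be seen at a glance.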
The interfaces of the present invention allow the user to select an arbitrary portion of the video sequence for playback. The user designates a portion of the video sequence which is of interest, by designating a corresponding portion of the displayed image forming part of the interface to the video sequence. This portion of the video sequence is then played back. The interface may include a displayed set of controls similar to those provided on a VCR in order to permit the user to select different modes for this playback, such as fast-forward, rewind, etc.
In preferred embodiments of interfaces according to the invention, the displayed image forming part of the interface remains visible whilst the designated portion of the sequence is being played back. This can be achieved in any number of ways, as for example, by providing a second display device upon which the playback takes place, or by designating a “playback window” on the display screen, this playback window being offset with respect to the screen area used by the interface, or by any other suitable means.
The preferred embodiments of interfaces according to the invention also permit the user to designate an object of interest and to select a playback mode in which only image information concerning that selected object is included in the playback. Furthermore, the user can select a single frame from the video sequence for display separately from the interactive displayed image generated by the interface.
In preferred embodiments, the interfaces of the present invention allow the user to generate a displayed image corresponding to a distortion of the root image. More especially, the displayed image can correspond to the root image subjected to an “accordion effect”, where the root image is “cracked open”, for example, by bending around a bend line so as to “fan out” video frames in the vicinity of the opening point, or is modified by linearly spreading apart video frames at a point of interest. The accordion effect can also be applied repetitively or otherwise in a nested fashion according to the present invention.
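The linear-spreading variant of the accordion effect can be illustrated with the short Python sketch below. The function `accordion_spread` and its parameters `center`, `radius` and `factor` are assumptions introduced for this example; the fan-out variant, which bends the stack around a line, is not covered.

```python
def accordion_spread(z_positions, center, radius, factor):
    """Widen the gaps between neighbouring frames near a point of interest.

    A gap is stretched by `factor` when its midpoint lies within `radius` of
    `center`; frames beyond the stretched region are shifted back so that
    their own spacing is unchanged.
    """
    out = [z_positions[0]]
    for prev, cur in zip(z_positions, z_positions[1:]):
        gap = cur - prev
        midpoint = (prev + cur) / 2
        if abs(midpoint - center) <= radius:
            gap *= factor
        out.append(out[-1] + gap)
    return out


# Open up the stack around z = 2 so that the frames there can be seen.
print(accordion_spread([0.0, 1.0, 2.0, 3.0, 4.0], center=2.0, radius=1.0, factor=3.0))
# [0.0, 1.0, 4.0, 7.0, 8.0]
```

Because the function maps one list of z positions to another, it can be applied again to its own output around a different point, which corresponds to the repeated or nested use mentioned above.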
The present invention can provide user interfaces to “multi-threaded” video sequences, that is, video sequences consisting of numerous interrelated shorter segments such as are found, for example, in a video game where the user's choices change the scene which is displayed. Interfaces to such multi-threaded video sequences can include frames of the different video segments in the root image, such that the root image has a branching structure. Alternatively, some or all of the different threads may not be visible in the root image but may become visible as a result of user manipulation. For example, if the user expresses an interest in a particular region of the video sequence by designating a portion of a displayed root image using a pointing device (such as a mouse, or by touching a touch screen, etc.) then if multiple different threads of the sequence start from the designated area, image portions for these different threads may be added to the displayed image.
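A branching root image of this kind lends itself to a tree representation. The following Python sketch is illustrative only; the `Segment` class, its fields and the function `threads_from` are hypothetical names, and the segment labels are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    name: str                 # label for this video segment (illustrative)
    frames: List[int]         # basic frames taken from the segment
    branches: List["Segment"] = field(default_factory=list)


def threads_from(segment: Segment) -> List[List[str]]:
    """List every thread (path of segment names) starting at this segment."""
    if not segment.branches:
        return [[segment.name]]
    return [[segment.name] + tail
            for branch in segment.branches
            for tail in threads_from(branch)]


ending = Segment("ending", [300, 320])
root = Segment("intro", [0, 50], [
    Segment("left door", [100, 120]),
    Segment("right door", [200, 220], [ending]),
])
print(threads_from(root))
# [['intro', 'left door'], ['intro', 'right door', 'ending']]
```

Calling `threads_from` on the segment that the user has designated yields the threads whose image portions could be added to the displayed image.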
In preferred embodiments of interfaces according to the present invention, the root image for the video sequence concerned is associated with information defining how the corresponding displayed image will change in response to given types of user manipulation. Thus, for example, this associated information may define how many, or which additional frames are displayed when the user moves the viewing position closer up to the root image. Similarly, the associated information may identify which objects in the scene are “hot objects” and what image information will be displayed in relation to these hot objects when activated by the user.
Furthermore, different possibilities exist for delivering the components of the interface to the end user. In an application where video sequences are transmitted to a user over a telecommunications path, such as via the Internet, the user who is interested in a particular video sequence may first download only certain components of the associated interface. First of all he downloads information for generating a displayed view of the root image, together with an associated application program (if he does not already have an appropriate “interface player” loaded in his computer). The downloaded (or already-resident) application program includes basic routines for changing the perspective of the displayed image in response to changes in the viewing position designated by the user. The application program is also adapted to consult any “associated information” (as mentioned above) which forms part of the interface and conditions the way in which the displayed image changes in response to certain predetermined user manipulations (such as “zoom-in” and “activate object”). If the interface does not contain any such “associated information” then the application program makes use of pre-set default parameters.
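The fallback from interface-specific “associated information” to pre-set defaults can be sketched in a few lines of Python. The parameter names `zoom_in_extra_frames` and `hot_objects`, and their default values, are hypothetical and stand in for whatever settings a given interface player might support.

```python
# Pre-set default parameters of a hypothetical interface player.
DEFAULTS = {
    "zoom_in_extra_frames": 2,  # additional frames shown per zoom-in step
    "hot_objects": (),          # no selectable objects unless the interface says so
}


def resolve_behaviour(associated_info=None):
    """Overlay any associated information supplied with the interface on the defaults."""
    return {**DEFAULTS, **(associated_info or {})}


# An interface that customizes zooming but says nothing about hot objects.
print(resolve_behaviour({"zoom_in_extra_frames": 5}))
# {'zoom_in_extra_frames': 5, 'hot_objects': ()}

# An interface with no associated information at all.
print(resolve_behaviour())
# {'zoom_in_extra_frames': 2, 'hot_objects': ()}
```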
The root image corresponds to a particular set of basic video frames and information designating relative placement positions thereof. The root image information downloaded to the user may include just the data necessary to create a reference view of the root image or it may include the image data for the set of basic frames (in order to enable the changes in user viewing angle to be catered for without the need to download additional information). In a case where the user performs a manipulation which requires display of video information which is not present in the root image (e.g. he “zooms in” such that data from additional frames is required), this extra information can either be pre-packaged and supplied with the root image information or the extra information can be downloaded from the host website as and when it is needed.
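The two delivery options described above, pre-packaged frames and frames fetched when needed, can be combined in a small cache. In this Python sketch the class `FrameStore` and the `fetch` callable are hypothetical; `fetch` stands in for whatever download mechanism the host provides and is simulated here by a local function.

```python
class FrameStore:
    """Serve frame data from a pre-packaged set, fetching missing frames on demand."""

    def __init__(self, prepackaged, fetch):
        self._frames = dict(prepackaged)  # frame index -> image data
        self._fetch = fetch               # hypothetical downloader: index -> image data

    def get(self, index):
        # Download a frame only the first time a manipulation requires it.
        if index not in self._frames:
            self._frames[index] = self._fetch(index)
        return self._frames[index]


downloads = []


def fake_fetch(index):
    """Simulated download used only for this example."""
    downloads.append(index)
    return f"frame-{index}"


store = FrameStore({0: "frame-0", 50: "frame-50"}, fetch=fake_fetch)
store.get(0)    # supplied with the root image, no download
store.get(25)   # needed after a zoom-in, downloaded once
store.get(25)   # already held locally
print(downloads)  # [25]
```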
Similar possibilities exist in the case of interfaces provided on CD-ROM. In general, the root image and other associated information will be provided on the CD-ROM in addition to the full video sequence. However, it is to be understood that, for reasons of space saving, catalogues of video sequences could be made consisting solely of interfaces, without the corresponding full video sequences.
In addition to providing the interfaces themselves, the present invention also provides apparatus for creation of interfaces according to the present invention. This may be dedicated hardware or, more preferably, a computer system programmed in accordance with specially designed computer programs.
Various of the steps involved in creation of a customized interface according to the present invention can be automated. Thus, for example, the selection of basic frames for inclusion in the “root image” of the interface can be made automatically according to one of a number of different algorithms, such as choosing one frame every n frames, or choosing one frame every time the camera movement has displaced the background by m%, etc. Similarly, the relative placement positions of the basic frames in the root image can be set automatically taking into account the time separation between those frames and, if desired, other factors such as camera motion. Similarly, the presence of objects or people in the video sequence can be detected automatically according to one of the known algorithms (such as those discussed in the references cited above), and an “object oriented” root image can be created automatically. Thus, in some embodiments, the interface creation apparatus of the present invention has the capability of automatically processing video sequence information in order to produce a root image. These embodiments include means for associating with the root image a standard set of routines for changing the representation of the displayed image in response to user manipulations.
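The two selection rules mentioned above can be sketched in Python as follows. The function names and the form of the input (a per-frame estimate of background displacement, expressed as a percentage of the frame width) are assumptions made for illustration; how that displacement is measured is outside the scope of the sketch.

```python
def select_every_n(num_frames, n):
    """Choose one basic frame every n frames, starting with the first."""
    return list(range(0, num_frames, n))


def select_by_displacement(displacement_pct, m):
    """Choose a basic frame each time the background has moved by at least m%.

    displacement_pct[k] is the assumed background displacement between frame k
    and frame k + 1. The first frame is always chosen.
    """
    chosen = [0]
    accumulated = 0.0
    for k, d in enumerate(displacement_pct):
        accumulated += abs(d)
        if accumulated >= m:
            chosen.append(k + 1)
            accumulated = 0.0
    return chosen


print(select_every_n(10, 4))  # [0, 4, 8]
print(select_by_displacement([20, 20, 20, 5, 50], m=50))  # [0, 3, 5]
```

With the second rule, a fast camera movement yields basic frames that are close together in time, while a slow one yields frames that are far apart.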
However, it is often preferable actively to design the characteristics of interactive interfaces according to the invention, such that the ways in which the end user can interact with the video information are limited or channeled in preferred directions. This is particularly true in the case of video sequences which are advertisements or are used in educational software and the like.
Thus, the present invention provides a toolkit for use in creation of customized interfaces. In preferred embodiments, the toolkit enables a designer to tailor the configuration and content of the root image, as well as to specify which objects in the video sequence are “hot objects” and to control the way in which the displayed interface image will change in response to manipulation by an end user. Thus, among other things, the toolkit enables the interface designer to determine which frames of the video sequence should be used as basic frames in the root image, and how many additional frames are added to the displayed image when the user designates a viewing position close to the root image.
According to another aspect, the invention relates to network distribution and management of interactive video and multi-media containers. A need exists for methods and systems for transmitting video and other multi-media files across a network, such as the Internet. U.S. Pat. No. 5,956,716 to Kenner et al. provides an example of a system and method for the delivery of video data over a computer network. In Kenner, a user uses a multimedia terminal to send a request for video clips from a database. A local storage and retrieval module receives and processes video clip requests and a primary index manager causes the distribution of video clips among a plurality of extended storage and retrieval modules. The extended storage and retrieval modules store a plurality of databases including those that contain video clips. A data sequencing interface directs the extended storage and retrieval module to download the requested video clips. The video clips are then downloaded to the multimedia terminal via the local storage and retrieval module.
Systems and methods according to the invention provide for the network distribution and management of interactive video and multi-media containers. Such systems and methods can distribute not only video and other multi-media files but also multi-media containers. Consequently, users are able to access information concerning the multi-media files in a highly versatile manner. Systems and methods according to the invention also enable the transmission of information both to and from the users. Thus, systems and methods according to the invention provide for collaboration between users. For instance, work performed by one user in indexing or in providing annotations is not restricted to just that user but can be shared with others having access to the multi-media file. Other advantages and benefits of the invention are provided in the following description and will be apparent to those skilled in the art.