Technology exists for recording meetings, presentations, television shows, sporting events, and the like. In many situations, a viewer of such a recording, or media item, would like to identify and navigate to specific sections that are of interest, without having to watch the entire recording.
When viewing or listening to a video or audio recording of a presentation or other stored, temporally linear media item, it is often difficult and time-consuming to find those portions of the recording that are of particular interest. Unlike text-based or image-based media, such recordings are difficult to parse effectively so as to skip unimportant sections and identify important sections.
Viewers can speed-search when watching a recording, in order to view the video portion at increased speed; typically, however, the audio portion is silenced during speed-search operations (and in any case would generally be undecipherable if it were not silenced). For presentations consisting primarily of one or more individuals talking, the subject matter being discussed is not readily discernible from the visual component of the recording. In such situations, speed-searching is of little use in quickly navigating to interesting sections.
In addition, even when the visual component does reveal something about the subject matter being discussed (for example, if a presentation includes overhead projection of slides), it can still be time-consuming for a viewer to speed-search through a lengthy recording, or several recordings, at a speed that is slow enough to interpret the visual component and ascertain the subject matter being discussed. Furthermore, speed-searching in such a manner requires undivided concentration on the part of the viewer, so as not to miss a slide or other visual component in the fleeting moment it appears on the screen.
Manual generation of an index is possible. For example, conventional DVDs usually provide a chapter index that allows a viewer to skip to a particular section that is of interest. An on-screen menu and/or printed insert accompanying the disc provide chapter titles, still frames, and/or other information about each indexed chapter, so as to assist the viewer in determining which section to watch. However, such indexes are manually, not automatically, generated.
Transcription of a recording yields a scannable text representation that can be associated, for example via real-time counter values, with the original recording. A user can skim the transcription, identify the section of interest, and then navigate to that section in the recording. In some cases, the transcription itself may provide sufficient information that the user need not even consult the original recording. However, such transcriptions often omit important information (such as visual components accompanying the dialogue, including gestures, demonstrations, and the like); furthermore, creating and indexing a transcription often must be done manually.
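By way of illustration only, the association between a transcription and counter values of the original recording might be sketched as follows. The `Segment` type, the `find_offsets` function, and the sample data are hypothetical names and values introduced for this sketch; they are not taken from any particular prior art system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float  # counter value, in seconds, where this segment begins
    text: str       # transcribed text for this segment


def find_offsets(transcript, phrase):
    """Return the counter value of every segment whose text mentions phrase."""
    needle = phrase.lower()
    return [seg.start_s for seg in transcript if needle in seg.text.lower()]


# Hypothetical transcription of a short meeting recording.
transcript = [
    Segment(0.0, "Welcome and agenda."),
    Segment(95.0, "Quarterly budget review."),
    Segment(410.5, "Open questions on the budget."),
]

# A user skimming for "budget" obtains the positions to navigate to.
offsets = find_offsets(transcript, "budget")
```

Such a lookup is only as good as the transcription itself: material that was never transcribed, such as a gesture or a demonstration, cannot be found this way.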
Some prior art systems attempt to deduce which sections of a recording are of interest by analyzing noise levels, sound localizations, scene changes, and the like. Some systems detect events such as silences, applause, slide transitions, and the like. Systems employing such technology are described, for example, in Girgensohn et al., U.S. Pat. No. 6,366,296, entitled “Media Browser Using Multimodal Analysis” and Girgensohn et al., U.S. patent application Publication No. 2002/0054083A1, entitled “Media Browser Using Multimodal Analysis.”
Often, however, such measurable characteristics of the recording are unreliable indicators of which sections are important or of interest. For example, an increase in noise level can result from the laughter following a joke told by the speaker, or it can indicate applause at the introduction of a new speaker, or shuffling during a break, or a heated discussion, or any of a number of other events, some of which are of interest and some of which are not. Conversely, interesting sections of the presentation may be relatively quiet, as the audience watches raptly. Indexing based on noise levels or similar measurable metrics thus fails to accurately reflect the level of interest of any given section of the recording.
Generally, existing methods of identifying sections of interest are either inaccurate or too time-consuming and impractical to be employed for large quantities of recordings of routine presentations, meetings, and the like. As a result, useful information that is buried within video and audio recordings often goes unwatched and is effectively irretrievable.
Interclipper, available from DocuMat LLC of Newark, N.J., allows users to bookmark significant sections of a video recording by clicking a button. Several individuals can bookmark highlights of the same event and then retrieve their own highlights separately using specially coded markers. There is no indication, however, that Interclipper collates the bookmarks in any way so as to generate an overall level-of-interest metric for various sections of the recording. See also L. He et al., “Auto-Summarization of Audio-Video Presentations,” in Proc. Multimedia '99, 1999.
Minneman, S. L. and Harrison, S. R., “Where Were We: Making and Using Near-Synchronous Pre-Narrative Video,” Proc. ACM Multimedia (MM'93), August 1993, Anaheim, USA, pp. 207-214, describes a system that allows an attendee to add annotations in real time during a presentation, but does not process or collate the annotations.
K. Weber and A. Poon, “Marquee: A Tool for Real-Time Video Logging,” Proc. CHI 94, ACM Press, New York, 1994, pp. 58-64, describes a pen-based video logging tool that allows users to take notes in real time during a presentation and associate the notes with a video stream recording of the presentation. However, Marquee does not provide functionality for collating or combining annotations made by multiple users to generate a level-of-interest metric.
Chiu et al., U.S. patent application Publication No. US2002/0161804A1, entitled “Internet-Based System for Multimedia Meeting Minutes,” describes a note-taking system that is capable of synchronizing entered notes with a multimedia stream, by automatically associating received notations with appropriate portions of the multimedia stream. However, Chiu et al. does not describe techniques for collating or combining annotations made by multiple users to generate a level-of-interest metric.
R. C. Davis et al., “NotePals: Lightweight Note Taking by the Group, for the Group,” UC Berkeley Computer Science Division Technical Report UCB//CSD-98-997, describes a collaborative note-taking system that allows multiple users to share their notes. There is no description, however, of any techniques for collating or combining annotations made by multiple users to generate a level-of-interest metric for a presentation.
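To make concrete what is meant above by collating annotations made by multiple users into a level-of-interest metric, one simple possibility is sketched below. This is an illustrative assumption only: the `interest_by_interval` function, the fixed 60-second interval, and the choice to count distinct users per interval are all hypothetical, and none of the systems discussed above is described as operating in this manner.

```python
from collections import defaultdict


def interest_by_interval(bookmarks, interval_s=60.0):
    """Count the distinct users who bookmarked each fixed-length interval.

    bookmarks: iterable of (user_id, time_s) pairs, where time_s is the
    position of the bookmark in the recording, in seconds.
    Returns a dict mapping interval index to number of distinct users.
    """
    users_per_bin = defaultdict(set)
    for user_id, time_s in bookmarks:
        # Assumption: a bookmark belongs to the interval containing its time.
        users_per_bin[int(time_s // interval_s)].add(user_id)
    return {idx: len(users) for idx, users in sorted(users_per_bin.items())}


# Hypothetical bookmarks entered by three attendees of the same presentation.
bookmarks = [
    ("alice", 65.0), ("bob", 70.0), ("alice", 80.0),
    ("carol", 100.0), ("bob", 400.0),
]

levels = interest_by_interval(bookmarks)
```

In this sketch, repeated bookmarks by the same user within one interval count once, so that the resulting figure reflects how many attendees marked a section rather than how often any one attendee did so.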
What is needed is an effective and reliable technique for automatically determining and presenting level-of-interest indicators for a video or audio recording, without requiring manual effort. What is further needed is a technique that allows a viewer to determine what sections of such a recording are of interest, and to easily navigate to such sections. What is further needed is a technique that avoids the limitations of the prior art, as discussed above.