1. Field of the Invention
The present invention generally relates to systems that search and index different modes of media (audio, text, video, etc.) based upon a query, and more particularly to an improved system that indexes one form of media using keywords from another form of media, based on a relevance-based time overlap between the matches in the different media modes.
2. Description of the Related Art
Information can be stored in many different forms (modes). Before the advent of audio recordings, the only way to record information was in writing, using words, symbols, and/or numbers. Subsequently, audio and video recordings were used to supplement or replace written information. Regardless of the mode in which information is recorded, there is always a need to search and index the information so that only relevant portions need to be reviewed when the user has a question (query) on a very specific topic.
Conventional searches primarily involve keyword queries of previously created text or indexes. Thus, it is common to perform a simple Boolean combination such as AND/OR, or to perform a search based on individual relevance scores of the textual data. However, with the increasing use of different media modes to record information, there is a need to logically search video, audio, and graphics, as well as textual information.
Detecting and recognizing slides or transparencies used by a speaker in a video is important for detecting teaching-related events in video. This is particularly important in the domain of distributed or distance learning, whose long-standing goal has been to provide a quality of learning comparable to the face-to-face environment of a traditional classroom for teaching or training. Effective preparation of online multimedia courses or training material for users is currently beset by problems, including the high cost of manual indexing, slow turnaround, and inconsistencies from human interpretation. Automatic methods of cross-linking and indexing multimedia information are very desirable in such applications, as they can provide an ability to respond to higher level semantic queries, such as for the retrieval of learning material relating to a topic of discussion.
Automatically cross-linking multimedia information, however, is a non-trivial problem, as it requires the detection and identification of events whose common threads appear in multiple information modalities. An example of such an event is a point in a video where a topic was discussed. From a survey of the distance learning community, it has been found that the single most useful query for students is a query for a topic of interest in a long recorded video of a course lecture. Such classroom lectures and talks are often accompanied by foils (also called slides), some of which convey the topic being discussed at that point in time. When such lectures are videotaped, at least one of the cameras can be used to capture the displayed slide, so that the visual appearance of a slide in video can be a good indication of the beginning of a discussion relating to a topic.
The recognition of slides in the video, however, is a challenging problem for several reasons. First, the imaging scenario in which a lecture or talk was taped could take a variety of forms. There could be a single camera looking at the speaker, the screen showing the slide, and the audience. Also, the camera often zooms or pans. Alternatively, a single camera could be dedicated to looking at the screen, or the overhead projector where the transparency is being displayed, while another camera looks at the speaker. The final video, in this case, may have been edited to merge the two video streams.
Thus, the slide appearing in a video stream could consume an entire video frame or be a region in a video frame. Secondly, depending on the distance between the camera and the projected slide, the scale at which the slide image appears in the video could be reduced, making it difficult to read the text on the slide and/or to detect text using conventional text recognition methods. Additionally, the color on the slides undergoes transformation due to projection geometry, lighting conditions of the scene, noise in the video capture, and MPEG conversion. Finally, the slide image often appears occluded and/or skewed. For example, the slide image may be only partially visible because another object (e.g., the person giving the talk) is covering or blocking the slide image.
Previous approaches to slide detection have worked on the premise that the camera is focused on the slides, so that simple change detection through frame subtraction can be used to detect changes in slides. There has been some work done in the multimedia authoring community to address this problem from the standpoint of synchronizing foils with video. The predominant approach has been to do on-line synchronization, using a structured note-taking environment to record the times of slide changes electronically and synchronize them with the video stream.
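By way of illustration only, change detection through frame subtraction can be sketched as follows. The sketch assumes grayscale frames given as nested lists of pixel intensities and a camera fixed on the slide; the mean absolute difference measure, the threshold value, and the function names are assumptions made for this example and are not taken from the approaches cited above.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two equal-sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def detect_slide_changes(frames, threshold=30.0):
    """Return each index i where frame i differs from frame i-1 by more than threshold."""
    changes = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            changes.append(i)
    return changes


# Toy 2x2 frames: one slide, the same slide with capture noise, then a new slide.
slide_a = [[200, 200], [200, 200]]
slide_a_noisy = [[198, 201], [200, 199]]
slide_b = [[40, 220], [40, 220]]
frames = [slide_a, slide_a_noisy, slide_b, slide_b]
changes = detect_slide_changes(frames)  # a change is reported at index 2
```

Such a scheme depends on the slide filling the frame; a zooming or panning camera, or a speaker walking in front of the screen, would also produce large frame differences.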
Current presentation environments such as Lotus Freelance or PowerPoint have features that can also record the change times of slides when a presentation is run in rehearse mode. In distributed learning, however, there is often a need for off-line synchronization of slides, since they are often provided by the teacher after a video recording of the lecture has been made.
The detection of foils in a video stream under these settings can be challenging. A solution to this problem has been proposed for a two-camera geometry, in which one of the cameras was fixed on the screen depicting the slide. Since this was a more-or-less calibrated setting, the boundary of the slide was visible so that the task of selecting a slide-containing region in a video frame was made easy. Further, corners of the visible quadrilateral structure could be used to solve for the 'pose' of the slide under the general projective transform.
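As a hedged illustration of solving for pose from four corners, the following sketch estimates the eight parameters of a projective transform (homography) that maps the corners of a slide onto the corners of the quadrilateral seen in a frame. It uses the standard eight-equation linear system with the ninth parameter fixed at 1; the function names, the corner coordinates, and the use of plain Gaussian elimination are assumptions made for this example, not details of the prior solution referred to above.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_corners(src, dst):
    """Estimate h[0..7] so that each src corner (x, y) maps to its dst corner (u, v)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (h6*x + h7*y + 1) = h0*x + h1*y + h2, and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)


def apply_homography(h, point):
    """Map a slide point into the frame using the estimated parameters."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners: a 4:3 slide and the skewed quadrilateral it occupies in a frame.
src = [(0, 0), (4, 0), (4, 3), (0, 3)]
dst = [(10, 20), (90, 30), (80, 95), (15, 85)]
h = homography_from_corners(src, dst)
```

This only works when all four corners are visible and correctly located, which is the calibrated two-camera situation described above.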
Therefore, there is a need for a system which can recognize slides, regardless of whether the slides are distorted, blurred, or partially blocked. A system and process for "recognizing" slides and indexing the video using such information has not been explored before. The inventive approach to foil detection is meant to consider more general imaging situations involving one or more cameras, and greater variations in scale, pose and occlusions.
Other related art includes the principle of geometric hashing and its variants [Lamdan and Wolfson, Proc. Int. Conf. Computer Vision (ICCV), 1988; Rigoutsos and Wolfson, Spl. Issue on geometric hashing, IEEE Computational Science and Engg., 1997; Wolfson, "On curve matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 483-489, 1990, incorporated herein by reference], which has been applied earlier to the problem of model indexing in computer vision, and the related technique of location hashing, which has also been disclosed earlier [Syeda-Mahmood, Proc. Intl. Conf. Computer Vision and Pattern Recognition (CVPR), 1999, incorporated herein by reference].
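For orientation, the general principle of geometric hashing can be sketched as follows. Each model point is expressed in affine-invariant coordinates relative to a basis triple of model points, the quantized coordinates are stored in a hash table, and at recognition time a scene basis votes for model bases that stored the same coordinates. This is a textbook-style sketch under stated assumptions (small point sets, every ordered triple used as a basis, a bin size of 0.25, invented point coordinates); it is not the location hashing variant cited above.

```python
from collections import defaultdict
from itertools import permutations


def affine_coords(basis, q):
    """Coordinates (a, b) with q = p0 + a*(p1 - p0) + b*(p2 - p0); None if basis is collinear."""
    (x0, y0), (x1, y1), (x2, y2) = basis
    e1 = (x1 - x0, y1 - y0)
    e2 = (x2 - x0, y2 - y0)
    det = e1[0] * e2[1] - e1[1] * e2[0]
    if abs(det) < 1e-9:
        return None
    dx, dy = q[0] - x0, q[1] - y0
    a = (dx * e2[1] - dy * e2[0]) / det
    b = (e1[0] * dy - e1[1] * dx) / det
    return a, b


def quantize(coords, step):
    """Map continuous coordinates to a hash bin."""
    return (round(coords[0] / step), round(coords[1] / step))


def build_table(models, step=0.25):
    """Store (model name, basis indices) under the bin of every non-basis point."""
    table = defaultdict(list)
    for name, points in models.items():
        for i, j, k in permutations(range(len(points)), 3):
            basis = (points[i], points[j], points[k])
            for idx, q in enumerate(points):
                if idx in (i, j, k):
                    continue
                coords = affine_coords(basis, q)
                if coords is not None:
                    table[quantize(coords, step)].append((name, (i, j, k)))
    return table


def vote(table, scene_points, basis_idx=(0, 1, 2), step=0.25):
    """Count, for one scene basis, how many scene points agree with each stored entry."""
    votes = defaultdict(int)
    basis = tuple(scene_points[i] for i in basis_idx)
    for idx, q in enumerate(scene_points):
        if idx in basis_idx:
            continue
        coords = affine_coords(basis, q)
        if coords is None:
            continue
        for entry in table.get(quantize(coords, step), []):
            votes[entry] += 1
    return votes


# Hypothetical models; the scene is model "A" under an affine transformation.
models = {
    "A": [(0, 0), (4, 0), (0, 4), (3, 2), (1, 3)],
    "B": [(0, 0), (5, 1), (1, 4), (4, 4), (2, 1)],
}
table = build_table(models, step=0.25)
scene = [(2 * x + y + 10, -x + 3 * y + 5) for x, y in models["A"]]
votes = vote(table, scene, basis_idx=(0, 1, 2), step=0.25)
```

In this example the entry for model "A" with basis (0, 1, 2) receives one vote from each of the two non-basis scene points, because affine coordinates are unchanged by the transformation. A practical system would try several scene bases and verify the top-scoring candidates.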
It is, therefore, an object of the present invention to provide a structure and method for indexing multi-media data comprising deriving keywords from a first media type (slides), matching the keywords to a second media type (video), identifying an appearance of the first media type in the second media type, and calculating a co-occurrence of the keywords and the appearance of the first media type in the second media type. The invention produces an index of the second media type based on the co-occurrence.
The identifying of the appearance of the first media type in the second media type includes generating keyframes from the second media type, extracting geometric keyframe features from the keyframes and geometric slide features from the first media type, and matching the geometric slide features and the geometric keyframe features.
The invention further identifies background matching regions in the keyframes having colors matching colors of backgrounds in the second media type. The extracting of the geometric keyframe features is performed only in the background matching regions. The matching identifies which portion of the first media type has a highest number of geometric slide features matching geometric keyframe features in a keyframe.
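A minimal sketch of restricting feature extraction to a background-matching region is given below. It assumes a keyframe represented as rows of RGB tuples and a single known reference background color, and it returns the bounding box of pixels within a tolerance of that color. The distance measure, the tolerance, and the names are assumptions made for illustration; the text above does not specify how color matching is performed.

```python
def color_distance(c1, c2):
    """Largest per-channel difference between two RGB colors."""
    return max(abs(a - b) for a, b in zip(c1, c2))


def background_region(keyframe, background, tolerance=40):
    """Bounding box (top, left, bottom, right) of pixels near the background color, or None."""
    rows = [r for r, row in enumerate(keyframe)
            if any(color_distance(p, background) <= tolerance for p in row)]
    cols = [c for row in keyframe for c, p in enumerate(row)
            if color_distance(p, background) <= tolerance]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 4x5 keyframe: a dark room with a bluish projected slide in rows 1-2, columns 2-4.
dark = (20, 20, 20)
blue = (30, 60, 200)
keyframe = [
    [dark, dark, dark, dark, dark],
    [dark, dark, blue, blue, blue],
    [dark, dark, blue, blue, blue],
    [dark, dark, dark, dark, dark],
]
region = background_region(keyframe, background=(25, 50, 210))
```

Edge and curve extraction would then be run only inside the returned box, which reduces the number of spurious features from the speaker and the audience.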
The extracting process includes identifying changes in image intensity as edges, forming curves connecting the edges, identifying corners where the curves change direction, grouping the curves into curve-groups, and designating a sequence of three consecutive features in each of the curve-groups as basis triples. The matching process comprises computing coordinates of the basis triples, and identifying which portion of the first media type has a highest number of basis triples matching basis triples in a keyframe. Generating the keyframes comprises dividing the second media type into portions based upon scene changes and selecting one frame from each portion of the second media type as a keyframe. The calculating process comprises processing the multi-media data to extract relevance scores and time reference points of matches to individual media modes, identifying overlapping time periods of matching keywords and the appearance of the first media type in the second media type, and ranking the relevance of the overlapping time periods. The ranking can include finding an overlapping time period having a highest relevance score, segmenting the overlapping time period to identify beginning and ending events, and finding the largest number of different modes of overlap. Such modes can comprise audio, video, text, and graphic display.
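The calculating process described above can be illustrated with a short sketch that finds overlapping time periods between keyword matches and slide appearances and ranks them by relevance. Each match is assumed to be a (start, end, score) tuple with times in seconds, and the combined relevance of an overlap is assumed to be the sum of the two scores; the combination rule, the sample values, and the function names are illustrative assumptions only.

```python
def overlap(a, b):
    """Return the (start, end) period shared by two matches, or None if they do not overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start < end else None


def rank_cooccurrences(keyword_hits, slide_hits):
    """List (combined score, overlapping period) pairs, highest combined score first."""
    ranked = []
    for k in keyword_hits:
        for s in slide_hits:
            span = overlap(k, s)
            if span is not None:
                ranked.append((k[2] + s[2], span))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical matches: (start seconds, end seconds, relevance score).
keyword_hits = [(10, 40, 0.5), (100, 130, 0.75), (300, 320, 0.25)]
slide_hits = [(30, 120, 0.5), (310, 400, 0.25)]
ranked = rank_cooccurrences(keyword_hits, slide_hits)
# ranked == [(1.25, (100, 120)), (1.0, (30, 40)), (0.5, (310, 320))]
```

The first entry corresponds to the overlapping time period having the highest relevance score; an index of the video could then point to the start of each ranked period.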