1. Field of the Invention
The present invention generally relates to matching electronic slides, or the images contained in electronic slides, to their appearance in a video stream.
2. Description of the Related Art
Detecting and recognizing slides or transparencies used by a speaker in a video is important for detecting teaching-related events in video. This is particularly important in the domain of distributed or distance learning, whose long-standing goal has been to provide a quality of learning comparable to the face-to-face environment of a traditional classroom for teaching or training. Effective preparation of online multimedia courses or training material for users is currently hampered by the high cost of manual indexing, slow turnaround, and inconsistencies arising from human interpretation. Automatic methods of cross-linking and indexing multimedia information are very desirable in such applications, as they can provide an ability to respond to higher-level semantic queries, such as for the retrieval of learning material relating to a topic of discussion. Automatic cross-linking of multimedia information, however, is a non-trivial problem, as it requires the detection and identification of events whose common threads appear in multiple information modalities. An example of such an event is a point in a video where a topic was discussed. From a survey of the distance learning community, it has been found that the query students find most useful is the query for a topic of interest in a long recorded video of a course lecture. Such classroom lectures and talks are often accompanied by foils (also called slides), some of which convey the topic being discussed at that point in time. When such lectures are videotaped, at least one of the cameras used captures the displayed slide, so that the visual appearance of a slide in video can be a good indication of the beginning of a discussion relating to a topic.
The recognition of slides in the video, however, is a challenging problem for several reasons. First, the imaging scenario in which a lecture or talk was taped could take a variety of forms. There could be a single camera looking at the speaker, the screen showing the slide, and the audience. Also, the camera often zooms or pans. Alternatively, a single camera could be dedicated to looking at the screen, or at the overhead projector where the transparency is being displayed, while another camera looks at the speaker. The final video in this case may have been edited to merge the two video streams.
Thus, the slide appearing in a video stream could consume an entire video frame or be a region in a video frame. Second, depending on the distance between the camera and the projected slide, the scale at which the slide image appears in the video could be reduced, making it difficult to read the text on the slide and/or making it hard to detect text using conventional text recognition methods. Additionally, the color on the slides undergoes transformation due to projection geometry, lighting conditions of the scene, noise in the video capture, and MPEG conversion. Finally, the slide image often appears occluded and/or skewed. For example, the slide image may be only partially visible because another object (e.g., the person giving the talk) is covering or blocking the slide image.
Previous approaches to slide detection have worked on the premise that the camera is focused on the slides, so that a simple change detection through frame subtraction can be used to detect changes in slides. There has been some work done in the multimedia authoring community to address this problem from the point of view of synchronizing foils with video. The predominant approach has been to do on-line synchronization, using a structured note-taking environment to record the times of slide changes electronically and synchronize them with the video stream. Current presentation environments such as Lotus Freelance or PowerPoint have features that can also record the change time of slides when run in rehearse mode. In distributed learning, however, there is often a need for off-line synchronization of slides, since they are often provided by the teacher after a video recording of the lecture has been made. The detection of foils in a video stream under these settings can be challenging. A solution to this problem has been proposed for a two-camera geometry, in which one of the cameras was fixed on the screen depicting the slide. Since this was a more-or-less calibrated setting, the boundary of the slide was visible, so that the task of selecting a slide-containing region in a video frame was made easy. Further, corners of the visible quadrilateral structure could be used to solve for the 'pose' of the slide under the general projective transform. Therefore, there is a need for a system which can recognize slides, regardless of whether the slides are distorted, blurred, or partially blocked. A system and process for "recognizing" slides has not been explored before. The inventive approach to foil detection is meant to consider more general imaging situations involving one or more cameras, and greater variations in scale, pose, and occlusions.
Other related art includes the principle of geometric hashing and its variants [Lamdan and Wolfson, Proc. Int. Conf. Computer Vision (ICCV), 1988; Rigoutsos and Wolfson, Spl. Issue on geometric hashing, IEEE Computational Science and Engineering, 1997; Wolfson, "On curve matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 483-489, 1990, incorporated herein by reference], which has been applied earlier to the problem of model indexing in computer vision, and the related technique of location hashing, which has also been disclosed earlier [Syeda-Mahmood, Proc. Intl. Conf. Computer Vision and Pattern Recognition, CVPR 1999, incorporated herein by reference].
It is, therefore, an object of the present invention to match electronic slides to their appearance in a video stream. This is accomplished by first generating a set of keyframes from the video stream. Next, likely matches between electronic slide images and images in the keyframes are identified by color-matching the slide images with regions on the keyframes. Geometric features are then extracted from the slide images and from the regions within the keyframes. The invention reduces the structure of the slide images and the keyframe images to a corresponding set of affine coordinates. The invention places the slide image affine coordinates in a balanced binary search tree, called a hash tree, for efficiency of search.
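As an illustration of the reduction to affine coordinates described above, the following minimal sketch assumes that each image has already been reduced to 2-D feature points and that three non-collinear points serve as a basis. The function names `affine_coords` and `apply_affine` are hypothetical and are not taken from the disclosure. The sketch relies on a standard property: the coordinates of a point relative to such a basis do not change when the same affine transformation is applied to every point, which is what makes them usable as index keys.

```python
from typing import Tuple

Point = Tuple[float, float]
Matrix = Tuple[Tuple[float, float], Tuple[float, float]]


def affine_coords(o: Point, p1: Point, p2: Point, q: Point) -> Tuple[float, float]:
    """Return (a, b) such that q - o = a * (p1 - o) + b * (p2 - o).

    o, p1, p2 form the basis triple (an assumption of this sketch);
    they must not be collinear.
    """
    ux, uy = p1[0] - o[0], p1[1] - o[1]
    vx, vy = p2[0] - o[0], p2[1] - o[1]
    det = ux * vy - uy * vx
    if abs(det) < 1e-12:
        raise ValueError("basis points are collinear")
    wx, wy = q[0] - o[0], q[1] - o[1]
    # Cramer's rule for the 2x2 linear system.
    a = (wx * vy - wy * vx) / det
    b = (ux * wy - uy * wx) / det
    return a, b


def apply_affine(m: Matrix, t: Point, p: Point) -> Point:
    """Apply the affine map p -> m * p + t (used here only to demonstrate invariance)."""
    return (
        m[0][0] * p[0] + m[0][1] * p[1] + t[0],
        m[1][0] * p[0] + m[1][1] * p[1] + t[1],
    )
```

Computing `affine_coords` for a point on a slide and again for the same point after `apply_affine` has been applied to the point and to all three basis points should give the same pair, up to floating-point error. Real video introduces noise and projective effects, so the invariance holds only approximately in practice.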
The invention matches the slide images and keyframe images by indexing the hash tree with the keyframe image affine coordinates. When matches to the keyframe image affine coordinates are found in the hash tree, the invention records hits in a histogram. Lastly, each keyframe is paired with the slide image with the most hits, and the paired keyframe is designated as containing the corresponding slide image.
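The indexing and voting steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: a sorted list searched with binary search stands in for the balanced binary search tree, affine coordinates are quantized with an arbitrary step of 0.05 so that nearby values share a key, and the names `quantize`, `build_index`, `vote`, and `best_slide` are hypothetical.

```python
import bisect
from collections import Counter
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float]
Key = Tuple[int, int]


def quantize(coord: Coord, step: float = 0.05) -> Key:
    """Map an affine coordinate pair to an integer bin (step size is an assumption)."""
    return (round(coord[0] / step), round(coord[1] / step))


def build_index(slides: Dict[str, List[Coord]], step: float = 0.05) -> List[Tuple[Key, str]]:
    """Build a sorted list of (key, slide_id) entries.

    The sorted list plus binary search is a stand-in for the hash tree.
    """
    entries = {
        (quantize(c, step), slide_id)
        for slide_id, coords in slides.items()
        for c in coords
    }
    return sorted(entries)


def vote(index: List[Tuple[Key, str]], keyframe_coords: List[Coord], step: float = 0.05) -> Counter:
    """Look up each keyframe coordinate in the index and record hits per slide."""
    hits: Counter = Counter()
    for c in keyframe_coords:
        key = quantize(c, step)
        # (key,) sorts before every (key, slide_id), so this finds the first entry for key.
        i = bisect.bisect_left(index, (key,))
        while i < len(index) and index[i][0] == key:
            hits[index[i][1]] += 1
            i += 1
    return hits


def best_slide(hits: Counter) -> Optional[str]:
    """Return the slide with the most hits, or None if nothing matched."""
    return hits.most_common(1)[0][0] if hits else None
```

Each lookup costs a binary search over the indexed slide coordinates rather than a comparison against every slide. Fixed-width bins are a simplification: two nearly equal coordinates that fall on opposite sides of a bin boundary will not match in this sketch.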
In summary, the present invention provides a method for matching slides to video, comprising generating keyframes from the video, extracting geometric keyframe features from the keyframes and geometric slide features from the slides, and matching the geometric slide features and the geometric keyframe features via efficient indexing using a hash tree, so as to avoid an exhaustive search of slides per video frame.