1. Field of the Invention
The present invention relates to the field of processing audio-visual recordings for the purpose of automatically indexing the recordings according to content. Specifically, the present invention pertains to the field of finding segments in recorded meetings that correspond to individual oral presentations.
2. Discussion of the Related Art
Conventional approaches were concerned with segmenting audio only; thus there was no video channel to exploit. Uniform-duration windows have been used to provide initial data for speaker clustering. This led to problems with the initial segmentation, as only short windows at arbitrary times could be used for the initial clustering. If the windows were too long, the chance of capturing multiple speakers in a single window was high; if the windows were too short, they provided insufficient data for good clustering. In the absence of additional cues, windows often overlapped a change in speaker, making them less useful for clustering. Most conventional segmentation work has also been based primarily on audio, for example, meeting segmentation using speech recognition from close-talking lapel microphones.
Many meetings contain slide presentations by one or more speakers, for example, weekly staff meetings. These meetings are often recorded on audio-visual recording media for future review and reuse. For browsing and retrieval of the contents of such meetings, it is useful to locate the time extent, for example the start and end time, of each individual oral presentation within the recorded meetings.
According to the present invention, automatic image recognition provides cues for audio-based speaker identification to precisely segment individual presentations within video recordings of meetings. Video transform feature vectors are used to identify video frame intervals known to be associated with single-speaker audio intervals. The audio intervals are used to train a speaker recognition system for audio segmentation of the audio-visual recording.
In a presently preferred embodiment, it is assumed that single-speaker oral presentations include intervals where slides are being displayed, and that a particular speaker will speak for the entire duration that each slide is being displayed. Single-speaker regions of the audio-visual recording are identified in the video by searching for extended intervals of slide images. Slide intervals are automatically detected, and the audio in these regions is used to train an audio speaker spotting system. A single speaker's presentation may also contain camera shots of the speaker and audience. Since a presentation by a given speaker can span multiple slide intervals, the audio intervals corresponding to the slide intervals are clustered by audio similarity to find the number and order of the speakers giving presentations in the video. After clustering, all audio data from a single speaker is used to train a source-specific speaker model for identifying a specific speaker from the audio portion of the audio-visual recording. The audio is then segmented using the speaker spotting system, yielding a sequence of single-speaker intervals for indexing against the audio-visual recording.
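By way of illustration only, the search for extended intervals of slide images may be sketched as follows. The sketch assumes that a per-frame classifier has already labeled each frame as slide or non-slide; the function name `find_slide_intervals`, the frame rate, and the minimum-duration threshold are hypothetical and are not specified by the present disclosure.

```python
from typing import List, Tuple


def find_slide_intervals(
    is_slide: List[bool],
    fps: float = 1.0,          # assumed frame sampling rate (hypothetical)
    min_seconds: float = 20.0  # assumed minimum length of an "extended" interval
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) for each run of slide frames
    lasting at least min_seconds. End times are exclusive."""
    intervals = []
    run_start = None
    # A trailing False sentinel closes a run that reaches the final frame.
    for i, flag in enumerate(list(is_slide) + [False]):
        if flag and run_start is None:
            run_start = i                      # a slide run begins
        elif not flag and run_start is not None:
            start, end = run_start / fps, i / fps
            if end - start >= min_seconds:     # keep only extended runs
                intervals.append((start, end))
            run_start = None
    return intervals


# Example: three runs of slide frames; the 10-second run is discarded.
labels = [False] * 5 + [True] * 30 + [False] * 3 + [True] * 10 + [False] * 2 + [True] * 25
print(find_slide_intervals(labels))
```

The audio falling within each returned interval would then serve as candidate single-speaker training data for the clustering step described above.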
Alternatively, video analyses searching for members of video image classes other than slides, such as single-face detection, or detecting a person standing in front of a podium, instead of slide detection, are used to detect intervals for which the audio comes from a single speaker. Any detectable feature in the video known to be associated with a single speaker can be used according to the present invention. In general, any video image class known to be correlated with single-speaker audio intervals is used to detect intervals for which the audio comes from a single speaker.
In an alternative embodiment, face recognition detects frame intervals corresponding to each speaker. In this embodiment, face recognition of a specific speaker associates video intervals of that speaker with audio intervals from that speaker. Thus, face recognition replaces the audio clustering method of the preferred embodiment, which distinguishes audio intervals from different speakers. Face recognition associates recognized frames with speech from one speaker. For example, first and second video image classes corresponding to first and second speakers are used to detect frame intervals corresponding to the first and second speakers, respectively.
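As a hedged illustration of this alternative, the grouping of recognized frames into per-speaker frame intervals may be sketched as follows. The sketch assumes that a face recognizer has already assigned each frame a speaker label, or `None` where no known face is recognized; the function name `intervals_by_speaker` and the label values are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple


def intervals_by_speaker(
    labels: List[Optional[str]]
) -> Dict[str, List[Tuple[int, int]]]:
    """Group consecutive frames carrying the same speaker label into
    (start_frame, end_frame) intervals, with end_frame exclusive."""
    out: Dict[str, List[Tuple[int, int]]] = {}
    pos = 0
    for speaker, run in groupby(labels):
        n = sum(1 for _ in run)               # length of this run of frames
        if speaker is not None:               # skip frames with no recognized face
            out.setdefault(speaker, []).append((pos, pos + n))
        pos += n
    return out


# Example: speaker "A" appears in two separate intervals.
frame_labels = ["A", "A", None, "B", "B", "B", "A"]
print(intervals_by_speaker(frame_labels))
```

Because each interval is already tied to one recognized speaker, the audio in those intervals can be pooled per speaker without a separate audio clustering step.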
According to the present invention, regions of recorded meetings corresponding to individual presentations are automatically found. Once presentations have been located, the region information may be used for indexing and browsing the video. In cases where there is an agenda associated with the meeting, located presentations can be automatically labeled with information obtained from the agenda. This allows presentations to be easily found by presenter and topic.
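The agenda-based labeling may be sketched as follows, purely as an example. The sketch assumes that the located presentations and the agenda entries occur in the same order, one agenda entry per presentation; this ordering assumption and the names `label_presentations` and `find_by_presenter` are hypothetical and are not specified by the present disclosure.

```python
from typing import Dict, List, Optional, Tuple


def label_presentations(
    segments: List[Tuple[float, float]],  # (start_sec, end_sec), in time order
    agenda: List[Tuple[str, str]],        # (presenter, topic), in agenda order
) -> List[Dict[str, Optional[object]]]:
    """Attach the i-th agenda entry to the i-th located presentation.
    Presentations beyond the end of the agenda are left unlabeled."""
    labeled = []
    for i, (start, end) in enumerate(segments):
        presenter, topic = agenda[i] if i < len(agenda) else (None, None)
        labeled.append(
            {"start": start, "end": end, "presenter": presenter, "topic": topic}
        )
    return labeled


def find_by_presenter(labeled, name: str) -> List[Tuple[float, float]]:
    """Return the time extents of all presentations by the named presenter."""
    return [(p["start"], p["end"]) for p in labeled if p["presenter"] == name]


# Example with made-up times and agenda entries.
segments = [(5.0, 600.0), (640.0, 1500.0), (1550.0, 2000.0)]
agenda = [("Alice", "Budget"), ("Bob", "Roadmap")]
index = label_presentations(segments, agenda)
print(find_by_presenter(index, "Bob"))
```

Leaving unmatched presentations unlabeled, rather than guessing, keeps the index from asserting a presenter or topic that the agenda does not support.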
The methods of the present invention are easily extended across multiple meeting videos and to other domains such as broadcast news. These and other features and advantages of the present invention are more fully described in the Detailed Description of the Invention with reference to the Figures.