Audio and video recordings have become commonplace with the advent of consumer-grade recording equipment. It is no longer rare to find a business meeting, a lecture, or a birthday party being recorded as a historical record for later review. Unfortunately, both the audio and video mediums provide few external or audio clues to assist in accessing the desired section of the recording. In books, indexing is provided by the table of contents at the front and the index at the end, which readers can browse to locate authors and references to authors. A similar indexing scheme would be useful in an audio stream, so that users could locate sections where specific speakers were talking. The limited amount of data associated with most video recordings does not provide enough information for the viewer to confidently and easily access desired points of interest. Instead, viewers must peruse the contents of a recording in sequence to retrieve desired information.
Retrieval can be aided by notes taken during the recording, for example notes which indicate the speaker and the topic. While this provides a structural outline, the lack of direct correlation between the video medium and the notation medium forces interpolation of the time stamps on the video with the content of the notes. This is complicated by the fact that notes for events in non-correlated media do not usually include the events' durations. In addition, such notetaking or indexing is quite burdensome. Computer systems may be used for notetaking during events, which may be recorded simultaneously or pre-recorded. Text-based systems using keyboards may be used in these instances, but since most people talk much faster than they type, creating computer-generated textual labels to describe the content in real time requires enormous effort.
Alternatively, the method of the present invention enables retrieval based on indexing an audio stream of a recording according to the speaker. In particular, an audio stream may be segmented into speaker events, and each segment labeled with the type of event, or speaker identity. When speech from individuals is intermixed, for example in conversational situations, the audio stream may be segregated into events according to speaker difference, with segments created by the same speaker identified or marked.
Speaker change markers showing different speakers in the audio stream may allow random access to otherwise sequential data. In a real-time setting, such audio segmenting may aid in creating a usable index into a recording as it is being made. Each segment represents an utterance by a single individual. Utterances by the same speaker are combined and similarly referenced to form an index. Identification of pauses, or silence intervals, in conversational speech is also important in audio indexing.
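The combination of same-speaker utterances into an index can be illustrated with a minimal sketch. The function name `build_speaker_index`, the `(start, end, label)` tuple layout, and the use of seconds as the time unit are assumptions made for illustration and are not taken from the method itself.

```python
from collections import defaultdict

def build_speaker_index(segments):
    """Group labeled segments into a per-label list of time spans.

    segments: list of (start_sec, end_sec, label) tuples sorted by start
    time (hypothetical layout). Back-to-back segments carrying the same
    label are merged into one span.
    """
    index = defaultdict(list)
    for start, end, label in segments:
        spans = index[label]
        if spans and spans[-1][1] == start:
            # Same label continues without a gap: extend the last span.
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return dict(index)
```

Given segments for speaker "A" at 0.0 to 4.0 and 4.0 to 6.5 seconds, a silence interval at 6.5 to 7.0, and speaker "B" at 7.0 to 12.0, the index holds one merged span (0.0, 6.5) for "A", with silence kept as its own entry.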
Creating an index into an audio stream, either in real time or in postprocessing, may enable a user to locate particular segments of the audio data. For example, this may enable a user to browse a recording to select audio segments corresponding to a specific speaker, or "fast-forward" through a recording to the next speaker. In addition, knowing the ordering of speakers can also provide content clues about the conversation, or about the context of the conversation.
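As one possible illustration of the "fast-forward to the next speaker" operation, the sketch below looks up the next speaker change from a sorted list of change times. The function name `next_speaker_change` and the representation of the index as a sorted list of times in seconds are assumptions for illustration only.

```python
from bisect import bisect_right

def next_speaker_change(change_times, position):
    """Return the first speaker-change time strictly after `position`.

    change_times: sorted list of times (seconds) at which the speaker
    changes (hypothetical representation of the index).
    Returns None when no later change exists.
    """
    i = bisect_right(change_times, position)
    return change_times[i] if i < len(change_times) else None
```

With change times of 0.0, 6.5, 7.0 and 12.0 seconds, a playback position of 3.0 jumps to 6.5, and a position of 12.0 returns `None` because no later speaker change is indexed.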
Gish et al., "Segregation of Speakers for Speech Recognition and Speaker Identification," Proc. Int. Conf. Acoustics, Speech and Signal Processing, May 1991, vol. 2, pp. 873-876, describe a method for segmenting speech using hierarchical clustering. A dendrogram is constructed by iteratively merging the pair of segments with the smallest distance. The distance between segments is based on the likelihood ratio of the two segments being from the same speaker versus being from different speakers. The application described involves separating speech from an air traffic controller and various pilots, by identifying the largest cluster in the dendrogram with the controller, and all others with the pilots. They do not discuss methods for separating the pilots, although cuts through the dendrogram might be used.
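The idea of merging segments by a likelihood-ratio distance can be sketched as follows. This is an illustrative simplification, not the procedure of Gish et al.: it assumes one diagonal-covariance Gaussian per segment, recomputes distances from the pooled frames after every merge, and stops at a requested number of clusters instead of building a full dendrogram. The names `log_var_sum`, `glr_distance`, and `cluster_segments` are hypothetical.

```python
import math

def log_var_sum(frames, floor=1e-6):
    """Return (n, sum over dimensions of the log maximum-likelihood variance)."""
    n = len(frames)
    total = 0.0
    for d in range(len(frames[0])):
        column = [f[d] for f in frames]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        total += math.log(max(var, floor))  # floor avoids log(0)
    return n, total

def glr_distance(x, y):
    """Negative log likelihood ratio: one shared Gaussian vs. two separate ones.

    Small when x and y look like the same source, large otherwise.
    """
    nx, lx = log_var_sum(x)
    ny, ly = log_var_sum(y)
    nz, lz = log_var_sum(x + y)
    return 0.5 * (nz * lz - nx * lx - ny * ly)

def cluster_segments(segments, n_clusters):
    """Merge the closest pair repeatedly until n_clusters groups remain.

    segments: list of segments, each a list of feature vectors.
    Returns a list of sorted lists of original segment indices.
    """
    clusters = [([i], list(seg)) for i, seg in enumerate(segments)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = glr_distance(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = (clusters[a][0] + clusters[b][0],
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(ids) for ids, _ in clusters]
```

In an application like the one described, the cluster holding the most frames could then be taken as the controller, although that selection step is not shown here.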
While this technique could be used for non-real-time speaker segmentation, the method of the present invention offers several improvements. First, the likelihood ratio used by Gish et al. is based on a single Gaussian, while the present method uses tied Gaussian mixtures, which we have shown improves performance. Second, the hierarchical clustering algorithm of the present method recomputes pairwise distances, thus providing effectively longer segments for the distance measure, which is known to improve accuracy. Third, hidden Markov modeling is applied in the present method so that the time resolution of the segmentation is on the order of 20 ms rather than the several-second segments used in the hierarchical clustering. Finally, the present method proposes a resegmentation algorithm which iteratively improves the Hidden Markov Model (HMM)-based segmentation.
Siu et al., "An Unsupervised Sequential Learning Algorithm for the Segmentation of Speech Waveforms with Multiple Speakers," Proc. Int. Conf. Acoustics, Speech and Signal Processing, March 1992, vol. 2, pp. 189-192, describe a method for separating several air traffic controllers from pilots. Silence segments are identified first, and initial speech segments are identified as those regions between silence. These segments are grouped into regions containing 50 speech segments, and the assumption is made that in these regions there is a single air traffic controller. Hierarchical clustering is then performed as in Gish et al., resulting in a cluster for the controller and a cluster for all the pilots. This data is used to initialize a Gaussian mixture model for the controller and the pilots. An Expectation-Maximization (EM) algorithm is then used to iteratively classify the segments as controller or pilot, and re-estimate the mixture models. After convergence, a dynamic programming algorithm is used to improve classification by taking into account speaker duration.
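The first step of such a pipeline, taking the regions between silences as initial speech segments, might look like the sketch below. How Siu et al. detect silence is not stated here, so the per-frame energy threshold, the minimum silence length, and the function name `speech_regions` are all assumptions for illustration.

```python
def speech_regions(energies, threshold, min_silence_frames=3):
    """Return [start, end) frame ranges of speech separated by silence.

    energies: per-frame energy values (assumed input representation).
    A frame counts as silent when its energy is below `threshold`; a
    speech region ends only after `min_silence_frames` consecutive
    silent frames, so shorter pauses stay inside the region.
    """
    regions = []
    start = None     # first frame of the open speech region, if any
    quiet_run = 0    # length of the current run of silent frames
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            quiet_run = 0
        else:
            quiet_run += 1
            if start is not None and quiet_run >= min_silence_frames:
                # Close the region at the first frame of the silent run.
                regions.append((start, i - quiet_run + 1))
                start = None
    if start is not None:
        # Drop any trailing silent frames from the final region.
        regions.append((start, len(energies) - quiet_run))
    return regions
```

For the energy sequence `[0, 0, 5, 6, 7, 0, 6, 0, 0, 0, 8, 9, 0]` with a threshold of 1, the single silent frame at index 5 is bridged and the result is `[(2, 7), (10, 12)]`.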
The method of the present invention offers several improvements over Siu et al. First, as noted above, hierarchical clustering using tied Gaussian mixtures gives better results, as does recomputing the distances. Second, the present use of hidden Markov modeling allows the durational constraints to be accounted for during classification, as opposed to using dynamic programming as a post-processor. Third, the present technique of using tied silence models allows silences to be determined during the classification stage, rather than by a pre-processor.
Sugiyama et al., "Speech Segmentation and Clustering Based on Speaker Features," Proc. Int. Conf. Acoustics, Speech and Signal Processing, April 1993, vol. 2, pp. 395-398, discuss a method for segmenting speech when the speakers are unknown, but the number of speakers is known. Their speaker models consist of a single-state HMM. Iterative resegmentation is performed, where the speaker models are retrained and the segmentation re-estimated. However, their method has several drawbacks. First, their speaker models are initialized randomly, a technique which is known to produce variable results. Our method describes robust initialization of speaker models. Second, silence is not estimated. Third, the single-state speaker HMMs are not as robust as multistate HMMs.
Matsui et al., "Comparison of Text-Independent Speaker Recognition Methods Using VQ-Distortion and Discrete/Continuous HMMs," Proc. Int. Conf. Acoustics, Speech and Signal Processing, March 1992, vol. 2, pp. 157-160, compare speaker identification methods using HMM speaker models and vector quantization (VQ). However, they do not teach segmentation of speech from multiple speakers.
In the present invention, Hidden Markov Models (HMMs) are used to model individual speakers. Speaker models consist of multiple-state HMMs with Gaussian output distributions, and a tied silence model. Such HMMs may be initially trained using Baum-Welch procedures when speakers are known and training data is available. Alternatively, individual HMMs may be initialized by first performing agglomerative hierarchical clustering using likelihood distances to obtain an initial segmentation of the speech waveform, and then using the initial segmentation to train individual speaker HMMs. The speaker HMMs can then be iteratively retrained as described below.
Networks of HMMs are created to model conversational speech including numerous speakers. Using the HMM network, the audio stream is segmented based on the most likely sequence of states through the network. This segmentation may be done in real-time, the segment information being correlated to and stored in conjunction with the audio stream even as it is being created and recorded. In post-recording operation, subsequent retraining of the models and resegmenting of the audio stream may be performed, iterations continuing while changes in segmentation occur from improved models.
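Finding the most likely state sequence through a network of speaker models can be illustrated with a small Viterbi decoder. The sketch makes strong simplifying assumptions that are not part of the method described: each speaker is collapsed to a single state, per-frame log-likelihoods are supplied by the caller, and a fixed log-domain `switch_penalty` stands in for the transition structure of the network. The name `viterbi_segment` is hypothetical.

```python
def viterbi_segment(frame_loglik, switch_penalty):
    """Return the most likely speaker label for each frame.

    frame_loglik[t][s]: log-likelihood of frame t under speaker s
    (assumed to be computed elsewhere from the speaker models).
    switch_penalty: positive cost subtracted whenever the label changes
    between consecutive frames, discouraging very short segments.
    """
    n_speakers = len(frame_loglik[0])
    score = list(frame_loglik[0])
    backptr = []
    for t in range(1, len(frame_loglik)):
        # Only the best previous state can win a switch, since the
        # penalty is the same for every change of speaker.
        best_prev = max(range(n_speakers), key=lambda s: score[s])
        new_score, pointers = [], []
        for s in range(n_speakers):
            stay = score[s]
            switch = score[best_prev] - switch_penalty
            if stay >= switch:
                prev, base = s, stay
            else:
                prev, base = best_prev, switch
            new_score.append(base + frame_loglik[t][s])
            pointers.append(prev)
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_speakers), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For the log-likelihoods `[[-1, -5], [-3, -2], [-1, -5], [-5, -1], [-5, -1], [-5, -1]]`, a penalty of 3 yields `[0, 0, 0, 1, 1, 1]`, whereas a penalty of 0 follows the frame-by-frame maximum and yields `[0, 1, 0, 1, 1, 1]`. Retraining the models on the decoded labels and decoding again, until the labels stop changing, would correspond to the iterative resegmentation described above.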
When the segmentation is completed, the audio stream is accompanied by an audio index, segregating the audio stream into utterances according to individuals. Non-speech sounds, such as ringing telephones, may also be detected and segmented.
It is an object of the present invention to provide an improved method of performing unsupervised estimation of initial segmentation of an audio stream containing multiple speakers.