There are many instances in which a collection of data includes data of different classes within a particular category (e.g., speech, images, text). In speech applications, examples of multi-class data include speech of a speaker of interest mixed with that of other speakers, speech from both genders, speech in multiple languages, and so on. Prior art methods model a particular class of data using either data of just that class or data of that class contaminated with data from other classes. In some instances it may be impossible to acquire data of just the class of interest, and models generated from contaminated data are less effective than models generated using data from only the class of interest. Therefore, there is a need for a method of modeling a user-definable class of data using data that includes the user-definable class and at least one other class of data, where the contaminating effects of the classes of data that are not of interest are eliminated. The present invention is just such a method.
Speech sounds result when air from the lungs is forced through the glottis, producing resonances in the vocal tract at various frequencies. Locations of peaks in the magnitude spectrum of a speech sound, called formants, occur at the resonant frequencies of the vocal tract and are related to the configuration of the vocal tract at the time the speech sound was produced. Differences exist between the vocal tracts of different people, resulting in differences in the spectral location and magnitude of formants from speaker to speaker. These differences are exploited in various speech processing applications (e.g., speech recognition, gender identification, speaker recognition, language identification). Popular statistical modeling methods for speech processing applications include the Hidden Markov Model (HMM), the Buried Markov Model (BMM), and the Gaussian Mixture Model (GMM).
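As a concrete illustration of the last point, the following sketch shows how a Gaussian Mixture Model assigns a likelihood to a single scalar feature value (e.g., one cepstral coefficient). This is a minimal, hypothetical example for illustration only; the mixture weights, means, and variances are invented values, not parameters of any method discussed herein.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar feature x under a Gaussian mixture:
    log( sum_i  w_i * N(x; mu_i, var_i) )."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return math.log(total)

# Hypothetical two-component mixture; higher (less negative)
# values indicate a better fit of the feature to the model.
weights = [0.6, 0.4]
means = [0.0, 3.0]
variances = [1.0, 2.0]
print(gmm_log_likelihood(0.0, weights, means, variances))
```

A real speaker model would use mixtures of multivariate Gaussians over vectors of cepstral coefficients, but the scoring principle is the same.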
Prior art methods of speaker recognition typically involve a modeling process in which a speaker of interest provides a high quality sample of speech, where the sample is not corrupted by the speech of another person or by noise. Next, a statistical model is generated from the sample using one of the available statistical modeling methods. The model may then be used to determine whether or not the speaker of interest uttered a subsequently provided high quality speech sample. This sample of unknown origin is scored against the model generated for the speaker of interest. If the sample of unknown origin scores sufficiently high against the model, then the sample of unknown origin is declared to have been spoken by the speaker of interest. Such a method works well when the speakers voluntarily provide pristine speech samples. However, there are many applications in which a speaker of interest will not voluntarily provide a speech sample, and any sample obtained is not pristine but is contaminated with either noise or speech from one or more other speakers. When the speech sample is contaminated with speech from other speakers, the prior art method described above does not provide an optimal solution.
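The train-then-score procedure described above can be sketched as follows. A single Gaussian stands in for whatever statistical model (HMM, GMM, etc.) would actually be trained; the frame values and acceptance threshold are hypothetical and chosen purely for illustration.

```python
import math

def train_model(frames):
    """Fit a single Gaussian to enrollment frames -- a stand-in for a
    statistical model trained on a pristine speech sample."""
    n = len(frames)
    mean = sum(frames) / n
    var = sum((f - mean) ** 2 for f in frames) / n
    return mean, var

def score(frames, model):
    """Average log-likelihood of the unknown-origin frames under the model."""
    mean, var = model
    ll = 0.0
    for f in frames:
        ll += -0.5 * math.log(2 * math.pi * var) - (f - mean) ** 2 / (2 * var)
    return ll / len(frames)

def is_speaker_of_interest(frames, model, threshold=-2.0):
    """Declare the sample to be from the speaker of interest when the
    score clears a (hypothetical) acceptance threshold."""
    return score(frames, model) >= threshold
```

Note that nothing in this procedure accounts for a second speaker's frames being mixed into the unknown sample, which is precisely the shortcoming identified above.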
In an article entitled “An Unsupervised, Sequential, Learning Algorithm for the Segmentation of Speech Waveforms with Multiple Speakers,” by Man-Hung Siu et al., IEEE, 1992, pp. 189-192, a method was disclosed for segmenting speech containing multiple speakers into segments of individual speakers without any prior knowledge (i.e., no training data). The method includes the steps of generating a spectral representation of the speech signal in order to identify acoustic segments within the speech signal. Next, a mean and outer product of cepstral vectors are computed for each acoustic segment identified. Next, acoustic segments that represent noise are identified based on the amount of energy present in each segment. Next, the contiguous speech segments are grouped into larger segments. Next, a distance matrix is formed for each acoustic segment. Next, the segments are clustered according to their distance matrix, where each cluster represents an individual speaker. The present method does not identify noise segments, group contiguous segments, or cluster segments as does Siu et al.
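The segment-then-cluster idea described above can be caricatured in a few lines. The greedy centroid clustering below is a deliberately simplified stand-in for the distance-matrix clustering of Siu et al., and the per-segment feature vectors are hypothetical.

```python
def segment_means(segments):
    """Mean vector of the (hypothetical) cepstral vectors in each segment."""
    return [[sum(c) / len(c) for c in zip(*seg)] for seg in segments]

def distance(a, b):
    """Euclidean distance between two segment means."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(means, threshold):
    """Greedy clustering: each segment joins the first existing cluster
    whose centroid is within the threshold, else starts a new cluster.
    Each resulting cluster is taken to represent one speaker."""
    clusters = []  # list of (centroid, member segment indices)
    for i, m in enumerate(means):
        for c in clusters:
            if distance(c[0], m) <= threshold:
                c[1].append(i)
                break
        else:
            clusters.append((m, [i]))
    return clusters
```

With three segments whose mean vectors fall into two well-separated groups, the sketch recovers two clusters, i.e., two putative speakers.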
U.S. Pat. No. 5,598,507, entitled “METHOD OF SPEAKER CLUSTERING FOR UNKNOWN SPEAKERS IN CONVERSATIONAL AUDIO DATA,” discloses a method of identifying each speaker in data containing multiple speakers by dividing the data into segments, determining the distance between each pair of segments, combining segments between which there is no significant distance, declaring the remaining separate segments to each contain speech from only one speaker, and forming a model of each separate segment to represent each of the individual speakers. The present invention does not identify each unique speaker in speech data as does U.S. Pat. No. 5,598,507. U.S. Pat. No. 5,598,507 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,623,539, entitled “USING VOICE SIGNALS ANALYSIS TO IDENTIFY AUTHORIZED USERS OF A TELEPHONE SYSTEM,” discloses a device for and method of determining if an authorized user is using a telephone by separating speech from noise in a multi-party speech signal. Then, the speech of each speaker is separated according to the method of Siu et al. described above. Then, each person's speech data is compared to speech data of authorized users. If an authorized user is detected then the conversation is allowed to proceed. Otherwise, corrective action is taken. The present invention does not use the method of Siu et al. to identify each speaker in speech data. U.S. Pat. No. 5,623,539 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,659,662, entitled “UNSUPERVISED SPEAKER CLUSTERING FOR AUTOMATIC SPEAKER INDEXING OF RECORDED AUDIO DATA,” discloses a device for and a method of identifying each speaker in data containing multiple speakers by dividing the data into segments, determining the distance between each pair of segments, combining segments between which there is no significant distance, declaring the remaining separate segments to each contain speech from only one speaker, and forming a model of each separate segment to represent each of the individual speakers. The present invention does not identify each unique speaker in speech data as does U.S. Pat. No. 5,659,662. U.S. Pat. No. 5,659,662 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,332,122, entitled “TRANSCRIPTION SYSTEM FOR MULTIPLE SPEAKERS USING AND ESTABLISHING IDENTIFICATION,” discloses a device for and method of transcribing the speech of multiple speakers by assigning a speaker identification number to the first speaker. The speech signal is continuously monitored for a change of speaker, using a conventional method of determining whether or not additional speech came from the speaker to which the current identification number was assigned. If the speaker remains the same, the speech is transcribed. If a change of speaker is detected, then it is determined whether or not the present speaker has previously been enrolled. If so, the identification number assigned when the speaker enrolled is noted. Otherwise, the speaker is enrolled and assigned an identification number. Speech is then transcribed until another change of speaker is detected. The present invention does not assign an identification number to the first speaker in a signal, continuously monitor the speech for changes in speaker, and take the action described above as does U.S. Pat. No. 6,332,122. U.S. Pat. No. 6,332,122 is hereby incorporated by reference into the specification of the present invention.
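The enrollment bookkeeping described above can be sketched as follows. The speaker "signature" used as the lookup key is a hypothetical stand-in for whatever conventional speaker-change determination the patent relies on; this is an illustrative sketch, not the patented system.

```python
class Transcriber:
    """Sketch of an enroll-or-reuse identification flow: the first
    occurrence of a speaker signature triggers enrollment and the
    assignment of a new identification number; later occurrences
    reuse the number assigned at enrollment."""

    def __init__(self):
        self.enrolled = {}   # speaker signature -> identification number
        self.next_id = 1
        self.transcript = []  # list of (id number, transcribed utterance)

    def process(self, utterance, signature):
        if signature not in self.enrolled:
            self.enrolled[signature] = self.next_id
            self.next_id += 1
        self.transcript.append((self.enrolled[signature], utterance))
```

Processing utterances from speakers A, B, A in turn yields a transcript tagged 1, 2, 1, with only two enrollments performed.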
U.S. Pat. No. 6,404,925, entitled “METHODS AND APPARATUSES FOR SEGMENTING AN AUDIO-VISUAL RECORDING USING IMAGE SIMILARITY SEARCHING AND AUDIO SPEAKER RECOGNITION,” discloses a device for and method of segmenting an audio-visual recording presentation by presentation and identifying the number of speakers. It is presumed that only one person speaks during a presentation, but that one person may give more than one presentation. The first step is to segment the recording according to changes in slides presented during the presentation. Next, the audio associated with each identified slide is extracted. The audio segments are clustered to determine the number of speakers. The clusters are then used as training data for segmenting speakers. The present invention does not use video to segment speakers or clustering to determine the number of speakers as does U.S. Pat. No. 6,404,925. U.S. Pat. No. 6,404,925 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,411,930, entitled “DISCRIMINATIVE GAUSSIAN MIXTURE MODELS FOR SPEAKER VERIFICATION,” discloses a device for and method of speaker identification for multiple speakers using a single Gaussian Mixture Model (GMM) that is modified to include an output of a Support Vector Machine (SVM), where the SVM was trained to distinguish each speaker for which the GMM is used. The present invention does not use a GMM modified by a SVM that is trained to identify each speaker for which the GMM is used as does U.S. Pat. No. 6,411,930. U.S. Pat. No. 6,411,930 is hereby incorporated by reference into the specification of the present invention.
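One way to picture a generative score being modified by a discriminative component is sketched below. The linear-kernel SVM decision function and the additive fusion rule are hypothetical illustrations of the general idea, not the specific combination claimed in U.S. Pat. No. 6,411,930.

```python
def svm_decision(x, support_vectors, alphas, bias):
    """Linear-kernel SVM decision value: sum_i alpha_i * <sv_i, x> + b.
    The support vectors, alphas, and bias here are hypothetical."""
    return sum(a * sum(s, ) if False else
               a * sum(si * xi for si, xi in zip(sv, x))
               for a, sv in zip(alphas, support_vectors)) + bias

def combined_score(gmm_score, svm_value, weight=0.5):
    """Hypothetical fusion rule: shift the generative GMM score by a
    weighted discriminative SVM decision value."""
    return gmm_score + weight * svm_value
```

A positive SVM decision value raises the combined score for the targeted speaker, sharpening the discrimination that a GMM alone provides.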
U.S. Pat. No. 6,424,935, entitled “TWO-WAY SPEECH RECOGNITION AND DIALECT SYSTEM,” discloses a device for and method of converting speech to text for multiple speakers by recording the speech pattern of the first speaker, storing transcribed speech, and comparing the vocal pattern to dialect records. If a match is found, the vocal pattern is associated with the transcribed speech and dialect record. The present invention does not compare vocal patterns to dialect records as does U.S. Pat. No. 6,424,935. U.S. Pat. No. 6,424,935 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,535,848, entitled “METHOD AND APPARATUS FOR TRANSCRIBING MULTIPLE FILES INTO A SINGLE DOCUMENT,” discloses a device for and method of transcribing speech from multiple speakers into multiple files by recording each speaker using a separate recording device for each speaker. The present invention does not identify each speaker by using a separate recording device for each speaker as does U.S. Pat. No. 6,535,848. U.S. Pat. No. 6,535,848 is hereby incorporated by reference into the specification of the present invention.