1. Field of the Invention
This invention relates generally to speaker recognition and more particularly to text-independent speaker identification over a large population.
2. Description of the Related Art
Speaker recognition technology has many uses including data mining and metadata for automated court, meeting, and lecture transcriptions. Speaker recognition technology can be used in real time in concert with speech recognition to enhance broadcast closed-captioning features by providing the identity of the speaker along with the actual words spoken.
Traditional speech recognition and speaker verification use similar analysis tools to achieve their goals. An input utterance is first processed to determine its essential characteristics. Typically, input utterances are converted to cepstral coefficients. A cepstrum is an inverse Fourier transform of the log power spectrum. The cepstral coefficients for training utterances are saved in a training phase. This training set may range from short utterances of passwords for speaker verification systems to extensive vocabulary recitals for speech recognition. In speech recognition, an input utterance is compared to models based on saved training data to determine which utterance is most similar. In a generalized speech recognition system the saved training information must be a generalized representation of many people's way of forming an utterance. In speaker verification, by contrast, the training information represents the individual characteristics of the speaker, and the verification system tries to determine whether an authorized person's input utterance is sufficiently close to the training data and can be distinguished from an impostor's utterances. As a result, the training in a speaker verification system emphasizes individual characteristics, while in a speech recognition system the characteristics are generalized over many individual speakers.
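The cepstrum definition above can be illustrated with a minimal sketch. It assumes a single frame of real-valued samples and uses a naive discrete Fourier transform written with the standard library only. The function names (`dft`, `real_cepstrum`), the `floor` parameter, and the toy frame are hypothetical choices for illustration; a practical front end would normally add windowing, a fast transform, and often mel-frequency warping, none of which is shown here.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative, not fast)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log power spectrum of one frame of samples."""
    spectrum = dft(frame)
    # Floor the power before taking the log so that empty bins do not produce log(0).
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]


# Toy frame: a decaying sinusoid (hypothetical data, not real speech).
frame = [math.exp(-0.05 * t) * math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
cepstrum = real_cepstrum(frame)
```

The low-order entries of `cepstrum` are the kind of coefficients that would be stored during a training phase; how many are kept is a design choice not specified here.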
Speaker recognition systems determine the identity of a conversational speaker. They typically do not have the advantage of being able to compare a short spoken password to the same short training data. Speaker recognition systems share many of the attributes of speech recognition systems: the input samples and training data are not limited to simple verification phrases, there is little a priori knowledge of the input, the pace of speech is uncontrolled, and environmental effects, including the characteristics of the transmission device and media (telephone circuit, radio or broadcast channel), are unknown.
For example, one technique for speaker recognition would be to use speech recognition to capture the exact sequence of phones, examine the acoustic phonetic details of different speakers producing the same sounds and sequences of sounds, and compare these details across speakers or score them for each speaker against a model.
As an extreme example, consider speakers A, B, and C, where speaker A lisps and speaker B stutters. Given perfect recognition of a large enough sample of speech by all three, the acoustic scores of the [s] and [sh] sounds might distinguish A from B and C, and either the acoustic scores or the Hidden Markov Model (HMM) path traversed by the initial stop consonants, for example, might distinguish B from C and A.
A problem with this approach is that speech recognizers are usually optimized for the recognition of words, not of phones; use word n-gram statistics to guide their decisions; and train their acoustic processing, model topologies, and time alignment to ignore speaker differences.
These very difficulties in speech recognition systems are in essence the differences in speakers, that is, the differences that can be used for speaker recognition. Most practical methods of speaker recognition, and especially those with very limited training, are based on differences in broadly defined voice quality, rather than on these speaker differences. Individual speakers use different inventories of phones, or speech sounds. These cumulative differences in a speaker's pronunciation, represented by phones, can be exploited to recognize a speaker. Previous methods have used sequences of phonotactically constrained phonemes to extract and cluster acoustic features to recognize speakers or languages. There is a distinction between phonemes, defined by a language, as given by the dictionary, and phones, the actual pronunciation, as given by the acoustics. This invention exploits cumulative phone differences to identify the speaker. The dynamics of pronunciation contribute to human recognition of speakers; however, exploiting such information automatically is difficult because, in principle, comparisons must be made between different speakers saying essentially the same things.
What is needed is a tool that will consistently recognize and classify as many phonetic states as possible, regardless of their linguistic roles (i.e., what words are being spoken), using sufficiently sensitive acoustic measurements, so that comparisons can be made among different speakers' realizations of the same speech gestures.
In consideration of the problems detailed above and the limitations enumerated in the partial solutions thereto, an object of the present invention is to provide an improved speaker recognition system using the phonetics of pronunciation.
Another object of the instant invention is to provide a speaker recognition system that operates independently of the text spoken, sex of the speaker, language used by the speaker, and audio characteristics of the input channel.
Another object of the instant invention is to provide a speaker recognition system that is not wholly dependent on acoustic measurements but relies on higher level speaker information.
Still another object of the instant invention is to provide a speaker recognition system that uses parallel streams of highly uncorrelated multi-lingual features.
Still another object of the instant invention is to provide a speaker recognition system that can exploit the long-term speaker characteristics associated with large amounts of training data.
Yet another object of the instant invention is to provide a speaker recognition system that makes use of n-gram modeling.
In order to attain the objectives described above, according to an aspect of the present invention, there is provided a method of and device for phone-based speaker recognition whereby speakers may be recognized and distinguished based on higher level speaker information and the cumulative differences of phonetic features of their voices.
The instant invention is a speaker-recognition system based only on phonetic sequences, instead of the traditional acoustic feature vectors. Although the phones are detected using acoustic feature vectors, the speaker recognition is performed strictly from the phonetic sequence created by the phone recognizer(s). Speaker recognition is performed using the outputs of one or more recognizers each trained on a different linguistic characteristic. Recognition of the same speech sample by these recognizers constitutes different views of the phonetic states and state sequences uttered by the speaker.
The instant system performs phonetic speaker recognition in four steps. First, a phone recognizer processes the test speech utterance to produce phone sequences. Then a test speaker model is generated using phone n-gram counts. Next, the test speaker model is compared to the hypothesized speaker models and a background model. Finally, the scores from the hypothesized speaker models and the background model are combined to form a single recognition score.
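The four steps above can be sketched as follows, under stated assumptions. The phone recognizer is taken as given, so the sketch starts from already-recognized phone sequences. The comparison is implemented here as a count-weighted average log-likelihood ratio between a hypothesized speaker model and the background model, with add-alpha smoothing; that scoring form, the smoothing, the function names, and the toy phone strings are all illustrative assumptions rather than details stated in the text.

```python
import math
from collections import Counter


def ngram_counts(phones, n=2):
    """Count phone n-grams in one recognized phone sequence."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


def log_prob(ngram, counts, total, vocab_size, alpha=1.0):
    """Add-alpha smoothed log relative frequency (smoothing is an assumption)."""
    return math.log((counts.get(ngram, 0) + alpha) / (total + alpha * vocab_size))


def llr_score(test_counts, speaker_counts, background_counts):
    """Count-weighted average of log P(ngram | speaker) - log P(ngram | background)."""
    vocab = set(test_counts) | set(speaker_counts) | set(background_counts)
    spk_total = sum(speaker_counts.values())
    bkg_total = sum(background_counts.values())
    weighted_sum = 0.0
    for ngram, count in test_counts.items():
        weighted_sum += count * (
            log_prob(ngram, speaker_counts, spk_total, len(vocab))
            - log_prob(ngram, background_counts, bkg_total, len(vocab))
        )
    return weighted_sum / sum(test_counts.values())


# Hypothetical phone strings: speaker A produces [th] where speaker B produces [s].
model_a = ngram_counts("th ih th ih th ax th ih".split())
model_b = ngram_counts("s ih s ih s ax s ih".split())
background = model_a + model_b  # pooled counts stand in for a background model

test_counts = ngram_counts("th ih th ax th ih".split())
score_a = llr_score(test_counts, model_a, background)
score_b = llr_score(test_counts, model_b, background)
```

With this toy data the test utterance scores above zero against speaker A's model and below zero against speaker B's, which is the behavior a decision threshold would rely on.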
In some embodiments, multiple-language phone sequences are accommodated by incorporating phone recognizers trained on several languages, resulting in a matrix of hypothesized speaker models and background models.
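One way to picture the resulting matrix is as a table of scores indexed by hypothesized speaker and by recognizer language, which must then be reduced to one score per speaker. The sketch below assumes a weighted average as the fusion rule; the text does not specify the rule, and the function name `fuse_scores`, the language labels, and the score values are hypothetical.

```python
def fuse_scores(score_matrix, weights=None):
    """Combine per-language scores into one score per hypothesized speaker."""
    fused = {}
    for speaker, per_language in score_matrix.items():
        langs = sorted(per_language)
        # Equal weights unless the caller supplies per-language weights.
        w = weights or {lang: 1.0 for lang in langs}
        total_weight = sum(w[lang] for lang in langs)
        fused[speaker] = sum(w[lang] * per_language[lang] for lang in langs) / total_weight
    return fused


# Hypothetical per-language scores for two hypothesized speakers.
scores = {
    "speaker_a": {"english": 0.40, "german": 0.10, "spanish": 0.25},
    "speaker_b": {"english": -0.20, "german": 0.30, "spanish": -0.10},
}
fused = fuse_scores(scores)
best = max(fused, key=fused.get)
```

Unequal weights could be passed through `weights` if some language recognizers prove more discriminative than others; choosing them would require held-out data.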