This invention relates generally to methods and apparatus for use in performing speaker identification.
In systems that provide for identification of a speaker, a general technique is to score the speaker's enunciation of a test phrase against each one of a number of individual Gaussian mixture models (GMMs) and to select, or identify, the speaker as that person associated with the individual GMM, or set of GMMs, achieving the best score above a certain threshold using, e.g., a maximum likelihood technique. Typically, these systems generate individual GMMs by independently training, a priori, on small (e.g., 30 millisecond (ms)) speech samples of training phrases spoken by the respective person.
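The scoring-and-threshold step described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: it assumes diagonal-covariance GMMs whose parameters are already trained, an average per-frame log-likelihood as the score, and hypothetical names (`gmm_log_likelihood`, `identify_speaker`, `models`, `threshold`) chosen only for this example.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames:    (T, D) array of feature vectors from the test phrase
    weights:   (K,)   mixture weights, summing to 1
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariances, an assumption)
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    log_weighted = np.log(weights) + log_comp                          # (T, K)
    # Numerically stable log-sum-exp over the K components.
    peak = log_weighted.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_frame.mean())

def identify_speaker(frames, models, threshold):
    """Return (best speaker or None, all scores).

    models maps a speaker name to a (weights, means, variances) tuple.
    The best-scoring speaker is accepted only if the score exceeds the
    threshold; otherwise no speaker is identified.
    """
    scores = {name: gmm_log_likelihood(frames, *params)
              for name, params in models.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores
```

Each speaker's model is scored independently here, which is the property the later discussion of discrimination turns on: no model sees any other speaker's data.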
Unfortunately, such systems do not perform well when attempting to discriminate the true speaker from people that merely sound like the true speaker. As such, in an attempt to improve discrimination these systems increase the number of GMMs to include "cohort" or "background" models, i.e., people that sound like the true speaker but are not (e.g., see Herbert Gish and Michael Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, pages 18-32, 1994).
Alternatively, for both the speech and speaker recognition problems, a different approach has recently been proposed which uses a discriminative cost function (which measures the empirical risk) during training in place of the maximum likelihood estimation, giving significantly improved generalization performance (e.g., see Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee, "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Transactions on Speech and Audio Processing, 5(3):257-265, 1997; and Chi-Shi Lui, Chin-Hui Lee, Wu Chou, Biing-Hwang Juang, and Aaron E. Rosenberg, "A study on minimum error discriminative training for speaker recognition," Journal of the Acoustical Society of America, 97(1):637-648, 1995). However, here the underlying model (a set of hidden Markov models) is left unchanged, and in the speaker recognition case, only the small vocabulary case of isolated digits was considered.
In providing speaker identification systems such as described above, support vector machines (SVMs) have been used for the speaker identification task directly, by training one-versus-rest and one-versus-another classifiers on the preprocessed data (e.g., see M. Schmidt, "Identifying speaker with support vector networks," Interface '96 Proceedings, Sydney, 1996). However, in such SVM-based speaker identification systems, training and testing are both orders of magnitude slower than, and the resulting performance is similar to, that of competing systems (e.g., see also, National Institute for Standards and Technology, Speaker recognition workshop, Technical Report, Maritime Institute of Technology, Mar. 27-28, 1996).
Unfortunately, the above-described approaches to speaker identification are not inherently discriminative, in that a given speaker's model(s) are trained only on that speaker's data, and effective discrimination relies to a large extent on finding effective score normalization and thresholding techniques. Therefore, I have developed an alternative approach that adds explicit discrimination to the GMM method. In particular, and in accordance with the invention, I have developed a way to perform speaker identification that uses a single Gaussian mixture model (GMM) for multiple speakers, referred to herein as a Discriminative Gaussian mixture model (DGMM).
In an illustrative embodiment of the invention, a DGMM comprises a single GMM that is used for all speakers. A likelihood sum of the GMM is factored into two parts, one of which depends only on the Gaussian mixture model, and the other of which is a discriminative term. The discriminative term allows for the use of a binary classifier, such as a support vector machine (SVM).
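As an illustration only, one way such a two-part factoring could be written is shown below. The notation is assumed for this sketch and is not taken from the text above: x is a feature vector, C_i is the i-th of N Gaussian components of the shared GMM, and S_j is the j-th speaker. The identity itself is simply the law of total probability over the components.

```latex
P(S_j \mid x) \;=\; \sum_{i=1}^{N}
  \underbrace{P(S_j \mid C_i, x)}_{\text{discriminative term}}\;
  \underbrace{P(C_i \mid x)}_{\text{GMM term}},
\qquad
P(C_i \mid x) \;=\; \frac{P(C_i)\, p(x \mid C_i)}{\sum_{k=1}^{N} P(C_k)\, p(x \mid C_k)} .
```

Under this reading, the second factor is computed from the single shared GMM alone, while the first factor asks, within a given component, whether x belongs to speaker S_j or not, which is the kind of two-class question a binary classifier such as an SVM could be trained to answer.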
In another embodiment of the invention, a voice messaging system incorporates a DGMM. The voice messaging system comprises a private branch exchange (PBX) and a plurality of user terminals, e.g., telephones, personal computers, etc.