Speaker recognition is a tool that has been increasingly deployed in a variety of applications, particularly in telecommunications. For example, speaker recognition is currently utilized for voice mail access, telephone banking, and calling cards. Speaker recognition provides the ability to recognize a person automatically from his or her voice.
A person's voice may be characterized by acoustic features that vary depending on the state of physiological attributes of the speaker. For example, as illustrated in FIG. 1, as the speaker 10 changes the shape of his or her vocal tract to produce a certain sound, the acoustic features of the speaker's voice vary with the vocal state 15a, . . . , 15d. Feature vectors x1, . . . , xT represent acoustic features in D dimensions that are extracted from a speech waveform and that vary with vocal state.
Frequency spectra are acoustic features that are both effective for automatic speaker recognition and easily extracted from a speech waveform. FIG. 2 is a diagram illustrating frequency spectra produced by different vocal states. As illustrated, vocal states 15b, 15c produce frequency spectra 17b, 17c, respectively, that differ in the number, magnitude, and frequency of their spectral frequency components (i.e., formant peaks). Different speakers generally produce different frequency spectra 17b, 19b for the same state 15b. Other features known to those skilled in the art may also be employed, such as volume, clarity, speed, and pitch.
FIG. 3A is a graph illustrating distributions of frequency spectra produced by different vocal states for a speaker. In this example, each plotted point corresponds to a formant peak in two dimensions, amplitude A and frequency F. As shown, repeated tests of each vocal state typically produce a cluster of frequency spectra 20a, 20b, 20c, and 20d. Clustering occurs because (i) speech production is not deterministic (i.e., a sound produced twice is generally not exactly the same) and (ii) spectra produced from a particular vocal-tract shape can vary widely due to coarticulation effects. Thus, automatic speaker recognition systems typically employ statistical speaker models to represent the acoustic features of a person's voice.
As illustrated in FIG. 3B, the distribution of frequency spectra produced from a particular vocal state may be modeled according to a multidimensional Gaussian probability density function (pdf). Thus, an individual speaker model may be implemented as a Gaussian mixture model (GMM). A GMM speaker model is a weighted summation of Gaussian pdfs 25a, . . . , 25d with state-dependent mean vectors μs(i), variance vectors σs(i), and mixture weights ωs(i) for states i=1, . . . , M. Each state in a GMM speaker model models the distribution of acoustic features associated with a particular vocal state.
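The weighted summation described above may be written out explicitly. The following is a sketch only: it writes the variance vector of state i as \sigma_s(i), assumes that the use of variance vectors implies a diagonal covariance for each state, and assumes that the feature vectors of an utterance are treated as independent. The symbols \lambda_s (the speaker model) and X (the utterance) are introduced here for illustration.

```latex
% Density of a single D-dimensional feature vector x under the speaker model
p(x \mid \lambda_s) = \sum_{i=1}^{M} \omega_s(i)\,
    \mathcal{N}\bigl(x;\ \mu_s(i),\ \sigma_s(i)\bigr),
\qquad \sum_{i=1}^{M} \omega_s(i) = 1

% State-dependent Gaussian pdf with diagonal covariance (assumed);
% \mu_{s,d}(i) and \sigma_{s,d}(i) are the d-th elements of the mean
% and variance vectors of state i
\mathcal{N}\bigl(x;\ \mu_s(i),\ \sigma_s(i)\bigr) =
    \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\,\sigma_{s,d}(i)}}
    \exp\!\left( -\frac{\bigl(x_d - \mu_{s,d}(i)\bigr)^2}{2\,\sigma_{s,d}(i)} \right)

% Log-likelihood of an utterance X = (x_1, \ldots, x_T),
% assuming independent feature vectors
\log p(X \mid \lambda_s) = \sum_{t=1}^{T} \log p(x_t \mid \lambda_s)
```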
FIG. 4 is a diagram illustrating a background speaker model which is generated from a collective group of speakers. A background model 65 may be generated for a specific class of individuals 50, for example, a gender-based class. As with individual speaker models, background speaker models may be represented as a weighted summation of Gaussian probability density functions with state-dependent mean vectors μB(i), variance vectors σB(i), and mixture weights ωB(i) for states i=1, . . . , M. Each state in the background speaker model 65 models the distribution of acoustic features associated with a particular vocal state 55a, . . . , 55d produced by a class of speakers 50.
Background speaker models are generally used in speaker verification applications to determine whether the speaker is the person he or she claims to be. Particular speaker verification systems implement a likelihood ratio test in which an input utterance is applied to a speaker model representing the claimed speaker and to a background model representing a certain class of non-claimed speakers. The speaker model computes a probability indicating the “likelihood” that the input utterance was produced by the claimed speaker, and the background model computes the corresponding likelihood that it was produced by a non-claimed speaker. The speaker is verified if the ratio of these two probabilities meets a certain threshold value.
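The likelihood ratio test described above can be illustrated with a minimal sketch. It is not a definitive implementation: it assumes diagonal-covariance Gaussian mixture models, independent feature vectors, and a ratio computed in the log domain (so the threshold applies to the difference of log-likelihoods). The function names, the (weights, means, variances) model layout, and the toy two-state models are hypothetical and chosen only for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        for xd, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of an utterance under a GMM."""
    total = 0.0
    for x in frames:
        # Weighted log density of each state, combined with log-sum-exp
        # so that very small probabilities do not underflow to zero.
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


def verify(frames, speaker_model, background_model, log_threshold=0.0):
    """Return (log likelihood ratio, decision) for a claimed identity."""
    score = (gmm_log_likelihood(frames, *speaker_model)
             - gmm_log_likelihood(frames, *background_model))
    return score, score >= log_threshold


# Hypothetical toy models: (mixture weights, mean vectors, variance vectors)
speaker_model = ([0.5, 0.5],
                 [[0.0, 0.0], [4.0, 4.0]],
                 [[1.0, 1.0], [1.0, 1.0]])
background_model = ([0.5, 0.5],
                    [[10.0, 10.0], [-10.0, -10.0]],
                    [[4.0, 4.0], [4.0, 4.0]])

claimed_frames = [[0.1, -0.2], [3.9, 4.1]]        # close to the speaker states
impostor_frames = [[9.5, 10.2], [-10.1, -9.8]]    # close to the background states

claimed_score, claimed_ok = verify(claimed_frames, speaker_model, background_model)
impostor_score, impostor_ok = verify(impostor_frames, speaker_model, background_model)
```

In this sketch the score is averaged over frames so that utterances of different lengths can share one threshold; that normalization is a design choice of the example rather than something stated in the text above.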
For more details regarding the implementation of individual and background speaker models using Gaussian mixture models, refer to D. A. Reynolds, “Automatic Speaker Recognition Using Gaussian Mixture Speaker Models,” The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173-192, 1995, the entire contents of which are incorporated herein by reference.