1. Field of the Invention
The present invention relates to a system and method for adapting speaker verification models to achieve enhanced performance during verification and, in particular, to a subword based speaker verification system having the capability of adapting a neural tree network (NTN), Gaussian mixture model (GMM), dynamic time warping (DTW) template, or combinations of the above, without requiring additional time consuming retraining of the models.
The invention relates to the fields of digital speech processing and speaker verification.
2. Description of the Related Art
Speaker verification is a speech technology in which a person's identity is verified using a sample of his or her voice. In particular, speaker verification systems attempt to match the voice of the person whose identity is undergoing verification with a known voice. Speaker verification provides an advantage over other security measures such as personal identification numbers (PINs) and personal information, because a person's voice is uniquely tied to his or her identity. It provides a robust method for security enhancement that can be applied in many different application areas including computer telephony.
Within speaker recognition, the two main areas are speaker identification and verification. A speaker identification system attempts to determine the identity of a person within a known group of people using a sample of his or her voice. In contrast, a speaker verification system attempts to determine if a person's claimed identity (whom the person claims to be) is valid using a sample of his or her voice.
Speaker verification consists of determining whether or not a speech sample provides a sufficient match to a claimed identity. The speech sample can be text dependent or text independent. Text dependent speaker verification systems verify the speaker after the utterance of a specific password phrase. The password phrase is determined by the system or by the user during enrollment and the same password is used in subsequent verification. Typically, the password phrase is constrained within a fixed vocabulary, such as a limited number of numerical digits. The limited number of password phrases gives the imposter a higher probability of discovering a person's password, reducing the reliability of the system.
A text independent speaker verification system does not require that the same text be used for enrollment and testing as in a text dependent speaker verification system. Hence, there is no concept of a password and a user will be recognized regardless of what he or she speaks.
Speech identification and speaker verification tasks may involve large vocabularies in which the phonetic content of different vocabulary words may overlap substantially. Thus, storing and comparing whole word patterns can be unduly redundant, since the constituent sounds of individual words are treated independently regardless of their identifiable similarities. For these reasons, conventional vocabulary speech recognition and text-dependent speaker verification systems build models based on phonetic subword units.
Conventional approaches to performing text-dependent speaker verification include statistical modeling, such as hidden Markov models (HMM), or template-based modeling, such as dynamic time warping (DTW) for modeling speech. For example, subword models, as described in A. E. Rosenberg, C. H. Lee and F. K. Soong, "Subword Unit Talker Verification Using Hidden Markov Models", Proceedings ICASSP, pages 269-272 (1990) and whole word models, as described in A. E. Rosenberg, C. H. Lee and S. Gokeen, "Connected Word Talker Recognition Using Whole Word Hidden Markov Models", Proceedings ICASSP, pages 381-384 (1991) have been considered for speaker verification and speech recognition systems. HMM techniques have the limitation of generally requiring a large amount of data to sufficiently estimate the model parameters.
Other approaches include the use of Neural Tree Networks (NTN). The NTN is a hierarchical classifier that combines the properties of decision trees and neural networks, as described in A. Sankar and R. J. Mammone, "Growing and Pruning Neural Tree Networks", IEEE Transactions on Computers, C-42:221-229, Mar. 1993. For speaker recognition, training data for the NTN consists of data for the desired speaker and data from other speakers. The NTN partitions feature space into regions that are assigned probabilities which reflect how likely a speaker is to have generated a feature vector that falls within the speaker's region.
The above described modeling techniques rely on speech being segmented into subwords. Modeling at the subword level expands the versatility of the system. Moreover, it is also conjectured that the variations in speaking styles among different speakers can be better captured by modeling at the subword level. Traditionally, segmentation and labeling of speech data was performed manually by a trained phonetician using listening and visual cues. However, there are several disadvantages to this approach, including the time consuming nature of the task and the highly subjective nature of decision-making required by these manual processes.
One solution to the problem of manual speech segmentation is to use automatic speech segmentation procedures. Conventional automatic speech segmentation processing has used hierarchical and nonhierarchical approaches.
Hierarchical speech segmentation involves a multi-level, fine-to-coarse segmentation which can be displayed in a tree-like fashion called a dendrogram. The initial segmentation is a fine level, with the limiting case being one segment per feature vector. Thereafter, a segment is chosen to be merged with either its left or right neighbor using a similarity measure. This process is repeated until the entire utterance is described by a single segment.
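The fine-to-coarse merging described above can be illustrated with a minimal Python sketch. It is not taken from any cited system: the function names `merge_cost` and `hierarchical_segmentation` are hypothetical, and the similarity measure (squared Euclidean distance between segment means) is an assumption chosen only for illustration.

```python
def merge_cost(a, b):
    """Assumed similarity measure: squared distance between segment means."""
    mean_a = [sum(col) / len(a) for col in zip(*a)]
    mean_b = [sum(col) / len(b) for col in zip(*b)]
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


def hierarchical_segmentation(frames):
    """Return every level of the segmentation, finest first, coarsest last.

    Each level is a list of (start, end) frame index pairs, end exclusive.
    """
    # Finest level: every feature vector is its own segment.
    segments = [(i, i + 1) for i in range(len(frames))]
    levels = [list(segments)]
    while len(segments) > 1:
        # Cost of merging each segment with its right-hand neighbor.
        costs = [
            merge_cost(frames[s0:e0], frames[s1:e1])
            for (s0, e0), (s1, e1) in zip(segments, segments[1:])
        ]
        k = costs.index(min(costs))  # most similar adjacent pair
        segments[k:k + 2] = [(segments[k][0], segments[k + 1][1])]
        levels.append(list(segments))
    return levels


# Toy utterance: two frames near the origin, two frames near (5, 5).
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
levels = hierarchical_segmentation(frames)
```

Reading `levels` from first to last traces the dendrogram from the one-frame-per-segment level up to the single segment covering the whole utterance.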
Non-hierarchical speech segmentation attempts to locate the optimal segment boundaries by using a knowledge engineering-based rule set or by extremizing a distortion or score metric. The techniques for hierarchical and non-hierarchical speech segmentation have the limitation of needing prior knowledge of the number of speech segments and corresponding segment modules.
A technique not requiring prior knowledge of the number of clusters is defined as "blind" clustering. This method is disclosed in U.S. patent application Ser. No. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing Systems", filed on Apr. 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on Apr. 2, 1996, both of which are herein incorporated by reference. In blind clustering, the number of clusters is unknown when the clustering is initiated. In the aforementioned application, an estimate of the range of the minimum number of clusters and maximum number of clusters of a data sample is determined. A clustering data sample includes objects having a common homogeneity property. An optimality criterion is defined for the estimated number of clusters. The optimality criterion determines how optimal the fit is for the estimated number of clusters to the given clustering data samples. The optimal number of clusters in the data sample is determined from the optimality criterion. The speech sample is segmented based on the optimal boundary locations between segments and the optimal number of segments.
The blind segmentation method can be used in text-dependent speaker verification systems. The blind segmentation method is used to segment an unknown password phrase into subword units. During enrollment in the speaker verification system, the repetition of the speaker's password is used by the blind segmentation module to estimate the number of subwords in the password and locate optimal subword boundaries. For each subword segment of the speaker, a subword segmentator model, such as a neural tree network or a Gaussian mixture model, can be used to model the data of each subword.
Further, there are many multiple model systems that combine the results of different models to further enhance performance.
One critical aspect of any of the above-described speaker verification systems that can directly affect its success is robustness to intersession variability and aging. Intersession variability refers to the situation where a person's voice can experience subtle changes when using a verification system from one day to the next. A user can anticipate the best performance of a speaker verification system when performing a verification immediately after enrollment. However, over time the user may experience some difficulty when using the system. For substantial periods of time, such as several months to years, the effects of aging may also degrade system performance. Whereas the spectral variation of a speaker may be small when measured over a several week period, as time passes this variance will grow, as described in S. Furui, "Comparison of Speaker Recognition Methods using Statistical Features and Dynamic Features", IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29:342-350, April 1981. For some users, the effects of aging may render the original voice model unusable.
What is needed is an adaptation system and method for speaker verification systems, and in particular for discriminant and multiple model-based systems, that requires minimal computational and storage resources. What is needed is an adaptation system that compensates for the effects of intersession variability and aging.
Briefly described, the present invention relates to new model adaptation schemes for speaker verification systems. Model adaptation changes the models learned during the enrollment component dynamically over time, to track aging of the user's voice. The speaker adaptation system of the present invention has the advantage of only requiring a single enrollment for the speaker. Typically, if a person is merely enrolled in a single session, performance of the speaker verification system will degrade due to voice distortions as a consequence of the aging process as well as intersession variability. Consequently, performance of a speaker verification system may become so degraded that the speaker is required to re-enroll, repeating his or her enrollment process. Generally, this process must be repeated every few months.
With the model adaptation system and method of the present invention, re-enrollment sessions are not necessary. The adaptation process is completely transparent to the user. For example, a user may telephone into his or her "Private Branch Exchange" to gain access to an unrestricted outside line. As is customary with a speaker verification system, the user may be requested to state his or her password. With the adaptation system of the present invention, this one updated utterance can be used to adapt the speaker verification model. For example, every time a user is successfully verified, the test data may be considered as enrollment data, and the models trained and modeled using the steps following segmentation. If the password is accepted by the system, the adapted system uses the updated voice features to update the particular speaker recognition model almost instantaneously. Model adaptation effectively increases the number of enrollment samples and improves the accuracy of the system.
Preferably, the adaptation schemes of the present invention can be applied to several types of speaker recognition systems including neural tree networks (NTN), Gaussian Mixture Models (GMMs), and dynamic time warping (DTW) or to multiple models (i.e., combinations of NTNs, GMMs and DTW). Moreover, the present invention can be applied to text-dependent or text-independent systems.
For example, the present invention provides an adaptation system and process that adapts neural tree network (NTN) modules. The NTN is a hierarchical classifier that combines the properties of decision trees and feed-forward neural networks. During initial enrollment, the neural tree network learns to distinguish regions of feature space that belong to the target speaker from those that are more likely to belong to an imposter. These regions of feature space correspond to "leaves" in the neural tree network that contain probabilities. The probabilities represent the likelihood of the target speaker having generated data that falls within that region of feature space. Speaker observations within each region are determined by the number of "target vectors" landing within the region. The probability at each leaf of the NTN is computed as the ratio of speaker observations to total observations encountered at that leaf during enrollment.
During the adaptation method of the present invention, the number of target vectors, or speaker observations, at a leaf is updated based on the new utterance. Each vector of the adaptation utterance is applied to the NTN, and the speaker observation count of the leaf at which the vector arrives is incremented. Because the original numbers of speaker observations and imposter observations are maintained at each leaf, the leaf probabilities can then be recomputed with the new leaf counts. In this manner, the discriminant model can be updated to offset the degraded performance of the model due to aging and intersession variability.
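The leaf-count update described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Leaf` and `Node` classes, the single linear discriminant at each internal node, and the function names `find_leaf` and `adapt` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    speaker_count: int   # target-speaker vectors observed at this leaf
    imposter_count: int  # imposter vectors observed during enrollment

    @property
    def probability(self):
        # Ratio of speaker observations to total observations at the leaf.
        total = self.speaker_count + self.imposter_count
        return self.speaker_count / total if total else 0.0


@dataclass
class Node:
    # Assumed internal node: one linear discriminant routes each vector.
    weights: list
    bias: float
    left: object   # Node or Leaf
    right: object  # Node or Leaf


def find_leaf(tree, vector):
    """Route a feature vector from the root down to the leaf it lands in."""
    node = tree
    while isinstance(node, Node):
        activation = sum(w * x for w, x in zip(node.weights, vector)) + node.bias
        node = node.right if activation > 0 else node.left
    return node


def adapt(tree, utterance_vectors):
    """Increment the speaker count of the leaf each adaptation vector reaches."""
    for vector in utterance_vectors:
        find_leaf(tree, vector).speaker_count += 1


# Toy tree with counts left over from a hypothetical enrollment.
imposter_leaf = Leaf(speaker_count=2, imposter_count=8)
speaker_leaf = Leaf(speaker_count=6, imposter_count=4)
tree = Node(weights=[1.0, 0.0], bias=-0.5, left=imposter_leaf, right=speaker_leaf)

adapt(tree, [[0.9, 0.2], [0.8, 0.1], [0.1, 0.3]])
```

Only the speaker counts change; the imposter counts and the tree structure stay as they were after enrollment, which is why no retraining and no stored training data are needed in this sketch.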
In another embodiment of the present invention, statistical models such as a Gaussian mixture model (GMM) can be adapted based on new voice utterances. In the GMM, a region of feature space for a target speaker is represented by a set of multivariate Gaussian distributions. During initial enrollment, certain component distribution parameters are determined including the mean, covariance and mixture weights corresponding to the observations. Essentially, each of these parameters is updated during the adaptation process based on the added observations obtained with the updated voice utterance. For example, the mean is updated by first scaling the mean by the number of original observations. This value is then added to a new mean based on the updated utterance, and the sum of these mean values is divided by the total number of observations. In a similar manner, the covariance and mixture weights can also be updated.
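The mean update described above can be sketched for a single Gaussian component. The sketch assumes that the vectors of the new utterance have already been assigned to this component (the assignment rule is not specified here) and that the new utterance's mean is weighted by its own observation count, which is equivalent to adding the sum of the new vectors. The function name `adapt_mean` is hypothetical.

```python
def adapt_mean(old_mean, old_count, new_vectors):
    """Return (updated_mean, updated_count) for one Gaussian component.

    old_mean    -- component mean from enrollment (list of floats)
    old_count   -- number of observations behind old_mean
    new_vectors -- adaptation vectors assumed to belong to this component
    """
    new_count = len(new_vectors)
    if new_count == 0:
        return list(old_mean), old_count
    # Sum of the new vectors equals (new mean) * (new count).
    new_sum = [sum(col) for col in zip(*new_vectors)]
    total = old_count + new_count
    updated = [(old_count * m + s) / total for m, s in zip(old_mean, new_sum)]
    return updated, total


# Enrollment mean [1, 2] from 4 observations, adapted with 2 new vectors.
mean, count = adapt_mean([1.0, 2.0], 4, [[3.0, 2.0], [5.0, 4.0]])
```

The result is the same mean that would be obtained by pooling all six observations, without needing the original four to be stored.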
In another embodiment of the present invention, template-based approaches, such as dynamic time warping (DTW), can be adapted using new voice utterances. Given a DTW template that has been trained with the features for N utterances, the features for a new utterance can be averaged into this template. For example, the data for the original data template can be scaled by multiplying it by the number of utterances used to train it, or in this case, N. The data for the new utterance is then added to this scaled data and then the sum is divided by the new number of utterances used in the model, N+1. This technique is very similar to that used to update the mean component of the Gaussian mixture model.
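The running average described above can be written as a short Python sketch. It assumes that the new utterance has already been warped to the same number of frames as the template; the function name `adapt_template` is hypothetical.

```python
def adapt_template(template, num_utterances, warped_features):
    """Average a new, already-warped utterance into a DTW template.

    template        -- list of frames (each a list of floats) built from
                       num_utterances utterances
    warped_features -- new utterance, aligned frame-for-frame to the template
    Returns (updated_template, num_utterances + 1).
    """
    if len(template) != len(warped_features):
        raise ValueError("warped features must match the template length")
    n = num_utterances
    updated = [
        # Scale the template by N, add the new data, divide by N + 1.
        [(n * t + w) / (n + 1) for t, w in zip(t_frame, w_frame)]
        for t_frame, w_frame in zip(template, warped_features)
    ]
    return updated, n + 1


# Template built from N = 3 utterances, adapted with one new utterance.
template = [[1.0, 1.0], [2.0, 2.0]]
updated, n = adapt_template(template, 3, [[5.0, 1.0], [2.0, 6.0]])
```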
Although not necessary, the adaptive modeling approach used in the present invention is preferably based on subword modeling for the NTN and GMM models. The adaptation method occurs during verification. For adapting the DTW template, it is preferred that whole-word modeling be used. As part of verification, features are first extracted for an adaptation utterance according to any conventional feature extraction method. The features are then matched, or "warped", onto a DTW template. This provides 1) a modified set of features that best matches the DTW template and 2) a distance, or "distortion", value that can be used as a measurement for speaker authenticity. The modified set of features output by the DTW warping has been found to remedy the negative effects of noise or speech that precedes or follows a spoken password. At this point, the warped features are used to adapt the DTW template.
Next, the feature data is segmented into sub-words for input into the NTN and GMM models. While several types of segmentation schemes can be used with the present invention, including hierarchical and nonhierarchical speech segmentation schemes, it is preferred that the spectral features be applied to a blind segmentation algorithm, such as that disclosed in U.S. patent application Ser. No. 08/827,562 entitled "Blind Clustering of Data With Application to Speech Processing Systems", filed on Apr. 1, 1997, and its corresponding U.S. provisional application no. 60/014,537 entitled "Blind Speech Segmentation", filed on Apr. 2, 1996, both of which are herein incorporated by reference. During enrollment in the speaker verification system, the repetition in the speaker's voice is used by the blind segmentation module to estimate the number of subwords in the password, and to locate the optimal subword boundaries.
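The warping step can be illustrated with a textbook dynamic time warping alignment. The sketch below is an assumption-laden illustration rather than the method used by the invention: it uses Euclidean frame distance, the three standard step directions, averages the input frames aligned to each template frame to form the warped features, and reports the accumulated cost divided by the path length as the distortion. It does not relax the endpoints, so it does not by itself discard speech preceding or following the password. The names `frame_distance` and `dtw_warp` are hypothetical.

```python
def frame_distance(a, b):
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_warp(template, features):
    """Align features to template; return (warped_features, distortion)."""
    T, F = len(template), len(features)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning template[:i] with features[:j].
    cost = [[inf] * (F + 1) for _ in range(T + 1)]
    cost[0][0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, F + 1):
            d = frame_distance(template[i - 1], features[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, collecting the input frames matched to each
    # template frame.
    aligned = [[] for _ in range(T)]
    path_len = 0
    i, j = T, F
    while i > 0 and j > 0:
        aligned[i - 1].append(features[j - 1])
        path_len += 1
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )

    # Warped features: one frame per template frame (mean of matched frames).
    warped = [
        [sum(col) / len(frames) for col in zip(*frames)] for frames in aligned
    ]
    distortion = cost[T][F] / path_len
    return warped, distortion


# The utterance repeats its first frame; warping collapses it onto the template.
template = [[0.0], [1.0], [2.0]]
features = [[0.0], [0.0], [1.0], [2.0]]
warped, distortion = dtw_warp(template, features)
```

Because `warped` has exactly one frame per template frame, it can be averaged into the template frame by frame.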
The data at each sub-word is then modeled preferably with a first and second modeling module. For example, the first modeling module can be a neural tree network (NTN) and the second modeling module can be a Gaussian mixture model (GMM). In this embodiment, the adaptive method and system of the present invention is applied to both of these subword models individually in addition to the DTW template to achieve enhanced overall performance.
The outputs of these models, namely the NTN, GMM and DTW, are then combined, according to any one of several multiple model combination algorithms known in the art, to make a decision with respect to the speaker.
The resulting performance after adaptation is comparable to that obtained by retraining the model with the addition of new speech utterances. However, while retraining is time-consuming, the adaptation process can conveniently be performed following a verification, while consuming minimal computational resources. Further, the adaptation is transparent to the speaker. An additional benefit of adaptation is that the original training data does not need to be stored, which can be burdensome for systems deployed within large populations.
The invention can be used with a number of other adaptation techniques, in addition to the model adaptation described and claimed herein. These techniques include fusion adaptation, channel adaptation and threshold adaptation.
The invention will be more fully described by reference to the following drawings.