The invention is directed to an automatic speaker verification (ASV) system and method useful for storing and processing voice signals to automatically ascertain the identity of an individual.
1. Field of the Invention
The invention relates to the fields of digital speech processing and speaker recognition.
2. Description of Related Art
In many situations it is desired to verify the identity of a person, such as a consumer. For example, in credit card transactions, it is important to confirm that a consumer presenting a credit card (or credit card number) to a merchant is authorized to use the credit card. Currently, the identity of the consumer is manually verified by the merchant. The back of the credit card contains a signature strip, which the consumer signs upon credit card issuance. The actual signature of the consumer at the time of sale is compared to the signature on the back of the credit card by the merchant. If, in the merchant's judgement, the signatures match, the transaction is allowed to proceed.
Another prior art system places a photograph of an authorized user on the credit card. At the time of the transaction, the merchant compares the photograph on the card with the face of the person presenting the card. If there appears to be a match, the transaction is allowed to proceed.
However, these prior art methods have serious drawbacks. These systems are manual and consequently prone to human error. Signatures are relatively easy to forge and differences between signatures and photographs may go unnoticed by inattentive merchants. Further, these systems cannot be used with credit card transactions which do not occur in person, for example, transactions which occur via telephone.
Voice verification systems, sometimes known as automatic speaker verification (ASV) systems, attempt to cure the deficiencies of these prior art methods. These systems attempt to match the voice of the person whose identity is undergoing verification with a known voice.
One type of voice recognition system is a text-dependent automatic speaker verification system. The text-dependent ASV system requires that the user speak a specific password or phrase (the "password"). This password is determined by the system or by the user during enrollment. However, in most text-dependent ASV systems, the password is constrained to be within a fixed vocabulary, such as a limited number of numerical digits. The limited number of password phrases gives an imposter a higher probability of discovering a person's password, reducing the reliability of the system.
Other text-independent ASV systems of the prior art utilize a user-selectable password. In such systems, the user enjoys the freedom to make up his/her own password with no constraints on vocabulary words or language. The disadvantage of these types of systems is that they increase the processing requirement of the system, because it is much more technically challenging to model and verify a voice pattern of an unknown transcript (i.e., a highly variable context).
Modeling of speech has been done at the phrase, word, and subword level. In recent years, several subword-based speaker verification systems have been proposed using either Hidden Markov Models ("HMM") or Artificial Neural Network ("ANN") references. Modeling at the subword level expands the versatility of the system. Moreover, it is also conjectured that the variations in speaking styles among different speakers can be better captured by modeling at the subword level.
Another challenge posed under real-life operating environments is that noise and background speech/music may be detected and considered as part of the password. Another problem with transmission or communications systems is that channel-specific distortion occurs over channels, such as the transducers, telephone lines and telephone equipment which connect users to the system. Further, ASV systems using modeling need to adapt to changes in the user and to prior successful and unsuccessful attempts at verification.
What is needed are reliable systems and methods for automatic speaker verification of user selectable phrases.
What is needed is a user-selectable ASV system in which accuracy is improved over prior ASV systems.
What is needed is a word or phrase detector which can identify key portions of spoken password phrases over background noise.
What is needed is channel adaptation to adapt a system in response to signals received over different channels.
What is needed is fusion adaptation to adapt a system in response to previous errors and successes.
What is needed is threshold adaptation to adapt a system in response to previous errors and successes.
What is needed is model adaptation to adapt a system's underlying model components in response to previous successes.
The voice print system of the present invention builds and improves upon existing ASV systems. The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units without any linguistic knowledge of the password. Subword modeling is performed using a discriminant training-based classifier, namely a Neural Tree Network (NTN). The present NTN is a hierarchical classifier that combines the properties of decision trees and feed-forward neural networks. The system also takes advantage of such concepts as multiple classifier fusion and data resampling to successfully boost performance.
Key word/key phrase spotting is used to optimally locate the password. Channel adaptation removes the nonuniform effects of different environments which lead to varying channel characteristics, such as distortion. Channel adaptation is able to remove the characteristics of the test channel and/or enrollment channel to increase accuracy.
Fusion adaptation is used to dynamically change the weight accorded to the individual classifier models, which increases the flexibility of the system. Threshold adaptation dynamically alters the threshold necessary to achieve successful verification. Threshold adaptation is useful to incrementally change false-negative results. Model adaptation gives the system the capability to retrain the classifier models upon the occurrence of subsequent successful verifications.
The voice print system can be employed for user validation for telephone services such as cellular phone services and bill-to-third-party phone services. It can also be used for account validation for information system access.
All ASV systems include at least two components, an enrollment component and a testing component. The enrollment component is used to store information concerning a user's voice. This information is then compared to the voice undergoing verification (testing) by the test component. The system of the present invention includes inventive enrollment and testing components, as well as a third, "bootstrap" component. The bootstrap component is used to generate data which assists the enrollment component to model the user's voice.
1. Enrollment Summary
An enrollment component is used to characterize a known user's voice and store the characteristics in a database, so that this information is available for future comparisons. The system of the present invention utilizes an improved enrollment process. During enrollment, the user speaks the password, which is sampled by the system. Analog-to-digital conversion (if necessary) is conducted to obtain digital speech samples. Preprocessing is performed to remove unwanted silence and noise from the voice sample, and to indicate portions of the voice sample which correspond to the user's voice.
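The silence-removal part of preprocessing can be illustrated with a minimal sketch. It assumes a simple energy-based endpoint detector, which the text does not specify; the function names, the frame length, and the relative threshold are all hypothetical choices made for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def trim_silence(samples, frame_len=4, rel_threshold=0.1):
    """Keep the span from the first to the last 'active' frame.

    A frame counts as active when its energy reaches rel_threshold
    times the peak frame energy (an assumed, illustrative rule).
    """
    energies = frame_energies(samples, frame_len)
    if not energies or max(energies) == 0:
        return []
    cutoff = rel_threshold * max(energies)
    active = [i for i, e in enumerate(energies) if e >= cutoff]
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return samples[start:end]


# Eight silent samples, eight voiced samples, eight silent samples.
signal = [0] * 8 + [5, -5, 5, -5, 4, -4, 4, -4] + [0] * 8
voiced = trim_silence(signal)
```

A practical detector would likely use overlapping frames and a noise-floor estimate rather than a fixed fraction of the peak energy.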
Next, the transmission channel carrying the user's enrollment voice signal is examined. The characteristics of the enrollment channel are estimated and stored in a database. The database may be indexed by identification information, such as by the user's name, credit card number, account identifier, etc.
Feature extraction is then performed to extract features of the user's voice, such as pitch, spectral frequencies, intonations, etc. Feature extraction may also focus, or capture, desired segments of the voice sample and reject other unwanted segments. The feature extraction process generates a number of vectors relating to features of the voice segment. Using the feature vectors, a key word/key phrase reference template may be generated and stored in a voice print database. The reference template is used during testing to locate the spoken password from extraneous speech or noise.
Next, segmentation of the voice segment occurs. Segmentation preferably occurs via automatic blind speech segmentation techniques. Alternatively, segmentation may be performed by older manual or semi-automatic techniques. Segmentation divides the voice sample into a number of subwords. The subwords are used in a modeling process.
In recent years, several subword-based speaker verification systems have been proposed. The present invention uses subword modeling and may use any of the known techniques, but preferably uses a discriminant training based classifier. The discriminant training based classifier is called a Neural Tree Network (NTN). The NTN is a hierarchical classifier that combines the properties of decision trees and feed-forward Neural Networks.
The system also utilizes the principles of multiple classifier fusion and data resampling. A multiple classifier system is a powerful solution for robust pattern classification because it allows for simultaneous use of arbitrary feature descriptors and classification procedures. The additional classifier used herein is the Gaussian Mixture Model (GMM) classifier.
In the event that only a small amount of data is available for modeling a speaker, the resulting classifier is very likely to be biased. Data resampling artificially expands the size of the sample pool and therefore improves the generalizations of the classifiers. One of the embodiments of the classifier fusion and data resampling scheme is a "leave-one-out" data resampling method.
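The "leave-one-out" idea can be sketched as follows. The sketch only shows how the resampled subsets are formed; how each subset is used (for example, training one classifier per subset and scoring the held-out utterance) is an assumption and is not shown. The function name is hypothetical.

```python
def leave_one_out(utterances):
    """Yield (training_subset, held_out) pairs, one per utterance."""
    for i, held_out in enumerate(utterances):
        yield utterances[:i] + utterances[i + 1:], held_out


# Three enrollment repetitions of the password give three subsets,
# each omitting a different repetition.
splits = list(leave_one_out(["rep1", "rep2", "rep3"]))
```

With N enrollment utterances this produces N training subsets of N-1 utterances each, so every utterance serves once as unseen data.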
A fusion function, which is set at a default value and stored in the database, is used to weight the individual classifier scores, and to set a threshold value. The threshold value is stored in the database for use in the verification process. Thus, enrollment produces a voice print database containing an index (such as the user's name or credit card number), along with enrollment channel data, classifier models, feature vectors, segmentation information, multiple trained classifier data, fusion constant, and a recognition threshold.
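The contents of one voice print database entry might be laid out as in the following sketch. The field names, the types, the in-memory dictionary, and the default values of 0.5 and 0.0 are assumptions made for illustration; the text names the stored items but not their representation.

```python
from dataclasses import dataclass


@dataclass
class VoicePrintRecord:
    """One enrolled user's entry (hypothetical layout)."""
    index: str                 # e.g. user name or account identifier
    enrollment_channel: list   # estimated enrollment channel characteristics
    classifier_models: dict    # trained models, keyed by classifier name
    feature_vectors: list      # feature vectors from enrollment speech
    segmentation: list         # subword boundaries from blind segmentation
    fusion_constant: float = 0.5   # assumed default value
    threshold: float = 0.0         # assumed placeholder until estimated


voice_print_db = {}


def store(record):
    """Index the record by the user's identification information."""
    voice_print_db[record.index] = record


store(VoicePrintRecord(
    index="account-001",
    enrollment_channel=[0.5, -0.5],
    classifier_models={"ntn": None, "gmm": None},
    feature_vectors=[[1.0, 2.0]],
    segmentation=[0, 12, 30],
))
```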
The threshold is used when a user is undergoing verification (or testing by the test component). A user is verified as the known user when the threshold is reached or exceeded.
2. Test Component Summary
The test component is the component which performs the verification. During testing or verification, the system first accepts "test speech" and index information from a user claiming to be the person identified by the index information. Voice data indexed in the database is retrieved and used to process the test speech sample.
During verification, the user speaks the password into the system. This "test speech" password undergoes preprocessing, as previously described with respect to the enrollment component. The next step is to perform channel adaptation.
Channel adaptation, in a preferred embodiment, is performed by removing from the test sample the characteristics of the channel from which the test sample was received. Next, the characteristics of the enrollment channel which were stored by the enrollment component are recalled. The test sample is filtered through the recalled enrollment channel. This type of channel adaptation removes the characteristics from the test channel and supplies the characteristics from the enrollment channel to the test speech so that the test speech matches the transmission channel of the originally enrolled speech.
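The two steps of channel adaptation can be illustrated with a small sketch. The text describes filtering the speech itself; the sketch instead works on cepstral feature frames, where a fixed linear channel appears as an additive offset, and it estimates the test channel as the mean frame (cepstral mean subtraction). Both choices, and the function names, are assumptions for illustration.

```python
def mean_vector(frames):
    """Component-wise mean of a list of equal-length frames."""
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]


def adapt_channel(test_frames, enrollment_channel):
    """Remove the estimated test channel, then apply the enrollment channel."""
    test_channel = mean_vector(test_frames)  # assumed channel estimate
    return [
        [c - t + e for c, t, e in zip(frame, test_channel, enrollment_channel)]
        for frame in test_frames
    ]


test_frames = [[1.0, 2.0], [3.0, 4.0]]
stored_enrollment_channel = [0.5, -0.5]
adapted = adapt_channel(test_frames, stored_enrollment_channel)
```

After adaptation the mean of the adapted frames equals the stored enrollment channel estimate, which is the sense in which the test speech is made to match the channel of the enrolled speech.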
After channel adaptation, feature extraction is performed on the test sample. This occurs as previously described with respect to the enrollment component. After feature extraction, it is desired to locate, or "spot," the phrases in the test speech and simultaneously avoid areas of background noise.
The performance of ASV systems can be significantly degraded by background noise and sounds, such as speech and music, that can lead and/or trail the user's actual spoken password. This is because small differences between the speech and the high volume noise/sounds may lead the preprocessing algorithm to incorrectly treat the background noise and sounds as part of the password. Accordingly, a password sample that includes the noise and background sounds will not be recognized. To combat the effects of background noise, the invention uses a key word/key phrase spotter to identify the password phrase.
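Spotting a password inside longer test speech can be sketched as a template search. The sketch slides a fixed-length window over the test feature frames and keeps the window closest to the stored reference template. The fixed window length, the squared Euclidean distance, and the function names are simplifying assumptions; a practical spotter would likely allow the phrase to be spoken faster or slower than the template.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def spot_phrase(test_frames, template):
    """Return (start, end) of the window that best matches the template."""
    width = len(template)
    if len(test_frames) < width:
        raise ValueError("test speech is shorter than the template")

    def window_cost(start):
        return sum(
            frame_distance(test_frames[start + i], template[i])
            for i in range(width)
        )

    best = min(range(len(test_frames) - width + 1), key=window_cost)
    return best, best + width


template = [[1.0], [2.0], [3.0]]
# Leading and trailing frames stand in for background sounds.
test_frames = [[0.1], [0.0], [1.1], [2.0], [2.9], [0.2]]
start, end = spot_phrase(test_frames, template)
```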
After key word/key phrase spotting, automatic speech segmentation occurs. Preferably the automatic speech segmentation is not "blind" segmentation (although "blind" segmentation could be used), but is "force" alignment segmentation. This force segmentation uses the segments previously obtained by the blind segmentation performed in the enrollment component. The test speech is therefore segmented using the segmentation information previously stored. The "force" segmentation results in the identification of subword borders. The subwords undergo multiple classifier fusion.
The multiple classifiers of the enrollment component are used to "score" the subword data, and the scores are then fused, or combined. The result of the fusion is a "final score." The final score is compared to the stored threshold. If the final score exceeds or equals the threshold, the test sample is verified as the user's. If the final score is less than the threshold, the test sample is declared not to be the user's. The final score and date of verification, as well as other related details, may be stored in the database as well.
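The scoring and decision step can be sketched as follows. The sketch assumes each classifier produces one score per subword, that a classifier's overall score is the average of its subword scores, and that fusion is a weighted sum; none of these details is fixed by the text, and the classifier names and numbers are illustrative only.

```python
def classifier_score(subword_scores):
    """Average one classifier's scores over all subwords (assumed rule)."""
    return sum(subword_scores) / len(subword_scores)


def final_score(scores_by_classifier, weights):
    """Weighted sum of the per-classifier scores (assumed fusion rule)."""
    return sum(
        weights[name] * classifier_score(scores)
        for name, scores in scores_by_classifier.items()
    )


def verify(scores_by_classifier, weights, threshold):
    """Accept when the final score equals or exceeds the threshold."""
    score = final_score(scores_by_classifier, weights)
    return score >= threshold, score


scores = {"ntn": [1.0, 0.5], "gmm": [0.25, 0.25]}
weights = {"ntn": 0.5, "gmm": 0.5}
accepted, score = verify(scores, weights, threshold=0.5)
```

Note that a final score exactly equal to the threshold is accepted, matching the "exceeds or equals" rule stated above.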
The invention also uses a number of adaptation techniques, in addition to channel adaptation. These techniques include fusion adaptation, threshold adaptation and model adaptation.
Fusion adaptation modifies the fusion function for n classifiers, S(α). The fusion function provides more weight to some classifiers than to others. Fusion adaptation dynamically reallocates the weight between the classifiers, preferably by changing a fusion constant, α.
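A minimal sketch of fusion adaptation for two classifiers follows. It assumes a linear fusion function in which the fusion constant weights the first classifier and its complement weights the second, and a hypothetical update rule that shifts weight toward whichever classifier scored a verified user higher. The form of the function, the update rule, and the step size are illustrative assumptions, not details given in the text.

```python
def fused_score(alpha, ntn_score, gmm_score):
    """Assumed linear fusion: alpha weights the NTN, (1 - alpha) the GMM."""
    return alpha * ntn_score + (1.0 - alpha) * gmm_score


def adapt_alpha(alpha, ntn_score, gmm_score, step=0.125):
    """Hypothetical update applied after a successful verification.

    Moves weight toward the classifier that gave the verified user
    the higher score, and keeps alpha inside [0, 1].
    """
    if ntn_score > gmm_score:
        alpha += step
    elif gmm_score > ntn_score:
        alpha -= step
    return min(1.0, max(0.0, alpha))


alpha = 0.5
before = fused_score(alpha, 0.75, 0.25)
alpha = adapt_alpha(alpha, 0.75, 0.25)
after = fused_score(alpha, 0.75, 0.25)
```

In this example the same pair of classifier scores yields a higher fused score after adaptation, because more weight now rests on the classifier that separated this user more strongly.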
Threshold adaptation dynamically modifies the stored threshold value over time. The initial threshold is determined during enrollment using voice samples. By further using information on the success of recent verification attempts, the decision threshold can be better estimated.
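One way threshold adaptation could work is sketched below. The rule shown, which nudges the stored threshold toward a target placed a fixed margin below the lowest recent accepted score, is a hypothetical choice; the text says only that recent verification results are used to better estimate the threshold. The function name, rate, and margin are assumptions.

```python
def adapt_threshold(threshold, recent_accepted_scores, rate=0.25, margin=0.25):
    """Move the threshold a fraction of the way toward a new target.

    The target sits `margin` below the lowest recent accepted score
    (an assumed rule). With no recent scores the threshold is unchanged.
    """
    if not recent_accepted_scores:
        return threshold
    target = min(recent_accepted_scores) - margin
    return threshold + rate * (target - threshold)


# The user has recently scored well above the stored threshold of 0.5,
# so the threshold is raised slightly.
raised = adapt_threshold(0.5, [1.25, 1.5])
# Scores sitting close to the threshold pull it down instead.
lowered = adapt_threshold(0.5, [0.5])
```

Changing the threshold in small increments limits the effect of any single unusual verification attempt.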
Model adaptation changes the models learned during the enrollment component dynamically over time, to track aging of the user's voice. For example, every time a user is successfully verified, the test data may be considered as enrollment data, and the classifier trained and modeled using the steps following automatic blind segmentation (in the enrollment component). Model adaptation effectively increases the number of enrollment samples and improves the accuracy of the system.
3. "Bootstrapping" Component Summary
Bootstrapping is used to generate a pool of speech data representative of the speech of nonspeakers, or "antispeakers." This data is used during enrollment to train the discriminant training-based classifiers. Bootstrapping involves obtaining voice samples from antispeakers, preprocessing the voice samples (as in the enrollment phase), and inverse channel filtering the preprocessed voice samples. Inverse channel filtering removes the characteristics of the channel on which the antispeaker voice sample is obtained. After inverse channel filtering, feature generation and automatic blind voice segmentation occur, as in the enrollment component. The segments and feature vectors are stored in an antispeaker database for use by the enrollment component.