The present invention relates generally to methods and apparatus for verifying speakers, such as voice recognition systems.
Methods for verifying speakers (hereafter also "speaker verification") generally use person-specific properties of the human voice as biometric features. These features make it possible to check the identity of a person on the basis of a brief voice (or: speech) sample of that person. In such methods, speaker-specific features are usually extracted from at least one digital voice (or: speech) sample. Acoustic features that reflect the person-specific dimensions of the vocal tract and the typical time sequence of the articulation motions are particularly suitable as such features.
In speech recognition methods, there are generally two different phases: a training phase and a test phase.
In the training phase of what are referred to as text-dependent speaker verification methods, expressions prescribable by a user are spoken into an arrangement that implements the method for speaker verification. Reference feature vectors that contain speaker-specific features extracted from the digital reference voice (or: speech) samples are formed for these reference voice (or: speech) samples. To determine the individual reference feature vectors or feature vectors from the voice (or: speech) signals, the respective voice (or: speech) signal is usually divided into small pseudo-stationary sections, which are referred to as frames. The voice (or: speech) signal is assumed to be stationary within each pseudo-stationary section. The pseudo-stationary sections typically have a duration of about 10 to 20 ms.
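By way of illustration only, the division of a sampled voice (or: speech) signal into frames can be sketched as follows. The function name `split_into_frames`, the use of non-overlapping frames, and the discarding of a trailing incomplete frame are assumptions made for this sketch; they are not taken from the cited documents.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0):
    """Split a sampled signal into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing incomplete frame
    is discarded. Practical systems often use overlapping frames.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Only complete frames are kept.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

At a sampling rate of 8000 Hz, for example, a 20 ms frame under these assumptions contains 160 samples.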
In the test phase, at least one feature vector, and usually a plurality of feature vectors, is formed for a spoken voice (or: speech) signal, i.e., the voice (or: speech) sample just spoken by the person to be verified. This feature vector or these feature vectors are compared to at least one reference feature vector. Given an adequately small difference, i.e., given great similarity between the feature vector and the reference feature vector, the speaker is accepted as the speaker to be verified. The tolerance range for the decision as to when a speaker is to be accepted or rejected as the speaker to be verified is usually determined in the training phase. However, this range is also freely prescribable during the test phase, depending on the security demands to be made of the verification method.
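The comparison and the acceptance decision described above can be sketched, purely as an example, as follows. The Euclidean distance, the nearest-reference matching, the averaging over all feature vectors, and the names `euclidean_distance` and `verify_speaker` are assumptions of this sketch; the description above does not prescribe a particular distance measure.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify_speaker(feature_vectors, reference_vectors, threshold):
    """Accept the speaker if the mean distance to the closest reference
    feature vector does not exceed the threshold (the tolerance range)."""
    if not feature_vectors or not reference_vectors:
        raise ValueError("at least one vector of each kind is required")
    # Assumption: each test vector is matched to its nearest reference vector.
    distances = [
        min(euclidean_distance(f, r) for r in reference_vectors)
        for f in feature_vectors
    ]
    mean_distance = sum(distances) / len(distances)
    return mean_distance <= threshold
```

A smaller `threshold` corresponds to a higher security demand: fewer impostors are accepted, at the cost of rejecting the genuine speaker more often.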
The above-described method wherein a decision as to whether the speaker is accepted as the speaker to be verified is made on the basis of a comparison of the at least one feature vector to the reference feature vector is known from the document: S. Furui, Cepstral Analysis Technique for Automatic Speaker Verification, IEEE Transactions ASSP, Vol. ASSP-29, No. 2, pp. 254-272, April 1981, fully incorporated herein by reference.
A considerable disadvantage of the method described by S. Furui is that the method exhibits considerable uncertainty in the verification of the speaker. The uncertainty results from the fact that a decision threshold for the acceptance or rejection of the speaker must be defined. The decision threshold is defined solely on the basis of voice (or: speech) samples of the user to be verified.
A method for the pre-processing of spoken voice (or: speech) signals in voice processing, as well as basics about feature extraction and feature selection, i.e., basics about the formation of feature vectors for the voice signals, are also known, for example from the document: G. Ruske, Automatische Spracherkennung, Methoden der Klassifikation und Merkmalsextraktion, Oldenbourg-Verlag, ISBN 3-486-20877-2, pp. 11-22 and pp. 69-105, 1988, fully incorporated herein by reference.
In addition, B. Kammerer and W. Kupper, Experiments for Isolated Word Recognition, Single-and Two-Layer-Perceptrons, Neural Networks, Vol. 3, pp. 693-706, 1990, fully incorporated herein by reference, discloses that a plurality of voice (or: speech) samples can be derived from a reference voice (or: speech) sample by time distortion for the formation of a plurality of reference feature vectors in speaker recognition.
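As a minimal sketch of deriving several voice (or: speech) samples from one reference sample by time distortion, a uniform stretching or compression of the time axis with linear interpolation may be assumed. The cited document is not reproduced here; the uniform stretch factors, the linear interpolation, and the names `time_stretch` and `derive_variants` are assumptions of this example.

```python
def time_stretch(samples, factor):
    """Resample a signal to round(len(samples) * factor) points using
    linear interpolation (assumed, uniform time distortion)."""
    if factor <= 0 or len(samples) < 2:
        raise ValueError("need a positive factor and at least two samples")
    new_len = max(2, int(round(len(samples) * factor)))
    last = len(samples) - 1
    out = []
    for i in range(new_len):
        # Position of output sample i on the original time axis.
        pos = i * last / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def derive_variants(reference, factors=(0.9, 1.0, 1.1)):
    """Derive several time-distorted samples from one reference sample."""
    return [time_stretch(reference, f) for f in factors]
```

Each derived sample could then be divided into frames and converted into reference feature vectors in the same way as the original reference sample.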