The present invention relates to speaker recognition. In particular, the present invention relates to training and using models for speaker recognition.
A speaker recognition system identifies a person from their speech. Such systems can be used to control access to areas or computer systems, as well as to tailor computer settings for a particular person.
In many speaker recognition systems, the system asks the user to repeat a phrase that will be used for recognition. The speech signal that is generated while the user is repeating the phrase is then used to train a model. When a user later wants to be identified by their speech, they repeat the identification phrase. The resulting speech signal, sometimes referred to as a test signal, is then applied against the model to generate a probability that the test signal was generated by the same person who produced the training signals.
The generated probability can then be compared to other probabilities that are generated by applying the test signal to other models. The model that produces the highest probability is then considered to have been trained by the same speaker who generated the test signal. In other systems, the probability is compared to a threshold probability to determine whether it is sufficiently high to identify the person as the same person who trained the model. Still other systems compare the probability to the probability produced by a general model designed to represent all speakers.
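The three decision rules described above can be sketched as follows. This is a minimal illustration only: it assumes each model has already produced a log-probability score for the test signal, and the function names, the `threshold` value, and the `margin` parameter are hypothetical and do not come from any particular system.

```python
def identify_closed_set(scores):
    """Pick the speaker whose model gives the highest score.

    `scores` maps a speaker label to the (assumed) log-probability
    of the test signal under that speaker's model.
    """
    return max(scores, key=scores.get)


def verify_threshold(score, threshold):
    """Accept the claimed speaker if the score reaches a fixed threshold."""
    return score >= threshold


def verify_against_general_model(score, general_score, margin=0.0):
    """Accept if the speaker model beats the general (all-speaker) model.

    In the log domain, comparing two probabilities becomes a difference;
    `margin` is an assumed tuning parameter.
    """
    return (score - general_score) > margin


scores = {"alice": -10.0, "bob": -12.5}
best = identify_closed_set(scores)
accepted = verify_threshold(scores[best], threshold=-11.0)
beats_general = verify_against_general_model(scores[best], general_score=-11.5)
```

Working with log-probabilities rather than raw probabilities is a common choice because the raw values for long utterances are extremely small, but it is an assumption of this sketch rather than something the text requires.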
The performance of speaker recognition systems is affected by the amount and type of background noise in the test and training signals. In particular, the performance of these systems is negatively impacted when the background noise in the training signal is different from the background noise in the test signal. This is referred to as having mismatched signals, which generally yields lower accuracy than having so-called matched training and testing signals.
To overcome this problem, the prior art has attempted to match the noise in the training signal to the noise in the testing signal. Under some systems, this is done using a technique known as spectral subtraction. In spectral subtraction, the systems attempt to remove as much noise as possible from both the training signal and the test signal. To remove the noise from the training signal, the systems first collect noise samples during pauses in the speech found in the training signal. From these samples, the mean of each frequency component of the noise is determined. Each mean is then subtracted from the corresponding frequency component of the remaining training speech signal. A similar procedure is followed for the test signal: the mean strength of each frequency component of the noise in the test signal is determined and then subtracted from the test speech.
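The spectral subtraction procedure described above might be sketched as follows. The sketch assumes the signal has already been divided into frames, that each frame is a list of per-frequency strengths, and that a flag marks which frames contain speech; the names `estimate_noise_mean`, `spectral_subtract`, and `is_speech` are hypothetical, and how pauses are detected is left outside the example.

```python
def estimate_noise_mean(frames, is_speech):
    """Average each frequency component over the pause (non-speech) frames."""
    noise_frames = [f for f, s in zip(frames, is_speech) if not s]
    if not noise_frames:
        raise ValueError("no pause frames available for noise estimation")
    n_bins = len(noise_frames[0])
    return [
        sum(f[k] for f in noise_frames) / len(noise_frames)
        for k in range(n_bins)
    ]


def spectral_subtract(frames, is_speech):
    """Subtract the per-frequency noise mean from every speech frame."""
    noise_mean = estimate_noise_mean(frames, is_speech)
    return [
        [x - n for x, n in zip(f, noise_mean)]
        for f, s in zip(frames, is_speech)
        if s
    ]


# Two pause frames followed by one speech frame, two frequency bins each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 1.0]]
is_speech = [False, False, True]
cleaned = spectral_subtract(frames, is_speech)  # [[3.0, -1.0]]
```

The same two functions would be applied separately to the training signal and to the test signal, each with its own noise estimate.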
Spectral subtraction is less than ideal as a noise matching technique. First, spectral subtraction does not remove all noise from the signals. As such, some noise remains mismatched. In addition, because spectral subtraction performs a subtraction, it can generate a training signal or a test signal that has a negative strength for a particular frequency. To avoid this, many spectral subtraction techniques abandon the subtraction whenever it would result in a negative strength and use a flooring technique instead. In those cases, the subtraction is replaced with an attenuation of the particular frequency component.
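One way the flooring fallback could look is sketched below. It is an assumption-laden illustration: the function name `subtract_with_floor` and the `floor_factor` attenuation constant are hypothetical, and actual systems differ in how they choose the floor.

```python
def subtract_with_floor(spectrum, noise_mean, floor_factor=0.1):
    """Subtract the noise mean per frequency, attenuating instead of going negative.

    `spectrum` and `noise_mean` are lists of per-frequency strengths.
    `floor_factor` is an assumed attenuation constant between 0 and 1.
    """
    result = []
    for x, n in zip(spectrum, noise_mean):
        diff = x - n
        if diff >= 0:
            result.append(diff)              # ordinary spectral subtraction
        else:
            result.append(floor_factor * x)  # fallback: attenuate this frequency
    return result


floored = subtract_with_floor([5.0, 1.0], [2.0, 2.0])  # [3.0, 0.1]
```

In this example the second frequency would have become negative under plain subtraction, so it is attenuated rather than subtracted. Frequencies handled this way are treated differently from the rest, which is one reason the residual noise in the training and test signals can remain mismatched.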
For these reasons, a new noise matching technique for speaker recognition is needed.