1. Field of the Invention
The present invention relates to automatic speaker voice recognition, and more particularly to verification of a speaker authorized to access a service application, whether independently of or depending on the content of the voice segment spoken by the speaker, such as a password.
2. Description of the Prior Art
Speaker verification, or voice authentication, is an ergonomic way of securing access. Unfortunately, its present performance does not assure total security.
A developer of speaker verification means in an automatic voice recognition device, which constitutes the subject matter of the invention, must achieve a compromise between an authorized level of fraud, corresponding to the rate of impostors accessing the application, and the required level of ergonomics, corresponding to the rate of acceptance of legitimate speakers to whom the service application cannot be refused.
The compromise between security and ergonomics conditions the value of a decision threshold. Any speaker verification method yields a verification score that represents the similarity between the voice model of a presumed authorized speaker and a voice segment from an unknown speaker seeking access to the application. The verification score is then compared to the decision threshold. Depending on the result of this comparison, the device decides whether to accept or to reject the unknown speaker, in other words whether or not to authorize the speaker to access the application. If the decision threshold is severe, and thus high, few impostors will be accepted by mistake, but many authorized speakers will be rejected. If the decision threshold is lax, and thus low, few authorized speakers will be rejected but many impostors will be accepted.
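The accept/reject decision described above reduces to a comparison of the verification score against the decision threshold. The following minimal sketch illustrates this; the function name, score values, and threshold value are illustrative only and are not taken from any particular verification system.

```python
def verify(score: float, threshold: float) -> bool:
    """Accept the unknown speaker if the verification score
    meets or exceeds the decision threshold."""
    return score >= threshold

# A severe (high) threshold rejects more impostors but also more
# authorized speakers; a lax (low) threshold does the opposite.
accepted = verify(0.82, threshold=0.75)   # score above threshold: accept
rejected = verify(0.60, threshold=0.75)   # score below threshold: reject
```

Raising the threshold trades erroneous acceptances of impostors against erroneous rejections of authorized speakers, which is precisely the compromise discussed above.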
The difficulty therefore lies in determining the decision threshold, especially since, for the same rate of acceptance, the threshold varies from one speaker to another (“A COMPARISON OF A PRIORI THRESHOLD SETTING PROCEDURES FOR SPEAKER VERIFICATION IN THE CAVE PROJECT” J.-B. PIERROT et al., Proceedings ICASSP, 1998).
Thus the distribution of the verification scores depends on the speaker voice model used to calculate them. Optimum speaker verification therefore requires a respective decision threshold for each model.
One way to circumvent the speaker sensitivity of the threshold is to normalize the distribution of the verification scores. Applying an appropriate transformation to render the distributions of the scores independent of the speaker model solves the problem of searching for a threshold for each speaker, i.e. for each speaker model. Thus the problem is shifted to that of finding a way of normalizing the scores.
In the “z-norm” method described in the paper “A MAP APPROACH, WITH SYNCHRONOUS DECODING AND UNIT-BASED NORMALIZATION FOR TEXT-DEPENDENT SPEAKER VERIFICATION”, Johnny MARIETHOZ et al., Proceedings ICASSP, 2000, the verification score distribution is normalized by means of the parameters μX and σX of the distribution of estimated impostor scores over a population of impostors. If sX(Y) is the verification score for a voice segment Y to be tested against an authorized speaker model X, the verification score normalized by the z-norm method is:

s̃X(Y) = (sX(Y) − μX) / σX

in which μX and σX are respectively the mean and the standard deviation of the impostor score distribution for the model X. These normalization parameters are estimated beforehand, during a learning phase, using a database of recordings that are considered to be plausible occurrences of imposture for the speaker model X.
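The z-norm computation can be sketched as follows. The list of impostor scores stands in for the learning-phase database of imposture recordings, and the function name is illustrative.

```python
import statistics

def z_norm(score: float, impostor_scores: list[float]) -> float:
    """Normalize a verification score by the mean and standard
    deviation of impostor scores obtained against the same
    speaker model X (the z-norm method)."""
    mu = statistics.mean(impostor_scores)       # estimate of the impostor mean
    sigma = statistics.stdev(impostor_scores)   # estimate of the impostor std dev
    return (score - mu) / sigma

# A score equal to the impostor mean normalizes to 0; scores above
# the impostor distribution normalize to positive values.
normalized = z_norm(4.0, [0.0, 1.0, 2.0, 3.0, 4.0])
```

After normalization, the score distributions for different speaker models share a common scale, so a single decision threshold can serve all models.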
Providing the necessary database of recordings of speakers considered as impostors relative to the authorized speaker is conceivable if the verification of the speaker is a function of a password known to the voice recognition device. This assumes that the developer of the service application will have collected beforehand recordings of persons speaking the password in a context close to the application so that the recordings represent plausible occurrences of imposture tests. This necessary collection of recordings makes it difficult to change the password in a system with a password fixed by the device, and makes it impossible for the authorized speaker using the application to choose a password.
In the more ergonomic situation in which the user chooses the password himself during the learning phase, it is practically impossible to collect recordings of the password spoken by a set of other speakers.
Furthermore, to improve the ergonomics of some applications, during a very short learning phase known as the enrolment phase, a voiceprint of the authorized user speaker is created by generating a voice model for him.
To enrich the model, the authorized speaker voice model is adapted as and when it is used with speech recordings validated by the application or by a decision algorithm, as described in the paper “ROBUST METHODS OF UPDATING MODEL AND A PRIORI THRESHOLD IN SPEAKER VERIFICATION”, Tomoko MATSUI et al., Proceedings ICASSP, 1996, p. 97-100. If a user has been recognized, his speech recorded during the access request is used to update his model. This updating enriches the model and takes account of changes in the voice of the authorized speaker over time.
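This kind of incremental model update can be sketched with a deliberately simplified model. Real systems adapt richer statistical models (for example Gaussian mixture models); here a mean feature vector stands in for the speaker model, and all names are illustrative.

```python
def adapt_model(model_mean: list[float], model_count: int,
                new_features: list[list[float]]) -> tuple[list[float], int]:
    """Incrementally update a mean-vector speaker model with feature
    vectors from a newly validated recording (running-mean update).

    This is a stand-in for model adaptation: each validated access
    request enriches the model without retaining past recordings."""
    for feat in new_features:
        model_count += 1
        model_mean = [m + (x - m) / model_count
                      for m, x in zip(model_mean, feat)]
    return model_mean, model_count

# After a user is recognized, the features of his access request
# are folded into his model.
mean, count = adapt_model([0.0], 1, [[2.0]])
```

Because each update shifts the model, the verification score distribution shifts with it, which is what makes a fixed decision threshold progressively unsuitable, as explained below.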
Since the model is enriched, the distribution of the scores is modified and the decision threshold initially defined may become unsuited to the application. This is because the verification scores for an authorized speaker-user improve as more data is used to define the model. If the decision threshold is made relatively lax, so as not to reject too many authorized users in the initial configuration, it is also relatively permissive and allows a large number of impostors to access the application. Because the speaker voice model is enriched as and when access is requested, the distributions of the scores are modified, which can lead to a very low level of rejection of authorized speakers and a relatively high rate of acceptance of impostors, whereas modification of the decision threshold would obtain the full benefit of the enrichment of the model and would preserve a low rate of erroneous rejection combined with a low rate of acceptance of impostors.
In the paper previously cited, MATSUI et al. propose to adapt the decision threshold when the speaker model is adapted. This adaptation is therefore applied directly to the decision threshold for an expected operating point.
The adaptation of the threshold as proposed by MATSUI et al. assumes that the device has retained all of the voice recordings necessary for the learning period and for the adaptation of the speaker model in order to be able to determine a set of verification scores that will be used to estimate a decision threshold for that set. That threshold is interpolated with the old threshold to obtain the new threshold.
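The interpolation step can be sketched as a simple weighted combination of the old and re-estimated thresholds. The linear form and the weight alpha are assumptions of this sketch; they are not details given by MATSUI et al.

```python
def interpolate_threshold(old_threshold: float,
                          reestimated_threshold: float,
                          alpha: float = 0.5) -> float:
    """Combine the previous decision threshold with the threshold
    re-estimated from the retained recordings.

    alpha is a hypothetical interpolation weight: alpha = 0 keeps
    the old threshold, alpha = 1 adopts the re-estimated one."""
    return alpha * reestimated_threshold + (1.0 - alpha) * old_threshold

# Move the threshold halfway toward the value re-estimated
# after the speaker model was adapted.
new_threshold = interpolate_threshold(0.5, 0.7, alpha=0.5)
```

Note that this update operates directly on the threshold for one fixed operating point, which is the source of the drawbacks enumerated below.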
This adaptation of the threshold has the following drawbacks. Firstly, occurrences of impostor recording are necessary, which is unrealistic in some applications. Secondly, the speaker speech recordings must be retained in order to re-estimate the decision threshold, which implies a non-negligible cost in terms of memory. Finally, because re-estimation is done at the level of the decision threshold, i.e. for a required operating point, if it is required to modify the operating point for ergonomic reasons, for example, then all the interpolation parameters have to be modified.