In recent years, speaker verification technology has made significant progress. For example, U.S. Pat. No. 6,879,968 describes a speaker verification apparatus that addresses the problem of acceptance threshold estimation by means of the following procedure:
First, given a speech signal from a user claiming an identity, the distance to every client template or model in the system is determined; and then, the probability density function of these distances is estimated.
Second, a signal score is obtained, depending on whether the user's distance is above or below a given percentile of the probability density function of distances obtained with all the templates or models in the system; the apparatus then decides whether the speech signal corresponds to the claimed identity.
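The two steps above can be illustrated with a minimal sketch. It is not the implementation of U.S. Pat. No. 6,879,968: it assumes that the probability density function is approximated by the empirical distribution of the distances, that a smaller distance means a better match, and that the percentile value is chosen arbitrarily. The function names `empirical_percentile` and `verify_by_percentile` are hypothetical.

```python
from typing import Sequence


def empirical_percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0..100) of values by linear interpolation.

    Assumption: the empirical distribution stands in for the estimated
    probability density function of distances.
    """
    if not values:
        raise ValueError("at least one distance is required")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def verify_by_percentile(distances: Sequence[float],
                         claimed_index: int,
                         percentile: float = 10.0) -> bool:
    """Accept the claim if the distance to the claimed model falls at or
    below the given percentile of the distances to all n client models.

    Note that all n distances must be computed for every verification event.
    """
    threshold = empirical_percentile(distances, percentile)
    return distances[claimed_index] <= threshold


# Hypothetical distances from one utterance to five enrolled client models.
distances = [0.2, 1.5, 1.8, 2.1, 2.4]
accepted_claim_0 = verify_by_percentile(distances, claimed_index=0)
accepted_claim_3 = verify_by_percentile(distances, claimed_index=3)
```

In this sketch the decision depends only on where the claimed model's distance falls among the distances to all enrolled models; no false acceptance or false rejection rate enters the computation.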
The system described in U.S. Pat. No. 6,879,968 partly solves the problem of decision threshold estimation, but it does not explicitly consider False Acceptance (FA) or False Rejection (FR) rates. Consequently, it could accept an excessive number of impostors or reject an excessive number of clients. Furthermore, the probability density function of distances must be estimated over all the clients enrolled in the system, which implies that if “n” clients are registered, the method requires “n” evaluations of the identification distance (also termed verification distance) or probability for each identity verification event. This makes the system inefficient when “n” is large, e.g. over 100, which is indeed possible in a massive, large-scale application such as the one described in U.S. Pat. No. 6,879,968. Also, U.S. Pat. No. 6,879,968 does not provide any solution to the problem of limited enrollment or verification data. In comparison, the current invention estimates the acceptance/rejection distance threshold using the desired false acceptance (FA) and false rejection (FR) rates as references according to the application. Of course, only one of FA and FR can be set independently; the other is then determined by the resulting threshold.
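The relationship between a target error rate and the distance threshold can be sketched as follows. This is only an illustration under stated assumptions, not the estimation procedure of the invention: it assumes that sample sets of impostor and client distances are available, that a claim is accepted when its distance is strictly below the threshold, and that the target FA rate is the one fixed by the application. The function names and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def threshold_for_target_fa(impostor_distances: Sequence[float],
                            target_fa: float) -> float:
    """Return a threshold whose empirical FA rate does not exceed target_fa.

    Assumption: a claim is accepted when distance < threshold.
    """
    ordered = sorted(impostor_distances)
    # Number of impostor trials that may be accepted at the target rate.
    k = int(target_fa * len(ordered))
    if k >= len(ordered):
        return float("inf")
    # At most k impostor distances lie strictly below ordered[k].
    return ordered[k]


def error_rates(client_distances: Sequence[float],
                impostor_distances: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return the empirical (FA, FR) rates for a given threshold."""
    fa = sum(d < threshold for d in impostor_distances) / len(impostor_distances)
    fr = sum(d >= threshold for d in client_distances) / len(client_distances)
    return fa, fr


# Hypothetical sample distances, for illustration only.
impostors = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
clients = [0.5, 1.5, 2.5, 3.5]

threshold = threshold_for_target_fa(impostors, target_fa=0.2)
fa, fr = error_rates(clients, impostors, threshold)
```

Once the threshold has been fixed by the target FA rate, the FR rate is whatever the client distances yield at that threshold, which is why only one of the two rates can be chosen freely.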
Two key factors that prevent the deployment of speaker verification in large-scale applications are the requirements of long enrollment sessions and long verification sentences to guarantee low error rates (Barras, Meigner and Gauvain, 2004) (Mariethoz & Bengio, 2000). Neither requirement is compatible with large-scale applications on the telephone network, because both reduce the usability of the service and lead to a high traffic load and a high rate of blocked calls.
U.S. Pat. No. 6,119,084 describes a speaker verification apparatus and a validation method wherein the user must pronounce one or more sentences to validate his identity; if one of the signals is sufficiently similar to the template or model of the client whose identity is being validated, the system adapts that template in an attempt to capture the variations of the individual's voice over time. In contrast, if the system does not validate the user's identity, an alternative access control mechanism is used. The disadvantages of this system are therefore that the subject is forced to pronounce too many sentences and is forced to resort to an alternative access control scheme if the voice verification system does not validate the user's identity. Moreover, when one of the voice inputs is sufficiently similar to the client's template, the template is adapted to capture the subject's variations; this procedure is called “unsupervised adaptation” and refers to an adaptation process without human assistance. Consequently, if an error occurs in the adaptation procedure, it will be propagated and will result in a less reliable voice validation system. It is worth highlighting that supervised adaptation is not feasible in the context of a large-scale application like the one described here.