Speaker recognition systems have two different applications. They can be used for speaker verification, in which the system confirms or rejects that the person speaking is the specified person. In this case, two voice prints are compared. The other application is speaker identification, in which the system decides which of a number of persons whose voice prints are known to it the speaker corresponds to. In systems used for speaker identification, the speaker may not be included in the set of known persons (open set), or the system may be operated in such a way that the speaker is always in the set of persons known to the system (closed set). Usually, such speaker recognition systems comprise, for every speaker enrolled in the system, a speaker model describing the voice print of the speaker (the voice print comprising features typical for the speaker).
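The difference between verification and open-set or closed-set identification can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: voice prints are taken to be plain feature vectors, the comparison is a cosine similarity, and the names `cosine_score`, `verify`, `identify` and the threshold values are hypothetical and not taken from any particular system.

```python
import math


def cosine_score(a, b):
    """Similarity between two voice prints (assumed here to be feature vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(test_print, claimed_print, threshold=0.8):
    """Speaker verification: compare two voice prints, then accept or reject."""
    return cosine_score(test_print, claimed_print) >= threshold


def identify(test_print, enrolled, threshold=0.8, open_set=True):
    """Speaker identification against a set of enrolled speaker models.

    Closed set: always return the best-matching enrolled speaker.
    Open set: return None if even the best match scores below the threshold.
    """
    best_name, best_score = max(
        ((name, cosine_score(test_print, model)) for name, model in enrolled.items()),
        key=lambda item: item[1],
    )
    if open_set and best_score < threshold:
        return None
    return best_name


# Hypothetical enrolled speaker models (toy two-dimensional voice prints).
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}

accepted = verify([0.9, 0.1], enrolled["alice"])
closed_result = identify([0.6, 0.8], enrolled, open_set=False)
open_result = identify([0.6, 0.8], enrolled, threshold=0.9, open_set=True)
```

In this sketch the only difference between the open-set and closed-set modes is whether a best match below the threshold is rejected as an unknown speaker.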
In current speaker recognition systems, it may be difficult to determine whether the recognition system provides reliable decisions. In particular, in noisy environments or in case of channel mismatch (a channel being everything between the person speaking and the recording medium), current speaker recognition systems may provide unreliable results. Such a channel mismatch may, for example, occur if a voice signal is transmitted in a manner that is not known to the system and has not been used for training.
Several attempts to overcome these problems have been made. Examples are the publication by M. C. Huggins and J. J. Grieco, “Confidence Metrics for Speaker Identification”, published in the 7th ICSLP, Denver, Colo., 2002, and the document “Using Quality Measures for Multilevel Speaker Recognition”, Computer Speech and Language, 2006; 20(2-3):192-209, by D. García-Romero et al. Further attempts have been made by W. M. Campbell et al. in the document “Estimating and Evaluating Confidence for Forensic Speaker Recognition”, ICASSP 2005; 717-720, by Y. Solewicz and M. Koppel in “Considering Speech Quality in Speaker Verification Fusion”, Interspeech 2005, and by J. Richiardi et al. in two documents: “A Probabilistic Measure of Modality Reliability in Speaker Verification”, ICASSP, 2005, and “Confidence and Reliability Measures in Speaker Verification”, published in the Journal of the Franklin Institute 2006; 343 (6): 574-595.
In some of these approaches, Bayesian Networks (BN) are used. One document which may help to understand Bayesian Networks is, for example, “Pattern Recognition and Machine Learning” by C. Bishop, published by Springer Science and Business Media, LLC, 2006.
A Bayesian Network is a probabilistic graphical model representing a set of (random) variables and their conditional dependencies. Its nodes may represent one or more of observed and/or hidden variables and/or hypotheses and/or deterministic parameters.
A dependency of one variable on another is represented in a Bayesian Network by an arrow pointing from the variable that is depended on (the parent variable) to the dependent variable (the child variable).
Such a network may be trained. With such a (trained) network, given a set of known (observed) parameters, the probability of a hidden variable may be estimated.
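As an illustration of how the probability of a hidden variable may be estimated from observed variables, the following minimal Python sketch evaluates a very small Bayesian Network by enumeration. Everything specific in it is an assumption made for illustration: the hidden variable `reliability`, the two observed child variables `snr` and `score`, and all probability values are hypothetical and hand-set, whereas in a trained network they would be learned from data.

```python
# Hypothetical network: reliability -> snr, reliability -> score.
# The hidden parent variable has a prior; each observed child variable has a
# conditional probability table (CPT) given the parent.
prior = {"reliable": 0.7, "unreliable": 0.3}

cpts = {
    "snr": {
        "reliable": {"high": 0.8, "low": 0.2},
        "unreliable": {"high": 0.3, "low": 0.7},
    },
    "score": {
        "reliable": {"confident": 0.9, "borderline": 0.1},
        "unreliable": {"confident": 0.4, "borderline": 0.6},
    },
}


def posterior(observations):
    """Return P(hidden | observations) by enumerating the hidden states.

    `observations` maps an observed variable name to its observed value.
    Unobserved child variables are simply left out (they marginalize to 1).
    """
    joint = {}
    for state, p in prior.items():
        for variable, value in observations.items():
            p *= cpts[variable][state][value]
        joint[state] = p
    normalizer = sum(joint.values())
    return {state: p / normalizer for state, p in joint.items()}


# Observing a low signal-to-noise ratio lowers the belief in reliability.
belief = posterior({"snr": "low"})
```

With these hand-set numbers, the joint terms are 0.7 × 0.2 = 0.14 for `reliable` and 0.3 × 0.7 = 0.21 for `unreliable`, so the posterior probability of `reliable` drops from the prior 0.7 to 0.14 / 0.35 = 0.4.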
Previous work on reliability based on Bayesian Networks may have the disadvantage that the parameters of the Bayesian Network depend on the speaker recognition threshold (working point), as for example in the publication by Richiardi et al. at ICASSP '05. In that case, a modification of the working point would require a complete retraining of the Bayesian Network.
Further problems which may be present in the prior art are, for example, the fact that signal degradation may affect the reliability of a trial differently depending on whether the trial is target or non-target, and/or that clean and degraded realizations of the same utterances (also called stereo data) may be needed for the training process. In particular, this may mean that to train prior art systems it may be necessary to have the training utterances as signals both with and without distortions, for example, distortions caused by channels, speaker stress, data quality, convolution, added noise or other influences that degrade data. Such data are not always easy to obtain, and sometimes the correlation between the reliability and the signal distortion is unknown.
Finally, the prior art has shown that the reliability of a trial (a comparison between one testing audio and one speaker model) is more closely related to the signal quality of both the testing audio(s) and the model audio(s) than to the individual signal quality of the testing audio(s) or the model audio(s) alone. A speaker model as used in this text is usually built by a speaker recognition system using one, two, three or more model audios.