Recognition of the human voice is currently employed in a multitude of mobile devices such as cell phones, personal digital assistants (PDAs) and wearables (MP3 players, watch phones, etc.). A very important criterion for user acceptance of a voice recognition system is the rejection of words which are not contained in the recognizer vocabulary (out-of-vocabulary rejection, OOV rejection).
The underlying technical method used here for speaker-independent voice recognition involves the classification of recognition results into categories for reliability of recognition, e.g. [reliably recognized, unreliably recognized, not in the vocabulary]. Typically the number and names of these categories vary depending on the voice recognition technology used, as does the way they are handled in the voice application. For example, it is conceivable that a voice application sends the user a query if words are not reliably recognized. The problem for a voice recognition system is therefore to allocate each recognition result to one of the above-mentioned categories as precisely and with as few errors as possible.
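As an illustration of how a voice application might handle the three example categories, the following is a minimal sketch. The names `Category` and `handle_result`, as well as the returned action strings, are hypothetical and chosen only for this example; an actual application may use different categories and responses.

```python
from enum import Enum, auto


class Category(Enum):
    """Example reliability categories named in the text."""
    RELIABLY_RECOGNIZED = auto()
    UNRELIABLY_RECOGNIZED = auto()
    NOT_IN_VOCABULARY = auto()


def handle_result(word: str, category: Category) -> str:
    """Return a hypothetical application action for a recognition result."""
    if category is Category.RELIABLY_RECOGNIZED:
        # Act on the recognized word directly.
        return f"execute: {word}"
    if category is Category.UNRELIABLY_RECOGNIZED:
        # Send the user a query, as suggested in the text.
        return f"query: Did you say '{word}'?"
    # Out-of-vocabulary: reject the utterance.
    return "reject: word not in vocabulary"
```

For instance, `handle_result("Miller", Category.UNRELIABLY_RECOGNIZED)` produces a confirmation query instead of acting on the word.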
Generally the basis for allocation to categories when classifying recognition results is what is known as a confidence measure, which the voice recognition system calculates for each recognition result. The literature provides a multitude of algorithms for calculating this measure. What matters is the framework in which suitable confidence measure threshold values are determined, since these threshold values define the above-mentioned categories for reliability of recognition. It should be noted that a well-chosen threshold depends not only on the language and the modeling used (e.g. Hidden Markov Model) but also on the speaker and the recognizer vocabulary.
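The way threshold values partition the confidence measure into categories can be sketched as follows. This is only an illustration under stated assumptions: the confidence measure is assumed to lie in the range 0 to 1 with higher values meaning more reliable recognition, and the names `Reliability`, `ConfidenceThresholds` and `classify` as well as the numeric threshold values are hypothetical, not taken from any particular recognizer.

```python
from dataclasses import dataclass
from enum import Enum


class Reliability(Enum):
    RELIABLE = "reliably recognized"
    UNRELIABLE = "unreliably recognized"
    OUT_OF_VOCABULARY = "not in the vocabulary"


@dataclass(frozen=True)
class ConfidenceThresholds:
    # Hypothetical values; suitable thresholds depend on the language,
    # the modeling, the speaker and the recognizer vocabulary.
    reject_below: float = 0.3
    accept_from: float = 0.7

    def __post_init__(self) -> None:
        if self.reject_below > self.accept_from:
            raise ValueError("reject_below must not exceed accept_from")


def classify(confidence: float, thresholds: ConfidenceThresholds) -> Reliability:
    """Map a confidence measure to a reliability category."""
    if confidence >= thresholds.accept_from:
        return Reliability.RELIABLE
    if confidence >= thresholds.reject_below:
        return Reliability.UNRELIABLE
    return Reliability.OUT_OF_VOCABULARY
```

With the assumed defaults, a confidence of 0.5 falls between the two thresholds and is classified as unreliably recognized. Changing the thresholds changes the category boundaries, which is why their determination matters.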
Previously proposed solutions have been based on the costly, critical and not always suitable a priori determination of confidence thresholds on the basis of databases in the laboratory. These are explained below for three types of voice recognition:
a) Speaker-independent (SI) voice recognition
Speaker-independent recognition is based for example on Hidden Markov modeling. It offers convenience for the user, since no special training (preliminary speech, enrollment) for the words to be recognized is required. However, the vocabulary to be recognized must be known a priori. Typically in phoneme-based voice recognition systems this is phonetic or graphemic information about the words to be recognized. There are standard methods for converting the graphemes of a word, i.e. its written form, into its phonetic form, which is the form required by the voice recognition system. Various methods exist for determining confidence threshold values, either at vocabulary level or at word level. These methods are based on analyzing the (in this case known) information about the (phonetic) word modeling.
b) Speaker-dependent (SD) voice recognition
An example of speaker-dependent recognition is directory name selection for a cell phone. The names from the telephone directory are typically trained beforehand on a speaker-dependent basis (SD enrollment). Based on the spoken form of a word an acoustic model is generated for recognition. The standard methods of speaker-independent recognition do not apply here, and the thresholds for SI recognition are not transferable. Moreover there is a strong reliance on the chosen method of speaker-specific word modeling. Pre-set confidence measure threshold values for speaker-dependent vocabularies are typically not adapted to a speaker or a vocabulary and thus are per se less than optimal. It may even be the case that they cannot be used at all.
The known proposed solutions also include the not very desirable situation in which the user exerts direct influence on the thresholds, i.e. the user is forced to set the ‘severity’ of rejection in the recognition system personally.
c) Speaker-adaptive (SA) voice recognition
This is a hybrid form of speaker-independent and speaker-dependent recognition: speaker-independent modeling of a word or vocabulary is adapted to a speaker by adaptive training. The aim is to improve the recognition rate by capturing speaker-specific characteristics. Depending on the recognition technology used, adaptation to a speaker can be at phoneme level or word level. As in the SD case, no solutions are known for taking account of the effect of the additional training/adaptation process on the confidence threshold.