Automatic, or computer-based, speech recognition is a process by which human speech signals are converted to a sequence of words in digital form using an algorithm implemented in a computer program. Generally, these algorithms are based upon mathematical and/or statistical models for identifying and transcribing words from digitized speech signals.
An important ingredient of virtually any speech recognition application is the speech recognition grammar of the particular application. Many speech recognition applications utilize relatively straightforward speech recognition grammars having limited vocabularies comprising words and/or phrases common to a particular application. For example, a speech recognition application for handling calls pertaining to a banking customer's account might well employ a speech recognition grammar containing entries such as “checking,” “savings,” “loan,” and “money market.”
Other speech recognition grammars are utilized for a more specialized purpose, namely, that of gathering information to establish the identity of the speaker. It is worth noting that applications that utilize such grammars are distinct from those that perform voice matching (also known as voice identification, voice verification, and voice biometrics), in that speech recognition identification typically entails only the more standard recognition technologies for converting speech to text.
Speech recognition applications for establishing a speaker's identity are generally based on one of two distinct approaches. According to the first approach, a speech recognition grammar comprises a single entry, the target entry, which is a particular piece of information that the speaker is expected to know and enunciate to the speech recognition system. For example, a call processing system may prompt a caller for the caller's account number and then, having ascertained the caller's surname from the account information, request that the caller speak his or her surname. Thus, the caller's identity can be confirmed if the spoken surname matches that known to the system and corresponding to the account information. At this point during the procedure, the active speech recognition grammar utilized by the system comprises only a single entry: the surname corresponding to the account information. Since the speech recognition application is designed to optimally match, in a probabilistic sense, a speech utterance to a grammar entry, the confidence level can be set to provide the best trade-off between false acceptance of an incorrect utterance and false rejection of a correct utterance.
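By way of illustration only, the single-entry approach can be sketched in a few lines of Python. The sketch is a toy model, not an actual recognizer: it operates on text rather than audio, and it assumes that string similarity from the standard-library difflib module can stand in for the probabilistic match score a real speech recognition engine would produce. The function names match_confidence and verify_single_entry, and the threshold values, are hypothetical.

```python
from difflib import SequenceMatcher


def match_confidence(utterance: str, entry: str) -> float:
    """Toy stand-in for a recognizer's confidence score, in [0, 1].

    Assumption: text similarity approximates the acoustic match score
    that a real speech recognition engine would compute.
    """
    return SequenceMatcher(None, utterance.lower(), entry.lower()).ratio()


def verify_single_entry(utterance: str, target: str, threshold: float = 0.8) -> bool:
    """Accept the utterance only if it matches the sole grammar entry
    (the target) with a confidence at or above the threshold."""
    return match_confidence(utterance, target) >= threshold


if __name__ == "__main__":
    target = "Johnson"  # surname on file for the account (hypothetical)
    for spoken in ("Johnson", "Johnston", "Smith"):
        for threshold in (0.8, 0.95):
            accepted = verify_single_entry(spoken, target, threshold)
            print(f"{spoken!r} at threshold {threshold}: accepted={accepted}")
```

In this toy model, a lower threshold admits the near-miss "Johnston" (a false acceptance), whereas a higher threshold rejects it at the cost of also rejecting more imperfectly recognized correct utterances. This is the trade-off that the confidence level is chosen to balance.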
A modification of this approach entails adding one or more so-called distractors to the grammar. A distractor is an incorrect response selection. In the context of speech recognition, a distractor adds to the speech recognition grammar a concrete alternative that is distinct from a target entry, thereby increasing the likelihood that the speech recognition system will correctly reject an utterance that does not in fact match the target entry. Conventional speech recognition grammars that utilize distractors typically include a single, static set of distractors. The effectiveness of the distractors generally varies, however, depending on the degree of dissimilarity between the distractors and a particular target entry.
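The effect of distractors can likewise be sketched in Python, again as a toy model under stated assumptions rather than as an actual recognizer. Text similarity from the standard-library difflib module is assumed to stand in for a real engine's match score, and the names similarity, recognize, and verify_with_distractors, as well as the example surnames, are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for a recognizer's match score, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def recognize(utterance: str, grammar: list[str]) -> tuple[str, float]:
    """Return the best-matching grammar entry and its score."""
    scored = ((entry, similarity(utterance, entry)) for entry in grammar)
    return max(scored, key=lambda pair: pair[1])


def verify_with_distractors(
    utterance: str,
    target: str,
    distractors: list[str],
    threshold: float = 0.8,
) -> bool:
    """Accept only if the target beats every distractor and also
    clears the confidence threshold."""
    best_entry, score = recognize(utterance, [target, *distractors])
    return best_entry == target and score >= threshold


if __name__ == "__main__":
    target = "Johnson"
    spoken = "Johnston"  # an incorrect utterance that resembles the target
    for distractors in ([], ["Johnston", "Jackson"], ["Smith", "Garcia"]):
        accepted = verify_with_distractors(spoken, target, distractors)
        print(f"distractors={distractors}: accepted={accepted}")
```

In this sketch, the incorrect utterance is rejected only when the grammar contains a distractor that matches it better than the target does. The same static set of distractors may therefore help with one target entry and do nothing for another.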