Speech recognition systems have simplified many tasks, particularly for a user in the workplace, by permitting the user to perform hands-free communication with a computer as a convenient alternative to communication via conventional peripheral input/output devices. For example, a user could wear a wireless wearable terminal having a speech recognition system that permits communication between the user and a central computer system so that the user can receive work assignments and instructions from the central computer system. The user could also communicate to the central computer system information such as data entries, questions, work progress reports, and work condition reports. In a warehouse or inventory environment, a user can be directed (through an audio instruction from the central computer system or visually by means of a display) to a particular work area that is labeled with a multiple-digit number (check-digit) such as “1-2-3” and be asked to speak the check-digit. The user would then respond with the expected response “1-2-3”. (Note that a “check-digit” can be any word or sequence of words, and is not limited to digits.) Other examples of applications and communications in which the expected response is known are described in U.S. Patent Application No. 2003/0154075 and include environments where a wearable or portable terminal is not required, such as in an automobile or a telephone system; environments that are not in a warehouse, such as a pharmacy, retail store, or office; voice-controlled information processing systems that process, for example, credit card numbers, bank account numbers, social security numbers and personal identification numbers; other applications such as command and control, dictation, data entry and information retrieval applications; and speech recognition system features such as user verification, password verification, quantity verification, and repeat/acknowledge messages.
The inventions presented here can be used in those applications. In using a speech recognition system, manual data entry is eliminated or at least reduced, and users can perform their tasks faster, more accurately, and more productively.
A speech recognition system can make errors, however, due to, for example, background noise or a user's unfamiliarity with or misuse of the system. The errors made by a system can be classified into various types. A metric, the word error rate, is often used to evaluate the number and types of errors made by a speech recognition system and is thus useful in evaluating the performance of the system. The word error rate can be defined as the percentage or ratio of speech recognition errors over the number of words input to the system, and it can be determined over a window of time and/or data and per user. In general, a word error rate can be determined for a word or for various words among a set of words, or for a user or multiple users. A system's errors can be identified by comparing a reference transcription of a user's input speech to the hypothesis generated by the system (the system's interpretation of the user's input speech). Furthermore, as known to those skilled in the art, the comparison can be performed in a time-aligned mode or in a text-aligned mode.
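As a minimal sketch of the ratio defined above, the following assumes a text-aligned comparison based on word-level edit distance (a common choice, not one prescribed by this description). The function name `word_error_rate` and the use of space-separated words are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Return the minimum number of word edits divided by the reference length.

    Assumption: both arguments are strings of space-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every remaining hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    return dist[-1][-1] / len(ref)
```

For instance, `word_error_rate("1 2 3", "1 5 3")` counts one error over three reference words, giving a ratio of 1/3.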
One type of speech recognition error is a substitution, in which the speech recognition system's hypothesis replaces a word that is in the reference transcription with an incorrect word. For example, if the system recognizes “1-5-3” in response to the user's input speech “1-2-3”, the system made one substitution: substituting the ‘5’ for the ‘2’.
Another type of speech recognition error is a deletion, in which the speech recognition system's hypothesis lacks a word that is in the reference transcription. For example, if the system recognizes “1-3” in response to the user's input speech “1-2-3”, the system deleted one word, the ‘2’. There are many types of deletion errors. One variation of the deletion error is a deletion due to recognizing garbage, in which the system erroneously recognizes a garbage model instead of recognizing an actual word. Another variation of the deletion error is a deletion due to a speech misdetection, where the system fails to detect that the audio input to the system contains speech and as a result does not submit features of the audio input to the system's search algorithm. Another type of deletion occurs when the system rejects a correct recognition due to a low confidence score. Yet another variation of the deletion error is a deletion due to a rejected substitution, where a search algorithm of the speech recognition system generates a substitution which is later rejected by an acceptance algorithm of the system. Still another type of deletion, occurring in time-aligned comparisons, is a merge, in which the speech recognition system recognizes two spoken words as one. For example, the user says “four two” and the system outputs “forty”.
In this application, a garbage model refers to the general class of models for sounds that do not convey information. Examples may include models of breath noises, “um”, “uh”, sniffles, wind noise, the sound of a pallet dropping, the sound of a car door slamming, or another general model such as a wildcard. (A wildcard is intended to match any input audio that does not match a model in the library of models.)
Yet another type of speech recognition error is an insertion, in which the speech recognition system's hypothesis includes a word (or symbol) that does not correspond to any word in the reference transcription. Insertion errors often occur when the system generates two symbols that correspond to one symbol. One of these symbols may correspond to the reference transcription and be tagged as a correct recognition. If it does not correspond to the reference transcription, it can be tagged as a substitution error. In either case, the other symbol can be tagged as an insertion error. Insertion errors are also common when noise is mistakenly recognized as speech.
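The three error types described above can be illustrated with a minimal sketch that aligns a reference transcription with a hypothesis and tags each aligned position. It assumes a text-aligned comparison using word-level edit distance; the function name `count_errors` and the tie-breaking order (diagonal first, then deletion, then insertion) are assumptions made for illustration, and the sketch does not model the finer deletion variations (garbage, misdetection, rejection, merge).

```python
def count_errors(reference, hypothesis):
    """Count correct words, substitutions, deletions and insertions.

    Assumption: `reference` and `hypothesis` are sequences of words.
    """
    ref, hyp = list(reference), list(hypothesis)
    n, m = len(ref), len(hyp)

    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Walk back from the end of both sequences, tagging each step.
    counts = {"correct": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            counts["correct" if same else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1  # reference word missing from hypothesis
            i -= 1
        else:
            counts["insertion"] += 1  # hypothesis word with no reference word
            j -= 1
    return counts
```

With the reference `["1", "2", "3"]`, the hypothesis `["1", "5", "3"]` yields one substitution, `["1", "3"]` yields one deletion, and `["1", "2", "2", "3"]` yields one insertion.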
In contrast to determining that an actual error occurred by comparing a system's hypothesis to words actually spoken in a reference transcript, an error can be estimated or deemed to have occurred based on system behavior and user behavior. Accordingly, one can estimate or evaluate the performance level of the speech recognition system by detecting in this manner the various errors committed by the system. One way to detect a speech recognition error is based on feedback a user provides to the speech recognition system. Feedback can be requested by the speech recognition system. For example, the system could ask the user to confirm the system's hypothesis by asking, for example, “Did you say 1-5-3?”; if the user responds “no”, the response indicates that the system made an error recognizing “1-5-3”. Another type of feedback is based on a user's emotion detected by speech recognition. For example, if the system recognizes in the user's input speech that the user is sighing or saying words indicating aggravation, it may indicate that an error occurred. Yet another type of feedback is based on a user's correction command to the system, such as the user speaking “back-up” or “erase”, or the user identifying what word was spoken (which could be from a list of possible words displayed by the system). When the user issues a correction command to the system, it may indicate that an error occurred.
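A minimal sketch of estimating an error rate from user feedback, without a reference transcript, might look like the following. The event labels (`"confirm_yes"`, `"confirm_no"`, `"correction"`, `"none"`) and the function name `estimate_error_rate` are hypothetical, and treating every negative confirmation or correction command as exactly one error is a simplifying assumption.

```python
# Hypothetical feedback labels that are deemed to indicate a recognition error.
ERROR_SIGNALS = {"confirm_no", "correction"}


def estimate_error_rate(events):
    """Estimate the fraction of recognition attempts deemed erroneous.

    Assumption: `events` holds one feedback label per recognition attempt,
    e.g. "confirm_yes", "confirm_no", "correction" or "none".
    Returns None when there is no data to estimate from.
    """
    if not events:
        return None
    errors = sum(1 for event in events if event in ERROR_SIGNALS)
    return errors / len(events)
```

For example, four attempts of which one drew a “no” and one drew a correction command give an estimate of 0.5.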
A speech recognition system can improve its performance over time, as more speech samples are received and processed by the system, by improving its acoustic models through training or other learning or adaptation algorithms. At the same time, it is useful to prevent the system from adapting in an undesirable way, thereby resulting in a system that performs worse than it did prior to adaptation or a system that degrades over time. Avoiding additional processing by a speech recognition system due to adaptation of acoustic models is useful in many applications, particularly those employing a battery powered mobile computer, wireless network, and server to store models. Adapting models can use significant computational resources to create the adapted models and radio transmission energy to transmit the new models to the server. Example embodiments of the invention disclosed herein can control the rate of adaptation of the speech recognition system to avoid inefficient use of computational, storage and/or power resources and to avoid adapting away from well-performing models. Example embodiments of the invention control adaptation by using triggers, which are based on an error rate determination (which may be based on an error rate estimation), to cause the adaptation of prior models or the creation of new models. The invention also discloses methods by which recognition error rates can be estimated.
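One way an error-rate-based trigger could be sketched is shown below. This is an illustration under stated assumptions, not the embodiments themselves: the class name `AdaptationTrigger`, the sliding window of recent outcomes, the default threshold value, and the rule that adaptation is triggered only when the windowed error rate exceeds the threshold are all hypothetical.

```python
from collections import deque


class AdaptationTrigger:
    """Decide whether model adaptation should run, based on recent errors.

    Assumptions: outcomes are tracked over a fixed-size sliding window, and
    adaptation is warranted only when the windowed error rate exceeds
    `threshold`; both parameters are illustrative.
    """

    def __init__(self, threshold=0.05, window=100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # oldest outcome drops off first

    def record(self, was_error):
        """Record whether the latest recognition was (deemed) an error."""
        self.outcomes.append(bool(was_error))

    def error_rate(self):
        """Return the error rate over the current window (0.0 if empty)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_adapt(self):
        """Trigger only on a full window whose error rate exceeds the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

Waiting for a full window before triggering keeps a few early errors from causing adaptation, which is in the spirit of avoiding unnecessary processing and avoiding adaptation away from well-performing models.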