One illustrative application of speech recognition technology is in the workplace. Speech recognition systems have simplified many tasks, particularly for a user in the workplace, by permitting the user to perform hands-free communication with a computer as a convenient alternative to communication via conventional peripheral input/output devices. For example, a warehouse or inventory worker could wear a wireless wearable terminal having a speech recognition system that permits communication between the user and a central computer system so that the user can receive work assignments and instructions from the central computer system. The user could also communicate to the central computer system information such as data entries, questions, work progress reports and work condition reports. In a warehouse or inventory environment, a user can be directed (through an instruction from the central computer system or visually by means of a display) to a particular work area that is labeled with a multiple-digit number (check-digit) such as “1-2-3” and asked to speak the check-digit. The user would then respond with the expected response “1-2-3”. (Note that a “check-digit” can be any word or sequence of words, and is not limited to digits.)
Other such examples of communication between a user and speech recognition system are described in U.S. Patent Application No. 2003/0154075 and include environments where a wearable or portable terminal is not required such as in an automobile or a telephone system; environments that are not in a warehouse such as in a managed care home, nursing home, pharmacy, retail store, and office; voice-controlled information processing systems that process, for example, credit card numbers, bank account numbers, social security numbers and personal identification numbers; other applications such as command and control, dictation, data entry and information retrieval applications; and speech recognition system features such as user verification, password verification, quantity verification, and repeat/acknowledge messages. The inventions presented here can be used in those applications. In using a speech recognition system, manual data entry is eliminated or at the least reduced, and users can perform their tasks faster, more accurately and more productively.
Example Speech Recognition Errors
A speech recognition system can make errors, however, due to, for example, background noise or a user's unfamiliarity with or misuse of the system. The errors made by a system can be classified into various types. A metric, an error rate, is often used to evaluate the number and types of errors made by a speech recognition system and is thus useful in evaluating the performance of the system. The error rate can be defined as the percentage or ratio of observations with speech recognition errors over the number of observations of the system, and it can be determined over a window of time and/or data and per user. An observation can be defined as any speech unit by which speech recognition may be measured. An observation may be a syllable, a phoneme, a single word or multiple words (such as in a phrase, utterance or sentence). When counting the number of observations of the system, the observations input to the system may be counted or the observations output by the system may be counted. One skilled in the art will also know and understand that an accuracy rate (which can be defined as the percentage or ratio of correct observations of the system over the number of observations of the system and which can be determined over a window of time and/or data and per user) can be used to evaluate the performance of the system. Recognition rates can be defined in a variety of other ways, such as a count of observations with errors divided by a length of time, a count of correct observations divided by a period of time, a count of observations with errors divided by a number of transactions, a count of correct observations divided by a number of transactions, a count of observations with errors after an event has occurred (such as apparatus being powered on or a user starting a task), or a count of correct observations after an event has occurred, to name a few.
Therefore, a recognition rate (which can be an error rate, an accuracy rate, a rate based upon the identification or counting of observations with errors or correct observations, or other type of recognition rate known to those skilled in the art) is useful in evaluating the performance of the system. In general, a recognition rate can be determined for a word or for various words among a set of words, or for a user or multiple users. Identification of a system's errors can be done by comparing a reference transcription of a user's input speech to the hypothesis generated by the system (the system's interpretation of the user's input speech). Furthermore, as known to those skilled in the art, the comparison can be time-aligned or text-aligned.
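As an illustration of the rate definitions above, the following is a minimal sketch, not an implementation taken from this application. It assumes each observation has already been tagged as correct or erroneous; the names Observation, error_rate, accuracy_rate and error_rate_per_user, and the choice of a window measured in a number of most recent observations, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    """One tagged speech unit (hypothetical record layout)."""
    user: str      # identifier of the user who spoke
    correct: bool  # True if the observation was recognized correctly


def error_rate(observations: List[Observation]) -> float:
    """Ratio of observations with errors over the number of observations."""
    if not observations:
        raise ValueError("at least one observation is required")
    errors = sum(1 for obs in observations if not obs.correct)
    return errors / len(observations)


def accuracy_rate(observations: List[Observation]) -> float:
    """Ratio of correct observations over the number of observations."""
    return 1.0 - error_rate(observations)


def error_rate_per_user(
    observations: List[Observation], window: Optional[int] = None
) -> Dict[str, float]:
    """Error rate per user, optionally over each user's last `window` observations."""
    by_user: Dict[str, List[Observation]] = defaultdict(list)
    for obs in observations:
        by_user[obs.user].append(obs)
    return {
        user: error_rate(items[-window:] if window else items)
        for user, items in by_user.items()
    }
```

A window over time, or a rate per transaction or per event, could be obtained in the same way by changing how the observations are selected before the ratio is taken.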
One type of speech recognition error is a substitution, in which the speech recognition system's hypothesis replaces a word that is in the reference transcription with an incorrect word. For example, if the system recognizes “1-5-3” in response to the user's input speech “1-2-3”, the system made one substitution: substituting the ‘5’ for the ‘2’.
Another type of speech recognition error is a deletion, in which the speech recognition system's hypothesis lacks a word that is in the reference transcription. For example, if the system recognizes “1-3” in response to the user's input speech “1-2-3”, the system deleted one word, the ‘2’. One variation of the deletion error is a deletion due to recognizing garbage, in which the system erroneously recognizes a garbage model instead of recognizing an actual word. Another variation of the deletion error is a deletion due to a speech misdetection, where the system fails to detect that the audio input to the system contains speech and as a result does not submit features of the audio input to the system's search algorithm. Another type of deletion occurs when the system rejects a correct observation due to a low confidence score. Yet another variation of the deletion error is a deletion due to a rejected substitution, where a search algorithm of the speech recognition system generates a substitution, which is later rejected by an acceptance algorithm of the system. Still another type of deletion, occurring in time-aligned comparisons, is a merge: the speech recognition system recognizes two spoken words as one. For example, the user says “four-two” and the system outputs “forty”.
In this application, a garbage model refers to the general class of models for sounds that do not convey information. Examples include models of breath noises, “um”, “uh”, sniffles, wind noise, the sound of a pallet dropping, the sound of a car door slamming, or another general model, such as a wildcard, that is intended to match any input audio that does not match a model in the library of models.
Yet another type of speech recognition error is an insertion, in which the speech recognition system's hypothesis includes a word (or symbol) that does not correspond to any word in the reference transcription. Insertion errors often occur when the system generates two symbols that correspond to one symbol. One of these symbols may correspond to the reference transcription and be tagged as a correct observation. If it does not correspond to the reference transcription, it can be tagged as a substitution error. In either case, the other symbol can be tagged as an insertion error. Insertion errors are also common when noise is mistakenly recognized as speech.
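The substitution, deletion and insertion categories described above can be illustrated with a text-aligned comparison of a reference transcription and a hypothesis. The sketch below assumes a standard minimum-edit-distance (Levenshtein) alignment over words, which is one common way to perform such a comparison and is not necessarily the method used by any particular system. The function names align and count_errors are hypothetical, and the sketch does not model time-aligned cases such as merges or the garbage-model and rejection variations.

```python
from collections import Counter
from typing import List, Tuple


def align(reference: List[str], hypothesis: List[str]) -> List[Tuple[str, str, str]]:
    """Return (kind, reference_word, hypothesis_word) tuples for a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end of both sequences to recover the edit operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            reference[i - 1] != hypothesis[j - 1]
        ):
            same = reference[i - 1] == hypothesis[j - 1]
            ops.append(("correct" if same else "substitution",
                        reference[i - 1], hypothesis[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", reference[i - 1], ""))
            i -= 1
        else:
            ops.append(("insertion", "", hypothesis[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def count_errors(reference: List[str], hypothesis: List[str]) -> Counter:
    """Count how many aligned positions fall into each category."""
    return Counter(kind for kind, _, _ in align(reference, hypothesis))
```

With the reference “1-2-3” split into words, a hypothesis of “1-5-3” aligns as one substitution, a hypothesis of “1-3” as one deletion, and a hypothesis of “1-2-2-3” as one insertion, matching the examples in the text.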
In contrast to determining that an actual error or correct observation occurred by comparing a system's hypothesis to a reference transcript, an error or correct observation can be estimated or deemed to have occurred based on system behavior and user behavior. This application describes methods for determining a recognition rate, wherein the recognition rate is an estimate based on estimated errors or estimated correct observations deemed to have occurred after evaluating system and user behavior. Accordingly, one can estimate or evaluate the performance level of the speech recognition system by detecting in this manner the various errors committed by or correct observations of the system. One way to detect a speech recognition error is based on feedback a user provides to the speech recognition system. Feedback can be requested by the speech recognition system. For example, the system could ask the user to confirm the system's hypothesis by asking, for example, “Did you say 1-5-3?”; if the user responds “no”, this indicates that the system made an error recognizing “1-5-3”. Another type of feedback is based on a user's emotion detected by speech recognition. For example, if the system recognizes in the user's input speech that the user is sighing or saying words indicating aggravation, it may indicate that an error occurred. Yet another type of feedback is based on a user's correction command to the system, such as the user speaking “back-up” or “erase”, or the user identifying what word was spoken (which could be from a list of possible words displayed by the system). When a correction is commanded to the system, it may be that an error occurred.
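The idea of estimating an error rate from user feedback, without a reference transcript, can be sketched as follows. This is a simplified illustration under stated assumptions: every observation carries at most one feedback signal, a “no” answer to a confirmation prompt or a correction command is deemed to indicate an error, and emotion-based feedback is left out. The names Feedback, ERROR_SIGNALS and estimated_error_rate are hypothetical.

```python
from enum import Enum
from typing import Iterable


class Feedback(Enum):
    """Feedback attached to one observation (hypothetical categories)."""
    NONE = "none"              # the user gave no feedback
    CONFIRMED = "confirmed"    # the user answered "yes" to a confirmation prompt
    DENIED = "denied"          # the user answered "no" to a confirmation prompt
    CORRECTION = "correction"  # the user spoke a command such as "back-up" or "erase"


# Assumption: these signals are deemed to indicate a recognition error.
ERROR_SIGNALS = {Feedback.DENIED, Feedback.CORRECTION}


def estimated_error_rate(feedback: Iterable[Feedback]) -> float:
    """Estimate the error rate as deemed errors over the number of observations."""
    items = list(feedback)
    if not items:
        raise ValueError("at least one observation is required")
    deemed_errors = sum(1 for item in items if item in ERROR_SIGNALS)
    return deemed_errors / len(items)
```

Because a correction command only suggests that an error may have occurred, a rate computed this way is an estimate rather than a measured error rate.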
Assessing the Performance of a Speech Recognition System
Errors made by a speech recognition system for a particular user or multiple users in the same environment occur for various reasons. Environmental factors such as background noise influence the performance of a speech recognition system. Furthermore, a particular user may report a system's poor recognition accuracy when other users in the same environment do not report similar problems, for various reasons. One reason may be that the models used by the speech recognition system are not well-matched to the user's speech patterns. Another possible reason may be that the user's expectations of the system are higher than those of other users and are unrealistic. Another possible reason is that the user is being uncooperative or is tired of working and blames the system for the user's poor performance in order to get a “troubleshooting break”.
One common way to assess the situation is for a supervisor to listen in on the worker while the worker performs his or her job. However, this is a time-consuming process, and because a user may alter his or her behavior and speech patterns when being observed, this method often does not yield satisfactory results. Furthermore, this method requires the supervisor to have the expertise to assess the performance of a system and user, to know what performance is acceptable, and to know how to improve the performance. There are other methods for assessing performance, but these methods require taking a transcript of the user's speech and the output of the speech recognition system and performing an analysis.
Therefore, it is useful to provide a way for a supervisor to assess performance of a speech recognition system when the system is used by a particular user or set of users, determining if a problem exists and if so, how to correct it. Furthermore, it is useful to discriminate between actual speech recognition problems (due to, for example, environmental influences or a user not knowing how to effectively use the system) and user misbehavior. In addition, it is useful to assess the performance of a system and provide a report of this assessment without creating or manually correcting a transcription of the audio processed by the speech recognition system. It is also useful to communicate the report, for example, to the user on the portable terminal or to another person (such as a supervisor or a professional services support person) on a management console such as one at a central computer system. Further, it is useful to identify to the user or other person (such as a supervisor or professional services support person) when a system is having recognition problems and accordingly instruct the user to take corrective action to fix the recognition problems. Several such systems and methods are described in the example embodiments disclosed herein.
Model Adaptation for a Speech Recognition System
A performance assessment not only provides helpful information to a user or a supervisor; it can also be used to improve the adaptation of a speech recognition system. A speech recognition system can improve its performance over time, as more speech samples are processed by a system, by improving its acoustic models through training or other learning or adaptation algorithms. At the same time, it is useful to prevent the system from adapting in an undesirable way, thereby resulting in a system that performs worse than it did prior to adaptation or a system that degrades over time. Adapting models can use significant computational, storage, and/or power resources to create the adapted models and radio transmission energy to transmit the new models to a server. Example embodiments of the invention disclosed herein can control the adaptation of a speech recognition system to avoid inefficient use of resources and to avoid adapting away from well-performing models, by controlling or adjusting adaptation based on a performance assessment of the system.
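One simple way to picture adaptation that is controlled by a performance assessment is a gate that permits adaptation only when the assessment suggests the current models are performing poorly. The sketch below is purely illustrative and is not the control scheme of any disclosed embodiment: the Assessment record, the should_adapt function, and the threshold values of 50 observations and a 5% error rate are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """Summary of a performance assessment (hypothetical record layout)."""
    observations: int      # number of observations assessed
    estimated_errors: int  # number of observations deemed to contain errors

    @property
    def error_rate(self) -> float:
        """Estimated error rate, or 0.0 when nothing has been observed."""
        if self.observations == 0:
            return 0.0
        return self.estimated_errors / self.observations


def should_adapt(
    assessment: Assessment,
    min_observations: int = 50,     # assumed minimum evidence before deciding
    error_threshold: float = 0.05,  # assumed error rate above which adaptation is allowed
) -> bool:
    """Allow adaptation only when enough data shows the models performing poorly."""
    if assessment.observations < min_observations:
        # Too little evidence: skip adaptation to save resources.
        return False
    # Well-performing models are left alone; poorly performing ones may adapt.
    return assessment.error_rate > error_threshold
```

Under these assumed thresholds, an assessment of 2 estimated errors in 100 observations would leave the models unchanged, while 10 estimated errors in 100 observations would permit adaptation.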