1. Technical Field
The present disclosure relates to weighting in automatic speech recognition, and more specifically, to modifying weights in automatic speech recognition based on human judgments.
2. Introduction
Accuracy in Automatic Speech Recognition (ASR) technologies is commonly measured using Word Error Rate (WER). WER treats every word as equally important and every ASR error as equally bad. In practice, however, errors differ in their impact. Some errors have a sufficiently high impact to substantially impair the ability of a user to understand the message, while other errors have a low impact, such that the user can easily understand the important parts of the message despite the errors. Whether the transcript produced by ASR captures the meaning of the spoken message is far more important than the correct transcription of every word.
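To make the limitation concrete, the following minimal sketch computes WER in its conventional form, (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions in a minimum word-level edit distance and N is the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of the disclosure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: both transcripts contain exactly one substitution.
reference = "call me at five tomorrow"
wrong_time = "call me at nine tomorrow"   # the meeting time is lost
minor_slip = "call me a five tomorrow"    # the message is still understandable

print(word_error_rate(reference, wrong_time))  # 0.2
print(word_error_rate(reference, minor_slip))  # 0.2
```

Both transcripts receive the same score, although only the first one changes what the listener would do with the message.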
Determining whether the meaning has been successfully captured can require knowledge of which words matter to the listener. One common technique for instructing ASR models as to which words are important is to assign each word a saliency weight, such that salient words are those important to the user and non-salient words are less important. The trouble with this technique is that every individual user is unique, such that message content important to user A may be of little consequence to user B. Moreover, what a user considers important in one exchange may differ in another exchange. On top of these challenges, the ASR producer must generate saliency values accurate enough to be used across a broad spectrum of the populace. These and other problems make it difficult to use ASR to produce transcripts in a form that humans can readily understand.
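As a rough illustration of the saliency-weight idea described above, the sketch below charges each error according to the weight of the reference word involved rather than a uniform cost of one. It is one possible formulation under stated assumptions, not the method of this disclosure: the function name `weighted_error_rate`, the `default` weight for unlisted and inserted words, and the weight tables are all hypothetical.

```python
def weighted_error_rate(reference, hypothesis, saliency, default=1.0):
    """Edit distance in which deleting or substituting a reference word costs
    that word's saliency weight; insertions cost the default weight (assumed)."""
    ref = reference.split()
    hyp = hypothesis.split()
    weights = [saliency.get(word, default) for word in ref]
    total = sum(weights)
    if total <= 0:
        raise ValueError("reference must carry positive total saliency")

    # prev[j]: weighted cost of turning zero reference words into hyp[:j].
    prev = [j * default for j in range(len(hyp) + 1)]
    for ref_word, weight in zip(ref, weights):
        curr = [prev[0] + weight]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0.0 if ref_word == hyp_word else weight
            curr.append(min(
                prev[j] + weight,             # deletion
                curr[j - 1] + default,        # insertion
                prev[j - 1] + substitution,   # match or substitution
            ))
        prev = curr
    return prev[-1] / total


# Hypothetical weight tables for two users hearing the same message.
user_a = {"five": 5.0, "at": 0.2}      # user A cares most about the time
user_b = {"tomorrow": 5.0, "at": 0.2}  # user B cares most about the day

reference = "call me at five tomorrow"
wrong_time = "call me at nine tomorrow"
minor_slip = "call me a five tomorrow"

score_a_wrong_time = weighted_error_rate(reference, wrong_time, user_a)
score_a_minor_slip = weighted_error_rate(reference, minor_slip, user_a)
score_b_wrong_time = weighted_error_rate(reference, wrong_time, user_b)
```

With user A's table, the wrong-time transcript scores roughly 0.61 while the minor slip scores roughly 0.02, even though both contain a single substitution. With user B's table, the same wrong-time transcript scores far lower than it does for user A, which illustrates why a single set of saliency values is hard to produce for a broad population.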