Automatic Speech Recognition (“ASR”) systems convert spoken audio into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text messages), by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).
As their accuracy has improved, ASR systems have become commonplace in recent years. For example, ASR systems have found wide application in the customer service centers of companies, where middleware and other contact center solutions answer and route calls in order to decrease costs for airlines, banks, and the like. In order to accomplish this, companies such as IBM and Nuance create assets known as Interactive Voice Response (“IVR”) systems that answer the calls, then use ASR paired with Text-To-Speech (“TTS”) software to decode what the caller is saying and communicate back to the caller.
More recently, ASR systems have found application with regard to text messaging. Text messaging usually involves the input of a text message by a sender who presses letters and/or numbers associated with the sender's mobile phone. As recognized, for example, in the aforementioned, commonly-assigned U.S. patent application Ser. No. 11/697,074, it can be advantageous to make text messaging far easier for an end user by allowing the user to dictate his or her message rather than requiring the user to type it into his or her phone. In certain circumstances, such as when a user is driving a vehicle, typing a text message may not be possible and/or convenient, and may even be unsafe. On the other hand, text messages can be advantageous to a message receiver as compared to voicemail, as the receiver actually sees the message content in a written format rather than having to rely on an auditory signal.
Many other applications for speech recognition and ASR systems will be recognized as well.
Of course, the usefulness of an ASR system is generally only as good as its speech recognition accuracy. Recognition accuracy for a particular utterance can vary based on many factors including the audio fidelity of the recorded speech, correctness of the speaker's pronunciation, and the like. The contribution of these factors to a recognition failure is complex and may not be obvious to an ASR system user when a transcription error occurs. The only indication that an error has occurred may be the resulting (incorrect) transcription text.
Some ASR systems are able to provide an indication of confidence in the transcription performance. The confidence might be expressed as a number, such as a percentage on a scale of 0% to 100%. In addition, an indication of interference (background noise, etc.) may be given. However, known systems do not provide an approach whereby transcription metrics, such as metrics relating to confidence or interference, can be communicated to the user of an ASR system by graphical or audio integration into the results of the transcription, while minimizing user interface clutter and distraction.
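By way of illustration only, a numeric confidence indication of the kind described above might be represented as in the following minimal sketch. The names WordResult, confidence_percent, and annotate are hypothetical, as is the assumption that a recognizer reports a per-word score between 0.0 and 1.0; none of these is taken from any particular ASR system. The sketch shows only the plain numeric form of the metric, not any graphical or audio integration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordResult:
    """One transcribed word with a hypothetical confidence score (assumed 0.0 to 1.0)."""
    text: str
    confidence: float


def confidence_percent(word: WordResult) -> int:
    """Express the score as a whole-number percentage, clamped to the 0..100 scale."""
    clamped = min(1.0, max(0.0, word.confidence))
    return round(clamped * 100)


def annotate(words: List[WordResult]) -> str:
    """Append each word's confidence percentage in parentheses after the word."""
    return " ".join(f"{w.text} ({confidence_percent(w)}%)" for w in words)


if __name__ == "__main__":
    sample = [WordResult("call", 0.97), WordResult("home", 0.42)]
    print(annotate(sample))
```

Appending a raw number after every word, as this sketch does, is the kind of presentation that tends to clutter the transcription text.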
Additionally, when speech is transcribed to text, some natural speech elements can be lost during the transcription process. Specifically, verbal volume, or emphasis, as well as pauses between words and phrases, are difficult to render within a language model. Known systems do not provide an approach for at least partially compensating for these shortcomings by recreating these missing elements as visual cues.