Automatic Speech Recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text), by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).
As their accuracy has improved, ASR systems have become commonplace in recent years. For example, ASR systems have found wide application in the customer service centers of companies, where middleware and other contact-center solutions answer and route calls to decrease costs for airlines, banks, etc. To accomplish this, companies such as IBM and Nuance create assets known as IVR (Interactive Voice Response) systems that answer the calls, then use an ASR system paired with TTS (Text-To-Speech) software to decode what the caller is saying and communicate back to the caller.
More recently, ASR systems have found application with regard to text messaging. Text messaging usually involves the input of a text message by a sender who presses letter and/or number keys on the sender's mobile phone. As recognized, for example, in the aforementioned, commonly-assigned U.S. patent application Ser. No. 11/697,074, it can be advantageous to make text messaging far easier for an end user by allowing the user to dictate his or her message rather than requiring the user to type it into his or her phone. In certain circumstances, such as when a user is driving a vehicle, typing a text message may not be possible and/or convenient, and may even be unsafe. On the other hand, text messages can be advantageous to a message receiver as compared to voicemail, as the receiver actually sees the message content in a written format rather than having to rely on an auditory signal.
Many other applications for speech recognition and ASR systems will be recognized as well.
Currently, state-of-the-art speech transcription engines use statistical language models (“SLMs”) to transcribe free-form speech into text. This is in contrast to using finite grammars, which describe patterns of words that can be spoken by the user and received and processed by the ASR system. Finite grammars are much more limited in the phrases that the engine can recognize, but generally provide better accuracy. Current speech recognition engines allow either an SLM or a finite grammar to be active when transcribing speech from audio data, but not both at the same time.
Thus, an approach is needed in which an ASR system makes use of both an SLM, for returning results from the audio data, and finite grammars, for post-processing the text results. An approach is also needed in which custom filters are configured to detect and modify words and word groups. Such an approach permits text results to be generated and presented to a user in a format more typical of how a human would have written a text message. It will be recognized that this same principle is useful in other applications of ASR engines as well.
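By way of illustration only, the two-stage idea described above (an SLM produces free-form text, after which finite grammars and custom filters post-process that text) might be sketched as follows. The sketch assumes the SLM transcript is already available as a plain string. The digit grammar, the filter table, and the names `digits_grammar`, `apply_filters`, and `post_process` are hypothetical and are not drawn from any particular ASR engine; regular expressions merely stand in for whatever finite-grammar formalism an implementation would actually use.

```python
import re

# Hypothetical finite grammar: a run of two or more spoken digit words.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_WORD = "|".join(DIGITS)
DIGIT_RUN = re.compile(rf"\b(?:{_WORD})(?:\s+(?:{_WORD}))+\b", re.IGNORECASE)

# Hypothetical custom filters: word groups to detect and how to rewrite them.
FILTERS = [
    (re.compile(r"\bbe right back\b", re.IGNORECASE), "brb"),
    (re.compile(r"\bby the way\b", re.IGNORECASE), "btw"),
]


def digits_grammar(text: str) -> str:
    """Collapse each run of spoken digit words into a numeral string."""
    def collapse(match: re.Match) -> str:
        return "".join(DIGITS[w.lower()] for w in match.group(0).split())
    return DIGIT_RUN.sub(collapse, text)


def apply_filters(text: str) -> str:
    """Apply each custom filter in order to the text."""
    for pattern, replacement in FILTERS:
        text = pattern.sub(replacement, text)
    return text


def post_process(slm_text: str) -> str:
    """Post-process SLM output: grammar pass first, then custom filters."""
    return apply_filters(digits_grammar(slm_text))


if __name__ == "__main__":
    print(post_process("call me at five five five one two one two by the way"))
```

In this sketch a single digit word (as in "I have one dog") is left untouched, since the assumed grammar only matches runs of two or more digit words. A real system would likely need richer grammars and context-sensitive filters.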