1. Technical Field
This invention relates to the field of speech recognition, and more particularly, to detecting and decoding telegraphic speech within a speech recognition system.
2. Description of the Related Art
Speech recognition is the process by which an acoustic signal received by a microphone is converted to a set of text words, numbers, or symbols by a computer. These recognized words may then be used in a variety of computer software applications for purposes such as document preparation, data entry, and command and control. Improvements to speech recognition systems provide an important way to enhance user productivity.
Speech recognition systems can model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units referred to as phonemes. Upon receipt of the acoustic signal, the speech recognition system can analyze the acoustic signal, identify a series of acoustic models within the acoustic signal and derive a list of potential word candidates for the given series of acoustic models.
Subsequently, the speech recognition system can contextually analyze the potential word candidates using a language model as a guide. Specifically, the language model can express restrictions imposed on the manner in which words can be combined to form sentences. The language model can express the likelihood of a word appearing immediately adjacent to another word or words. Language models used within speech recognition systems typically are statistical models. A common example of a language model can be an n-gram model. In particular, the bigram and trigram models are exemplary n-gram models typically used within the art.
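By way of illustration only, a bigram model of the kind referred to above can be sketched as follows. The sketch assumes whitespace tokenization, unsmoothed maximum-likelihood estimates, and a two-sentence toy corpus; the names train_bigram and bigram_prob and the boundary markers are hypothetical and are not drawn from any particular speech recognition system.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and word pairs over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams

def bigram_prob(contexts, bigrams, prev, word):
    """Unsmoothed estimate of P(word | prev); 0.0 when prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Toy grammatical corpus (illustrative data only).
corpus = ["the boy has pushed the girl", "the girl has pushed the boy"]
contexts, bigrams = train_bigram(corpus)
print(bigram_prob(contexts, bigrams, "the", "boy"))  # prints 0.5
```

A practical language model would additionally apply smoothing so that unseen word pairs do not receive a probability of zero; that step is omitted here for brevity.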
Conventional speech recognition system language models are derived from an analysis of a grammatical training corpus of text. A grammatical training corpus contains text which reflects the ordinary grammatical manner in which human beings speak. The training corpus can be processed to determine the statistical and grammatical language models used by the speech recognition system for converting speech to text, also referred to as decoding speech. It should be appreciated that such methods are known in the art and are disclosed in Statistical Methods for Speech Recognition by Frederick Jelinek (The MIT Press, 1997), which is incorporated herein by reference.
Telegraphic expressions are commonly used as newspaper headlines, as bulleted lists in presentations, or any other place where brevity may be desired. A telegraphic expression is speech that is limited in meaning and produced without inflections or function words. Function words, also called closed-class words, can include determiners such as "a" and "the" and demonstratives such as "this" or "that". Other closed-class words can include pronouns, except for nominative case pronouns such as "he" and "she", auxiliary verbs such as "have", "be", "will", and auxiliary verb derivatives. Closed-class words serve the functional purpose of tying open-class words, called content words, together. For example, the closed-class words within the grammatical text phrase, "the boy has pushed the girl", are "the" and "has". By removing these closed-class words, the resulting text, "boy pushed girl", is said to be a telegraphic expression. Notably, closed-class words, such as demonstratives and pronouns, typically are comprised of a limited number of members. Such words are said to be closed-class words because new functional words are rarely added to a language. Accordingly, the number of closed-class words remains fairly constant.
In contrast to closed-class words, open-class words can contain an infinite number of members. Open-class words can include nouns, verbs, adverbs, and adjectives. These words can be invented and added to a language as a need arises, for example when a new technology is invented.
Human beings can easily and naturally read and speak in terms of telegraphic expressions. Conventional speech recognition systems using grammatical language models, however, can be inaccurate when converting telegraphic speech to text and often introduce errors into the text output. Specifically, because conventional speech recognition systems rely on grammatically based language models, such systems often insert unwanted function words into the textual representation of a received telegraphic user spoken utterance. The unwanted words result in inaccurate decoding of user spoken utterances to text.
The invention disclosed herein concerns a method and a system for use in a speech recognition system for applying a telegraphic language model to a received user spoken utterance. The user spoken utterance can be converted to text, or decoded, using the telegraphic language model. The invention also can include generating the telegraphic language model from an existing training corpus.
In particular, subsequent to generating a telegraphic language model, the speech recognition system can enable or disable decoding using the telegraphic language model, referred to as telegraphic decoding. The speech recognition system can continually calculate a running average of closed-class word confidence scores. If that average falls below a predetermined threshold value, the speech recognition system can begin decoding received user spoken utterances with a conventional grammatically based language model, referred to as a conventional language model, and a telegraphic language model. The resulting text having the highest confidence score can be provided as output text. If the running average later exceeds the threshold value, the speech recognition system can disable the telegraphic decoding. It should be appreciated that if the system has sufficient computational resources, the mechanism for engaging and disabling telegraphic decoding is not necessary. In that case, for example, the speech recognition system can process all received user spoken utterances using both language models, selecting the resulting text having the highest confidence score. Briefly, a confidence score reflects the likelihood that a particular word candidate accurately reflects the user spoken utterance from which the word candidate was derived.
One aspect of the invention can include a method of selecting a language model in a speech recognition system for decoding received user spoken utterances. The method can include the steps of computing confidence scores for identified closed-class words and computing a running average of the confidence scores for a predetermined number of decoded closed-class words. Based upon the running average, the step of selectively enabling telegraphic decoding to be performed can be included. Notably, telegraphic decoding can be enabled in addition to conventional decoding. Also included can be the step of selectively disabling telegraphic decoding based upon the running average.
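The enabling and disabling of telegraphic decoding based upon the running average can be sketched, under stated assumptions, as follows. The class name TelegraphicGate, the window size, the threshold value, and the choice to defer any decision until the window is full are all assumptions made for illustration; the disclosure specifies only that a running average over a predetermined number of closed-class word confidence scores is compared against a predetermined threshold.

```python
from collections import deque

class TelegraphicGate:
    """Tracks closed-class word confidence scores and toggles telegraphic decoding."""

    def __init__(self, window=5, threshold=0.5):
        # window: assumed "predetermined number" of closed-class words to average.
        # threshold: assumed "predetermined threshold value".
        self.scores = deque(maxlen=window)
        self.threshold = threshold
        self.enabled = False  # conventional decoding only, until scores drop

    def update(self, confidence):
        """Record one closed-class word confidence score and return the gate state."""
        self.scores.append(confidence)
        # Assumption: make no decision until the window is full.
        if len(self.scores) == self.scores.maxlen:
            average = sum(self.scores) / len(self.scores)
            if average < self.threshold:
                self.enabled = True    # decode with both language models
            elif average > self.threshold:
                self.enabled = False   # fall back to conventional decoding only
        return self.enabled

gate = TelegraphicGate(window=3, threshold=0.5)
states = [gate.update(score) for score in (0.9, 0.8, 0.7, 0.2, 0.1, 0.9, 0.9)]
print(states)
```

With these illustrative scores the gate remains disabled while closed-class words are recognized confidently, becomes enabled once the three-score average drops below 0.5, and is disabled again when the average recovers.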
Another embodiment of the invention can include a method of decoding received user spoken utterances in a speech recognition system. In that case, the method can include decoding the received user spoken utterance with a conventional language model resulting in a first word candidate and decoding the received user spoken utterance with an alternate language model resulting in a second word candidate. The alternate language model can be a telegraphic language model. Also included can be the steps of computing a confidence score for the first word candidate and the second word candidate. The step of selecting the word candidate having the highest confidence score also can be included. The first word candidate and the second word candidate can be the same word, but have different confidence scores. Also, if the first word candidate and the second word candidate are not the same word but have the same confidence scores, either the first or the second word candidate can be selected.
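The selection step of this embodiment can be sketched as follows. The Candidate type, the function name select_candidate, and the confidence values are hypothetical; the two decoders are assumed to have already produced one candidate each. Because the text permits either candidate to be selected when the scores are equal, the sketch arbitrarily keeps the conventional result in that case.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    text: str          # decoded text from one language model
    confidence: float  # likelihood that the text reflects the utterance

def select_candidate(conventional: Candidate, telegraphic: Candidate) -> Candidate:
    """Return the candidate with the higher confidence score.

    On a tie either candidate may be chosen; this sketch keeps the
    conventional result (an arbitrary, assumed tie-breaking rule).
    """
    if telegraphic.confidence > conventional.confidence:
        return telegraphic
    return conventional

# Illustrative values only: the conventional model inserted unwanted function words.
first = Candidate("the boy pushed the girl", 0.42)
second = Candidate("boy pushed girl", 0.77)
print(select_candidate(first, second).text)  # prints "boy pushed girl"
```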
Another aspect of the invention can include a method of developing a telegraphic language model for use with a speech recognition system for converting telegraphic user spoken utterances to text. In that case, the method can include the steps of loading an existing training corpus into a computer system and revising the training corpus by removing closed-class words from the training corpus. The step of developing a telegraphic language model from the revised training corpus also can be included.
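The corpus revision step can be sketched as follows. The word list CLOSED_CLASS is a small hypothetical sample of determiners, demonstratives, and auxiliary verb forms, not an exhaustive inventory, and the function names are assumptions. A simple word-list filter of this kind also removes words such as "have" when they are used as main verbs; a fuller implementation would likely need part-of-speech information to avoid that.

```python
# Hypothetical, non-exhaustive sample of closed-class words.
CLOSED_CLASS = frozenset({
    "a", "an", "the",                   # determiners
    "this", "that", "these", "those",   # demonstratives
    "have", "has", "had",               # auxiliary "have" and derivatives
    "be", "is", "are", "was", "were", "been",
    "will", "would",
})

def to_telegraphic(sentence, closed_class=CLOSED_CLASS):
    """Drop closed-class words from one whitespace-tokenized sentence."""
    kept = [word for word in sentence.split() if word.lower() not in closed_class]
    return " ".join(kept)

def revise_corpus(corpus, closed_class=CLOSED_CLASS):
    """Return the revised training corpus, skipping sentences left empty."""
    revised = (to_telegraphic(sentence, closed_class) for sentence in corpus)
    return [sentence for sentence in revised if sentence]

print(to_telegraphic("the boy has pushed the girl"))  # prints "boy pushed girl"
```

The revised corpus can then be supplied to the same statistical training procedure used for the conventional language model, yielding a telegraphic language model.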
Another aspect of the invention can include a speech recognition system for converting telegraphic user spoken utterances to text. In that case, the system can include one or more acoustic models. The acoustic models can represent linguistic units for determining one or more word candidates from the telegraphic user spoken utterance. Also included can be one or more language models. The language models can provide contextual information corresponding to the one or more word candidates. Notably, the one or more language models can include one or more telegraphic language models. The speech recognition system further can include a processor which can process the telegraphic user spoken utterances according to the acoustic models and the language models.
Another aspect of the invention can include a machine readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform a series of steps. The steps can include computing confidence scores for identified closed-class words and computing a running average of the confidence scores for a predetermined number of decoded closed-class words. Based upon the running average, the step of selectively enabling telegraphic decoding to be performed can be included. Notably, telegraphic decoding can be enabled in addition to conventional decoding. Also included can be the step of selectively disabling telegraphic decoding based upon the running average.
Another embodiment of the invention can include a machine readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform a series of steps. In that case, the steps can include decoding the received user spoken utterance with a conventional language model resulting in a first word candidate and decoding the received user spoken utterance with an alternate language model resulting in a second word candidate. The alternate language model can be a telegraphic language model. Also included can be the steps of computing a confidence score for the first word candidate and the second word candidate. The step of selecting the word candidate having the highest confidence score also can be included. The first word candidate and the second word candidate can be the same word, but have different confidence scores. Also, if the first word candidate and the second word candidate are not the same word but have the same confidence scores, either the first or the second word candidate can be selected.
Another aspect of the invention can include a machine readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform a series of steps. In that case, the steps can include loading an existing training corpus into a computer system and revising the training corpus by removing closed-class words from the training corpus. The step of developing a telegraphic language model from the revised training corpus also can be included.