Speech recognition is becoming more common in electronic devices. With a mobile terminal that has a multimodal interface, that is, a visual phone display enhanced with speech recognition capabilities, a user can not only use voice commands to activate certain phone functions but can also input text, such as an SMS (short message service) entry, by dictation. Such a device either uses a local automatic speech recognition (ASR) engine to process the speech or sends the speech to a remote ASR engine residing in the network. The speech recognition engine for dictation usually uses a very large grammar that includes tens of thousands of words to allow for a reasonable range of content and scope for the dictated text. For example, the user may wish to send a cooking recipe or to express a political viewpoint.
It is quite common that, after dictation, the user wishes to edit the text as recognized and transcribed by the speech recognition engine, either to correct inaccurate recognition results or to make content changes. In general, a terminal device does not have a very large memory. The dictation and editing processes both require a very large grammar, which makes them impractical to carry out in a terminal device.
It should be noted that “vocabulary”, as used in this disclosure, refers to a list of recognized words or phrases, and a subset of the vocabulary is referred to as “grammar”. In addition to words and phrases, the grammar may contain editing rules and commands.
In a desktop or laptop electronic device, a pointing device, such as a mouse, a joystick or a touch pad, is commonly used to locate the word or words in the text to be edited. In a terminal device, such a pointing device may be impractical and is thus rarely provided. On a phone pad, arrow keys are typically provided for locating the letter in the text to be edited. However, moving the cursor to the editing location using arrow keys is slow and inconvenient. Thus, it is advantageous and desirable to provide a method and system for text editing using voice commands.
In order to avoid using a large grammar for speech recognition, Masters (U.S. Pat. No. 6,301,561) discloses a discrete speech recognition system for use in selecting radio stations, wherein a small default grammar has a small number of first tier words or utterances, each of which represents a subset of words or utterances of the second tier. Each of the second tier words or utterances represents a subset of words or utterances of the third tier, and so on. When one of the first tier words is selected by a user by voice, a plurality of words or utterances in the second tier subset represented by the selected first tier word are added to the grammar, thereby enlarging the grammar. When one of the second tier words is further selected by the user by voice, a plurality of words or utterances in the third tier subset represented by the selected second tier word are further added to the grammar, thereby further enlarging the grammar. The words or utterances of the second and third tiers are stored in a vocabulary that has a complete list of pre-defined utterances that are recognizable by a speech recognition engine. As such, the grammar that is actually used for carrying out a function includes only a small portion of the pre-defined utterances in the vocabulary. While the speech recognition system disclosed in Masters is useful in reducing the time needed for speech recognition by keeping the grammar small, its usefulness is limited to certain applications, such as selecting radio stations, where a small set of pre-defined words or utterances identifying the cities and the broadcasting frequencies in a limited vocabulary is sufficient to suit the purposes. However, this type of limited vocabulary is usually insufficient for editing text, the scope and content of which are difficult to predict.
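The tiered enlargement of the grammar described above can be illustrated with a minimal sketch. It is not the implementation disclosed in Masters; the class name TieredGrammar, the method names, and the station data are all hypothetical and chosen only to show how selecting a word adds the subset it represents to the active grammar while the full vocabulary stays fixed.

```python
# Hypothetical tiered vocabulary: each word maps to the subset of
# next-tier words it represents. All entries are illustrative placeholders.
TIERS = {
    "texas": ["dallas", "houston"],      # first tier -> second tier
    "ohio": ["columbus", "cleveland"],
    "dallas": ["98.7", "102.1"],         # second tier -> third tier
    "houston": ["94.5"],
    "columbus": ["97.9"],
    "cleveland": ["100.7"],
}
DEFAULT_GRAMMAR = ["texas", "ohio"]      # small default grammar (first tier)


class TieredGrammar:
    """Active grammar that grows as the user selects words by voice."""

    def __init__(self, tiers, default_words):
        self.tiers = tiers
        self.active = set(default_words)

    def vocabulary(self):
        """Return the complete list of pre-defined utterances."""
        words = set(self.tiers)
        for subset in self.tiers.values():
            words.update(subset)
        return words

    def select(self, word):
        """Accept a word only if it is in the active grammar, then add
        the next-tier subset it represents. Returns True if accepted."""
        if word not in self.active:
            return False
        self.active.update(self.tiers.get(word, []))
        return True


grammar = TieredGrammar(TIERS, DEFAULT_GRAMMAR)
grammar.select("texas")    # adds the second tier subset for "texas"
grammar.select("dallas")   # adds the third tier subset for "dallas"
```

In this sketch the active grammar remains a subset of the vocabulary at every step, which is what keeps recognition fast, and it also shows why the approach depends on the whole word hierarchy being known in advance.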
Thus, it is advantageous and desirable to provide a method and a system for text editing in a small electronic device where memory constraints do not allow a large grammar to be implemented in the device.