Speech recognition, often referred to as automatic speech recognition (ASR), is now widely used in many types of apparatus, such as mobile communication terminals. Speech recognition applications that enable dictation of words and commands are becoming increasingly attractive to users as terminals gain more computational power and memory.
One purpose of mobile dictation is to provide an alternative means of entering information (e.g. text) into personal communication devices with limited size and keyboard facilities. A robust speech recognition system may make it possible to manufacture smaller devices by omitting the keyboard entirely, or at least minimizing it.
However, ASR technology is far from perfect, and recognition errors will remain a problem for the foreseeable future. It is therefore important to minimize the impact of recognition errors, not least for the convenience of the user.
State-of-the-art embedded speech recognition systems for command and control (e.g. name dialing) can reach accuracies of 95-99%. Free dictation, however, is a much more demanding task: the average word-level accuracy of current embedded dictation systems is in the range of 80% to 90%. Performance depends on many factors, such as speaking style and noise level. It can be improved by limiting the dictation domain (e.g. to the personal communication style of messages), which yields a relatively small and accurate language model, by using the device in an acoustically clean environment, and by fine-tuning the recognition engine.
Embedded dictation applications are inherently complex, with many user-interface and recognition-engine parameters that need optimization. Many of these parameters can be pre-tuned offline, e.g. using large speech test databases in workstation simulations. However, parameters related to user interaction are very difficult to tune that way.
Known embedded dictation systems operate in an isolated-word manner, in which the user must leave short pauses, typically 0.5 to 2 seconds, between words. After each word, a list of candidate words is shown on the display for a predefined timeout period. The user can accept a word by pressing a key on a keypad, or similar, during the timeout, or simply by waiting until the timeout elapses.
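The per-word acceptance rule described above can be modeled as a small decision function. The following is a minimal sketch under stated assumptions, not the behavior of any particular product: the names `Selection` and `accept_word` are hypothetical, the first candidate is assumed to be the recognizer's top hypothesis, and a key press arriving after the timeout is assumed to be ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Selection:
    """Hypothetical user action: choose candidate `index` at `time_s`
    seconds after the candidate list is displayed."""
    index: int
    time_s: float


def accept_word(candidates: Sequence[str], timeout_s: float,
                selection: Optional[Selection] = None) -> str:
    """Return the word accepted for one isolated-word utterance.

    Assumption: candidates[0] is the recognizer's top hypothesis.
    A selection made before the timeout elapses wins; otherwise the
    top hypothesis is accepted automatically when the timeout expires.
    """
    if not candidates:
        raise ValueError("recognizer returned no candidates")
    if selection is not None and selection.time_s < timeout_s:
        return candidates[selection.index]
    # No timely key press: the displayed top candidate is accepted.
    return candidates[0]


# A user who needs 1.5 s to pick the second candidate succeeds with a
# 2 s timeout, but with a 1 s timeout the top (possibly wrong) word
# has already been accepted.
print(accept_word(["wreck", "recognize"], 2.0, Selection(1, 1.5)))
print(accept_word(["wreck", "recognize"], 1.0, Selection(1, 1.5)))
```

The fixed `timeout_s` parameter is exactly the user-interaction setting that is hard to choose in advance, since the user's reaction time varies from person to person.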
A problem arises, however, because the optimum timeout period differs between novice and advanced users. A timeout that is too short may frustrate novice users, since incorrect words may be accepted before the user can react and select a candidate from the displayed list. Conversely, a timeout that is too long slows down dictation for advanced users, or forces them to press a key, e.g. the joystick, for every word, even when the top candidate is correct. This may be perceived as unnecessary forced interaction with the application and is therefore inconvenient from the user's point of view.
Depending on the specific implementation, the timeout period may be changed manually via a dictation settings menu. However, such operations are often considered cumbersome and inconvenient, and a typical user may not even be aware of how to change such settings in the terminal.