1. Field of the Invention
This invention relates generally to the field of speech recognition systems, and in particular, to speech recognition enrollment for non-readers and displayless devices.
2. Description of Related Art
Users of speech recognition programs need to enroll, that is, provide a speech sample for processing by the recognition system, in order to use the speech recognition system with maximum accuracy. When a user can read aloud fluently, it is easy to collect such a sample. When the user cannot read fluently for any reason, or when the speech system does not provide for a display device, collecting such a sample has thus far not been practical. Speech recognition systems can be implemented in connection with telephone and centralized dictation systems, which need not have display monitors as part of the equipment.
Recent years have brought significant improvements to speech recognition software. Speech recognition software, also referred to as a speech recognition engine, constructs text from the acoustic signal of a user's speech, either for purposes of dictation or command and control. Current systems sometimes allow users to speak to the system using a speaker-independent model to allow users to begin working with the software as quickly as possible. However, recognition accuracy is best when a user enrolls with the system.
During normal enrollment, the system presents text to the user, and records the user's speech while the user reads the text. This approach works well provided that the user can read fluently. When the user is not fluent in the language for which the user is enrolling, this approach will not work.
There are many reasons why a user might be less than fluent. The following list is exemplary: the user can be a child who is just beginning to read; the user can be a child or adult having one or more learning disabilities that make reading unfamiliar material difficult; the user can be a person who speaks fluently, but has trouble reading fluently; the user can be enrolling in a system designed to teach the user a second language; and, the user can be enrolling in a system using a device that has no display, so there is nothing to read.
There is a long-felt need to provide speech recognition enrollment for non-readers and for speech systems without display devices.
An enrollment system must have certain properties in addition to those in systems for fluent readers in order to support users who are non-readers and users without access to display devices. In accordance with the inventive arrangements, the most important additional property is an ability to read the text to the user before expecting the user to speak the text. This can be accomplished by using text-to-speech (TTS) tuned to ensure that the audible output faithfully reproduces the words of the text with the correct pronunciation, or by using recorded audio. Given adequate system resources, recorded audio is presently preferred as sounding more natural, but in systems with limited resources, for example handheld devices in a client-server system, TTS can be a better choice.
Thus, the long-felt need of the prior art is satisfied by providing the enrollment text to the user via an audio channel, with adjustments to the standard user interface to provide for an easy-to-understand sequence of events.
A method for enrolling a user in a speech recognition system without requiring reading, in accordance with the inventive arrangements, comprises the steps of: generating an audio user interface having an audible output and an audio input; audibly playing a text phrase; audibly prompting the user to speak the played text phrase; repeating the steps of audibly playing the text phrase and audibly prompting the user to speak, for a plurality of further text phrases; and, processing enrollment of the user based on the audibly prompted and subsequently spoken text phrases.
The method can further comprise the step of audibly playing a further one of the plurality of further text phrases only if the spoken phrase was received.
The method can further comprise the step of repeating the steps of audibly playing the text phrase and audibly prompting the user to speak for the most recently played text phrase if the spoken text phrase was not received.
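The play, prompt, and repeat sequence described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the inventive arrangements: the function name, the `play`, `prompt`, and `listen` callables, the prompt wording, and the `max_attempts` limit are all assumptions introduced for the example. In particular, stopping after a fixed number of failed attempts is a choice made here so the loop terminates; the text itself only requires that a further phrase not be played until the current one has been received.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical prompt wording; the source does not specify the text of prompts.
SPEAK_PROMPT = "Now say the phrase you just heard."


def collect_enrollment_samples(
    phrases: Sequence[str],
    play: Callable[[str], None],          # audibly plays a text phrase
    prompt: Callable[[str], None],        # audibly prompts the user
    listen: Callable[[], Optional[bytes]],  # returns captured audio, or None
    max_attempts: int = 3,                # assumed limit, not from the source
) -> List[Tuple[str, bytes]]:
    """Collect (phrase, audio) pairs, advancing only when speech is received."""
    samples: List[Tuple[str, bytes]] = []
    for phrase in phrases:
        for _attempt in range(max_attempts):
            play(phrase)
            prompt(SPEAK_PROMPT)
            audio = listen()
            if audio:
                # Spoken phrase received: keep it and move to the next phrase.
                samples.append((phrase, audio))
                break
            # Nothing received: repeat the same phrase and prompt.
        else:
            # Every attempt failed; stop rather than advance without a sample.
            return samples
    return samples
```

The collected pairs would then be handed to whatever enrollment processing the recognition engine provides, which is outside the scope of this sketch.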
The method can further comprise the step of audibly prompting the user, prior to the audibly playing step, not to speak while the text phrase is played.
The method can further comprise the step of generating audible user-progress notifications during the course of the enrollment.
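As one illustration of a user-progress notification, the following sketch builds the message that could be spoken to the user at intervals during enrollment. The function name, the message wording, and the `every` interval are hypothetical; the text does not specify when or how often progress is announced.

```python
from typing import Optional


def progress_notification(completed: int, total: int, every: int = 5) -> Optional[str]:
    """Return a message to speak aloud, or None when no notification is due.

    `every` is an assumed interval: a notification is produced each time
    that many phrases have been completed, and once more at the end.
    """
    if total <= 0 or every <= 0 or not 0 <= completed <= total:
        raise ValueError("invalid progress values")
    if completed == total:
        return "You have finished all of the phrases. Thank you."
    if completed > 0 and completed % every == 0:
        remaining = total - completed
        return f"You have finished {completed} of {total} phrases. {remaining} to go."
    return None
```

The returned string would be passed to the same audio output used for prompts, whether that is recorded audio or a text-to-speech engine.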
The method can further comprise the step of audibly prompting the user in a first voice and playing said text phrases in a second voice.
The method can comprise the step of audibly playing at least some of the text phrases from recorded audio, audibly playing at least some of the text phrases with a text-to-speech engine, or both. Similarly, the user can be audibly prompted from recorded audio, with a text-to-speech engine, or both.
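One way to combine the two audio sources is to use a recording when one exists for a phrase and to fall back to text-to-speech otherwise. The sketch below assumes hypothetical `play_audio` and `synthesize` callables and an in-memory mapping of phrase text to recorded audio; none of these names come from the source, and a real system could select between the sources on some other basis.

```python
from typing import Callable, Mapping


def make_phrase_player(
    recordings: Mapping[str, bytes],     # phrase text -> prerecorded audio
    play_audio: Callable[[bytes], None],  # sends audio to the audible output
    synthesize: Callable[[str], bytes],   # text-to-speech engine (assumed API)
) -> Callable[[str], str]:
    """Build a function that plays a phrase and reports which source was used."""

    def play(text: str) -> str:
        clip = recordings.get(text)
        source = "recorded"
        if clip is None:
            # No recording available for this phrase: fall back to TTS.
            clip = synthesize(text)
            source = "tts"
        play_audio(clip)
        return source

    return play
```

The same helper could serve both the text phrases and the audible prompts, since the text allows either to come from recorded audio, a text-to-speech engine, or both.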
The method can further comprise the steps of: generating a graphical user interface concurrently with the step of generating the audio user interface; and, displaying text corresponding to the text phrases and to the audible prompts.
The method can further comprise the steps of: displaying a plurality of icons for user activation; and, selectively distinguishing different ones of the plurality of icons at different times by at least one of: color; shape; and, animation.
A computer apparatus programmed with a set of instructions stored in a fixed medium, for enrolling a user in a speech recognition system without requiring reading, in accordance with the inventive arrangements, comprises: means for generating an audio user interface having an audible output and an audio input; means for audibly playing a text phrase; and, means for audibly prompting the user to speak the played text phrase.
The apparatus can further comprise means for generating audible user-progress notifications during the course of the enrollment.
The means for audibly playing the text phrases can comprise means for playing back prerecorded audio, a text-to-speech engine, or both.
The apparatus can further comprise: means for generating a graphical user interface concurrently with the audio user interface; and, means for displaying text corresponding to the text phrases and to the audible prompts.
The apparatus can further comprise: means for displaying a plurality of icons for user activation; and, means for selectively distinguishing different ones of the plurality of icons at different times by at least one of: color; shape; and, animation.