1. Technical Field
The present invention relates to speech recognition systems, and in particular, to a method for selectively training in-vehicle speech recognition systems to adapt to the speech characteristics of individual speakers.
2. Description of the Related Art
Speech recognition systems on board automobiles permit drivers and passengers to control various vehicle functions by speaking words and phrases corresponding to voice commands. One or more microphones placed within the passenger cabin receive audio signals representing the spoken words and phrases. Speech engine recognition algorithms employing various acoustic and language modeling techniques are used to process the audio signals and identify a matching voice command contained in one or more stored command grammar sets. The voice command is then transmitted to a suitable control for operating any number of vehicle functions and accessories, such as power windows, locks and climate control devices.
The efficacy of any speech recognition system is largely measured in terms of recognition accuracy, i.e., whether the system correctly matches a voice command to a spoken utterance. Generally, speech recognition is a difficult problem due to the wide variety of speech/phonetic characteristics, such as pronunciation, dialect and diction, of individual speakers. This is especially true for in-vehicle speech recognition systems since vehicles typically carry a number of passengers. Moreover, the acoustic conditions within the vehicle cabin can vary due to engine and road noise, for example from passing traffic and sirens, as well as weather conditions such as wind, rain and thunder, which makes speech recognition particularly challenging.
Acoustic, lexical and language models are typically included in speech engines to aid in the recognition process by reducing the search space of possible words and resolving ambiguities between similar-sounding words and phrases. These models tend to be statistically based systems and can be provided in a variety of forms. Acoustic models may include acoustic signatures or waveform models of the audio signals corresponding to each command. Lexical and language models typically include algorithms instructing the speech engine as to the command word choice and grammatical structure. For example, a simple language model can be specified as a finite state network, where the permissible words following each word are given explicitly. However, more sophisticated, context-specific language models also exist.
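By way of illustration only, a finite state network of the kind mentioned above might be sketched as follows. The vocabulary, the `GRAMMAR` table and the `is_permitted` helper are hypothetical names chosen for this example and do not describe any particular speech engine.

```python
# Hypothetical finite state network: each word maps to the set of words
# that are permitted to follow it. START and END are sentinel states.
START, END = "<s>", "</s>"

GRAMMAR = {
    START: {"dome", "window"},
    "dome": {"light"},
    "light": {"on", "off"},
    "window": {"up", "down"},
    "on": {END},
    "off": {END},
    "up": {END},
    "down": {END},
}


def is_permitted(phrase: str) -> bool:
    """Return True if every word-to-word transition in the phrase is listed in GRAMMAR."""
    state = START
    for word in phrase.lower().split() + [END]:
        if word not in GRAMMAR.get(state, set()):
            return False
        state = word
    return True
```

In this sketch, "dome light on" is accepted, while "dome window on" is rejected because "window" is not listed as a permissible successor of "dome".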
To improve recognition accuracy, conventional in-vehicle speech recognition systems permit these models to be adapted to a speaker's phonetic characteristics by performing a training routine. Typically, such training routines begin with the speaker directing the system to enter a training mode. The system prompts the speaker with a number of predetermined or random voice commands and instructs the speaker to say each command. The system then adapts the entire set of speech commands according to the variance of the spoken words from the models for the corresponding speech commands. Since the entire set of speech commands is being adapted, however, a high number of iterations is required in order to provide the system an adequate sampling of the speaker's speech characteristics. Typically, such training routines include at least 20-40 command prompt and response iterations.
This technique can be inconvenient and time consuming for the user due to the numerous training command input iterations. The training routine can be particularly distracting to a driver, such that it may be inappropriate for a driver to perform the routine while the vehicle is in motion. Moreover, the above technique can be ineffective for correcting particularly problematic words that are repeatedly mis-recognized. This is because the technique is designed to broadly tune the speech recognition system to a given speaker's phonetic characteristics.
Accordingly, there is a need for a simple and effective technique for adapting an in-vehicle speech recognition system to correct incorrectly recognized voice commands.
The present invention provides a method for improving the recognition accuracy of an in-vehicle speech recognition system by adapting its speech engine to a speaker's speech characteristics as needed to recognize a particular voice command and target specific problematic words or phrases. The method employs an N-best matching technique to provide a list of known car commands that most closely match a spoken utterance. When the speaker selects the intended or correct car command from the N-best matches, the spoken utterance is used to adapt the speech engine as needed to automatically recognize this car command.
Specifically, the present invention is a method for the selective speaker adaptation of an in-vehicle speech recognition system used to operate vehicle accessories by voice. The method includes the steps of: receiving from a speaker a spoken utterance having speaker-dependent speech characteristics and relating to one of a set of known car commands; processing the spoken utterance according to a recognition parameter; identifying an N-best set of known car commands matching the processed spoken utterance; outputting the N-best command set to the speaker; receiving speaker input selecting a correct car command from the N-best command set; and adjusting the recognition parameter so that the speech recognition system adapts to the speaker by recognizing as the correct car command a spoken utterance having the speech characteristics of the spoken utterance. The method further includes performing an accessory operation corresponding to the correct car command.
In one aspect of the present invention, the recognition parameter is an acoustic waveform model and the spoken utterance speech characteristics include a speaker-dependent acoustic signature. In this case, the speech engine is adapted by substituting the acoustic signature for the waveform model of the correct car command. Alternatively, the recognition parameter is a phonetic classification set and the speech engine is adapted by altering the phonetic classification set according to the spoken utterance speech characteristics for the correct car command.
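The method steps and the substitution-style adaptation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: utterances and stored models are represented as short feature vectors, matching uses plain Euclidean distance as a stand-in for a real acoustic comparison, and the names `n_best`, `recognize_and_adapt` and `choose` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

# Assumption: a processed utterance and a stored command model are both
# represented as short feature vectors rather than real waveform models.
Features = Sequence[float]


def n_best(utterance: Features, models: Dict[str, Features], n: int = 3) -> List[str]:
    """Return the n known commands whose stored model lies closest to the utterance."""
    ranked = sorted(models, key=lambda cmd: math.dist(utterance, models[cmd]))
    return ranked[:n]


def recognize_and_adapt(
    utterance: Features,
    models: Dict[str, Features],
    choose: Callable[[List[str]], int],
    n: int = 3,
) -> str:
    """Present the N-best set, take the speaker's selection, and adapt the model."""
    candidates = n_best(utterance, models, n)
    correct = candidates[choose(candidates)]  # speaker input selects the correct command
    # Adaptation by substitution: the speaker's signature replaces the stored model.
    models[correct] = list(utterance)
    return correct
```

Here `choose` stands for whatever mechanism returns the index of the speaker's selection. After the call, the same utterance matches the selected command at distance zero, so that command ranks first on the next attempt.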
In another aspect of the invention, the N-best command set may be displayed on an instrument panel display and the speaker selection input is via an input device. Alternatively, the N-best command set may be output audibly via a loudspeaker, such as in a vehicle audio system, by processing text-to-speech algorithms and/or pre-recorded speech files. In this case, the audible output includes identifiers for each N-best command that the speaker may utter as speaker selection input.
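As a rough sketch of the audible variant, each N-best command can be paired with a short spoken identifier that the speaker utters to make the selection. The identifier words, the prompt wording and the function names below are assumptions made for illustration only.

```python
from typing import List, Optional

# Hypothetical spoken identifiers, one per position in the N-best list.
IDENTIFIERS = ["one", "two", "three", "four", "five"]


def build_prompt(candidates: List[str]) -> str:
    """Build the text that a text-to-speech stage would read to the speaker."""
    parts = [f"say {label} for {cmd}" for label, cmd in zip(IDENTIFIERS, candidates)]
    return "; ".join(parts)


def resolve_selection(spoken: str, candidates: List[str]) -> Optional[str]:
    """Map a spoken identifier back to the command it labels, or None if no match."""
    label = spoken.strip().lower()
    if label in IDENTIFIERS:
        index = IDENTIFIERS.index(label)
        if index < len(candidates):
            return candidates[index]
    return None
```

Short, acoustically distinct identifiers are used in this sketch so that the selection itself is unlikely to be mis-recognized.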
In yet another aspect, the method can include receiving a training mode input from the speaker, such that the outputting, selecting and adapting steps are only performed upon receiving the training mode input. The training mode input can be performed in a variety of ways, including operating a training mode control button and issuing a training mode voice command. The training mode voice command can be a dedicated word or phrase, such as "train" or "learn word". Or, it may be any spoken utterance for which the accessory operation corresponding to the spoken command, as recognized by the speech engine, has already been performed. For example, if the speech engine recognizes a spoken phrase as "dome light on" when the dome light is already on, it can interpret this as a mis-recognition error and enter the training mode. Moreover, the training mode input can be a spoken utterance repeated in succession, such as "dome light on ... dome light on". Repeated phrases could be deemed training mode input for only selected voice commands that are not typically issued in succession and/or only when the expected accessory operation has already been performed.
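The training mode triggers described above might be combined along the following lines. This is a sketch under assumptions: the `COMMAND_EFFECTS` table, the `NOT_REPEATED` set and the `is_training_input` function are hypothetical, and only the first of the two restrictions on repeated phrases is modeled.

```python
from typing import Dict, Optional, Tuple

# Dedicated training phrases named in the text.
TRAIN_PHRASES = {"train", "learn word"}

# Hypothetical table: command -> (accessory, state the command would produce).
COMMAND_EFFECTS: Dict[str, Tuple[str, bool]] = {
    "dome light on": ("dome light", True),
    "dome light off": ("dome light", False),
}

# Hypothetical set of commands that are not normally issued twice in a row.
NOT_REPEATED = {"dome light on", "dome light off"}


def is_training_input(
    recognized: str,
    previous: Optional[str],
    accessory_state: Dict[str, bool],
    button_pressed: bool = False,
) -> bool:
    """Decide whether the recognized utterance should start the training mode."""
    # Explicit triggers: the control button or a dedicated voice command.
    if button_pressed or recognized in TRAIN_PHRASES:
        return True
    # Implicit trigger: the requested operation has already been performed.
    effect = COMMAND_EFFECTS.get(recognized)
    if effect is not None:
        accessory, target = effect
        if accessory_state.get(accessory) == target:
            return True
    # Implicit trigger: a selected command repeated in succession.
    return previous == recognized and recognized in NOT_REPEATED
```

With the dome light already on, a recognized "dome light on" is treated as a likely mis-recognition, whereas a repeated command outside `NOT_REPEATED` is processed normally.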
In still another aspect, the method of the present invention can include assigning a match probability weighting to each of the known car commands in the N-best command set. Preferably, one of the N-best car commands has a highest match probability weighting, in which case, the adaptation is performed only if the speaker does not select the highest match probability command as the correct car command.
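The weighting and the conditional adaptation described above can be illustrated with a short sketch. It assumes that the speech engine supplies a positive raw match score per command and that weights are obtained by simple normalization; both assumptions, and the function names, are illustrative rather than taken from the method itself.

```python
from typing import Dict


def match_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize positive raw match scores so that the weights sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    return {cmd: score / total for cmd, score in scores.items()}


def should_adapt(weights: Dict[str, float], selected: str) -> bool:
    """Adapt only when the speaker's choice is not the highest-weighted command."""
    top = max(weights, key=weights.get)
    return selected != top
```

If the speaker confirms the highest-weighted command, the engine already ranks it first and no adaptation is triggered; any other selection indicates a mis-ranking worth correcting.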
Thus, the present invention provides a simple and quick method of selectively adapting a speech engine to recognize a particular voice command according to the speech characteristics of the speaker. By adapting the speech engine according to the correlation of the spoken utterance to the intended or correct voice command, this method permits the speaker to correct the misrecognition of specific voice commands. Moreover, since it adapts the speech engine to an already spoken utterance, this method can eliminate the need for a lengthy, iterative training routine requiring the speaker to respond to a number of training command prompts.
These and still other advantages of the present invention will be apparent from the description of the preferred embodiments which follow.