Methods of recognizing and electronically transcribing human speech are known in the art. (See, for example, The HTK Book, Version 2.1, Steve Young et al., Cambridge University Technical Services Ltd., March 1997, Chapter 1.) They are generally based on storing mathematical models of spoken words, converting incoming utterances into mathematical models, and attempting to match the models of incoming utterances with stored models of words.
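By way of illustration only, the matching step described above can be sketched as follows. The sketch assumes, for simplicity, that each model is a short fixed-length feature vector and that matching selects the stored model at the smallest Euclidean distance; practical recognizers, such as those described in The HTK Book, use hidden Markov models instead. The names STORED_MODELS, distance, and recognize, and all numeric values, are hypothetical.

```python
import math

# Hypothetical stored word models: one fixed-length feature vector per word.
# Real systems use statistical models (e.g., hidden Markov models) instead.
STORED_MODELS = {
    "play": [1.0, 0.0, 0.5],
    "erase": [0.0, 1.0, 0.5],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(utterance_model, stored_models=STORED_MODELS):
    """Return the word whose stored model is closest to the utterance model."""
    return min(
        stored_models,
        key=lambda word: distance(utterance_model, stored_models[word]),
    )
```

In this sketch, an utterance model lying near the stored vector for "play" is recognized as "play", and the quality of recognition depends entirely on how closely the stored models reflect the speaker's actual voice.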
A well-known application of this technology is a dictation program for a personal computer (PC), which allows a user to create a text file by dictating into a microphone rather than by typing on a keyboard. Such a program is typically furnished to the user with associated audio hardware, including a circuit board for inclusion in the user's PC and a microphone for connection to the circuit board.
Typically, a user newly acquiring a dictation program “trains” it (i.e., spends several hours dictating text to it). The program uses the training speech stream for two purposes: i) to determine the spectral characteristics of the user's voice (as delivered through the particular supplied microphone and circuit board) for its future use in converting the user's utterances to mathematical models; and ii) to determine words spoken by the particular user that the program has difficulty matching with its stored mathematical models of words.
A speech-recognition program, such as a dictation program, is typically supplied with a library of stored word models derived from the speech of a large number of speakers. These are known as speaker-independent models. For most users, there are some words that do not match the speaker-independent models. For some users, this failure to match the models may be because of accents, regional speech variations, or vocal anomalies. Such users will be referred to herein as “non-standard users”.
For words of a particular user that are identified during the training phase as difficult to match reliably against speaker-independent models, the dictation program “learns” (i.e., derives and stores) word models from the particular user. These are known as speaker-dependent models or user-trained models. The user-trained model for a word is stored in place of the original speaker-independent model for that word, which is no longer used for recognizing the particular user's speech. Non-standard users typically require a greater number of user-trained models than standard users.
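The substitution of user-trained models for speaker-independent models described above can be sketched, under stated assumptions, as a per-user override of a shared model library. The class ModelLibrary, its methods, and the string placeholders standing in for models are hypothetical and serve only to illustrate the lookup order.

```python
class ModelLibrary:
    """Shared speaker-independent models with per-user overrides."""

    def __init__(self, speaker_independent):
        # word -> speaker-independent model, shared by all users
        self.speaker_independent = dict(speaker_independent)
        # user_id -> {word -> user-trained model}
        self.user_trained = {}

    def learn(self, user_id, word, model):
        """Store a user-trained model for one word of one user."""
        self.user_trained.setdefault(user_id, {})[word] = model

    def model_for(self, user_id, word):
        """Prefer the user's own model; otherwise fall back to the shared one."""
        user_models = self.user_trained.get(user_id, {})
        if word in user_models:
            return user_models[word]
        return self.speaker_independent[word]
```

For example, after `learn("alice", "play", ...)` is called, a lookup of "play" for the hypothetical user "alice" returns her user-trained model, while every other user, and every other word for "alice", still resolves to the speaker-independent model.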
An emergent application of speech recognition is in voice messaging systems. The traditional means for a user to access such a system is to dial in by telephone and request message services by pressing keys on the telephone's keypad (e.g., “1” might connote PLAY, “2” might connote ERASE, etc.). The user may first be required to provide an identification of himself and enter a password, or the system may assume an identity for the user based on the extension from which he calls.
Applications are emerging wherein a user operates the voice messaging system by voice commands (e.g., by saying the words PLAY, ERASE, etc.) rather than by pressing code keys on the keypad. To initiate a call, a user might speak the called party's number or name rather than “dial” the number by pressing keypad digits. Typically, a manufacturer-defined default set of voice commands may be uttered by users to operate the system. This set of commands must typically be learned by the user, to allow the user to effectively operate the system. This learning is often quite cumbersome for users, who, as a result, may not fully utilize available commands and features. This learning difficulty is compounded by the fact that each manufacturer uses its own set of commands. A user's migration to a new system is thus often accompanied by a need to learn a new set of commands.
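The two means of access described above, keypad codes and a fixed default set of voice commands, can be sketched as two lookup tables resolving to the same commands. The table contents follow the PLAY and ERASE examples given above; the names KEYPAD_COMMANDS, VOICE_COMMANDS, and resolve_command are hypothetical, and the sketch assumes that the spoken word has already been recognized and delivered as text.

```python
# Keypad codes, per the example above ("1" for PLAY, "2" for ERASE).
KEYPAD_COMMANDS = {"1": "PLAY", "2": "ERASE"}

# A hypothetical manufacturer-defined default set of voice commands.
VOICE_COMMANDS = {"play": "PLAY", "erase": "ERASE"}


def resolve_command(user_input):
    """Map a keypad digit or a recognized spoken word to a command name.

    Returns None when the input is not in either default table.
    """
    token = user_input.strip().lower()
    if token in KEYPAD_COMMANDS:
        return KEYPAD_COMMANDS[token]
    return VOICE_COMMANDS.get(token)
```

Because the voice table is fixed by the manufacturer, a user who says a word outside the default set (for instance "delete" instead of "erase") obtains no command in this sketch, which illustrates why the default set must be learned.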
As well, there are difficulties encountered in recognizing speech in a voice messaging system that are not encountered in a dictation system, including, for example: i) users may find it onerous to expend several hours training a voice messaging system; ii) unlike the single microphone and audio circuit board of a dictation system, users of a voice messaging system might call the system from many different telephone instruments which might connect over paths differing in quality from call to call, and which might use different kinds of networks from call to call; and iii) for many users, the default set of commands used to navigate through the options available in a voice messaging system is not intuitive. These difficulties compound the difficulties with recognition of utterances from non-standard users.
An approach that has been tried to aid the recognition of utterances by non-standard users is to regenerate the speaker-independent models, including the speech of one or more non-standard users along with the previous sampling of users. This is time-consuming and costly, and may actually degrade the models.
Another approach that has been tried is to eliminate the speaker-independent models and match user utterances against a speaker-dependent set of word models specifically created for each non-standard user. This approach, although feasible with the limited vocabulary that may be required in a voice messaging system, does not take advantage of the large amount of work that has been done in the course of preparing speaker-independent models in the areas of modeling the audio characteristics of various speech transmission media (e.g., telephone lines), or in modeling the co-articulation that occurs in streams of continuous speech.
There is thus a need for a speech-recognition system that is based on a speaker-independent set of stored word models but which can adapt in a speaker-dependent manner to a non-standard speaker without a long training period.