This invention pertains generally to speech recognition, and more particularly to methods of recognizing non-standard speech.
Methods of recognizing and electronically transcribing human speech are known in the art. (See, for example, The HTK Book, Version 2.1, Steve Young et al., Cambridge University Technical Services Ltd., March 1997, Chapter 1.) They are generally based on storing mathematical models of spoken words, converting incoming utterances into mathematical models, and attempting to match the models of incoming utterances with stored models of words.
A well-known application of this technology is a dictation program for a personal computer (PC), which allows a user to create a text file by dictating into a microphone, rather than by typing on a keyboard. Such a program is typically furnished to the user with associated audio hardware, including a circuit board for inclusion in the user's PC and a microphone for connection to the circuit board.
Typically, a user newly acquiring a dictation program "trains" it (i.e., spends several hours dictating text to it). The program uses the training speech stream for two purposes: i) to determine the spectral characteristics of the user's voice (as delivered through the particular supplied microphone and circuit board) for its future use in converting the user's utterances to mathematical models; and ii) to determine words spoken by the particular user that the program has difficulty matching with its stored mathematical models of words.
A speech-recognition program, such as a dictation program, is typically supplied with a library of stored word models derived from the speech of a large number of speakers. These are known as speaker-independent models. For most users, there are some words that do not match the speaker-independent models. For some users, this may be because of accents, regional speech variations, or vocal anomalies. Such users will be referred to herein as "non-standard users".
For words of a particular user that are identified during the training phase as difficult to match reliably against speaker-independent models, the dictation program "learns" (i.e., derives and stores) word models from the particular user. These are known as speaker-dependent models or user-trained models. The user-trained model for a word is stored in place of the original speaker-independent model, which is no longer used for recognizing the particular user's speech. Non-standard users typically require a greater number of user-trained models than standard users.
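The replacement behavior described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a word model can be represented as a short tuple of numbers and that the library is a mapping from word to model; the names `build_user_library`, `speaker_independent`, and `user_trained` are hypothetical and do not come from any particular dictation program.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for a word model: a short feature vector.
Model = Tuple[float, ...]


def build_user_library(
    speaker_independent: Dict[str, Model],
    user_trained: Dict[str, Model],
) -> Dict[str, Model]:
    """Return the library used for one user.

    Each user-trained model is stored in place of the speaker-independent
    model for the same word; all other words keep the shared model.
    """
    library = dict(speaker_independent)  # copy, so the shared set is untouched
    library.update(user_trained)         # user-trained models replace shared ones
    return library


# Illustrative data only.
speaker_independent = {"play": (1.0, 0.0), "erase": (0.0, 1.0)}
user_trained = {"erase": (0.3, 0.7)}
library = build_user_library(speaker_independent, user_trained)
```

In this sketch the speaker-independent model for "erase" is no longer consulted for this user, which mirrors the conventional behavior described in the preceding paragraph.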
An emergent application of speech recognition is in voice messaging systems. The traditional means for a user to access such a system is to dial in by telephone, and request message services by pressing keys on the telephone's keypad (e.g., "1" might connote PLAY, "2" might connote ERASE, etc.). The user may first be required to provide an identification of himself and enter a password, or the system may assume an identity for the user based on the extension from which he calls.
Applications are emerging wherein a user operates the voice messaging system by voice commands, e.g., by saying the words PLAY, ERASE, etc., rather than by pressing code keys on the keypad. To initiate a call, a user might speak the called party's number or name rather than "dial" the number by pressing keypad digits.
There are some difficulties encountered in speech recognition in a voice messaging system that are not encountered in a dictation system: i) users would find it onerous to expend several hours training a voice messaging system; ii) unlike the single microphone and audio circuit board of a dictation system, users of a voice messaging system might call the system from many different telephone instruments which might connect over paths differing in quality from call to call, and which might use different kinds of networks from call to call. These difficulties compound the difficulties with recognition of utterances from non-standard users.
An approach that has been tried to aid the recognition of utterances by non-standard users is to regenerate the speaker-independent models, including the speech of one or more non-standard users along with the previous sampling of users. This is time-consuming and costly, and may actually degrade the models.
Another approach that has been tried is to eliminate the speaker-independent models and match user utterances against a speaker-dependent set of word models specifically created for each non-standard user. This approach, although feasible with the limited vocabulary that may be required in a voice messaging system, does not take advantage of the large amount of work that has been done in the course of preparing speaker-independent models in the areas of modeling the audio characteristics of various speech transmission media (e.g., telephone lines), or in modeling the co-articulation that occurs in streams of continuous speech.
There is thus a need for a speech recognition system that is based on a speaker-independent set of stored words but which can adapt in a speaker-dependent manner to a non-standard speaker without a long training period.
Accordingly it is an object of the present invention to provide improved recognition of utterances from a non-standard speaker.
It is a further object of the present invention to provide a speech recognition system based on a speaker-independent set of stored words which can adapt in a speaker-dependent manner to utterances from a non-standard speaker.
It is a further object of the present invention to provide speech recognition that does not require a long training period.
It is a further object of the present invention to provide reliable speech recognition of user utterances in conjunction with a large variety of transmission media.
These and other objects of the invention will become apparent to those skilled in the art from the following description thereof.
In accordance with the teachings of the present invention, these and other objects may be accomplished by the present system of speech recognition in which an incoming audio signal is compared against stored models of words, reporting as words portions of the audio signal matching stored models, practiced with the present method of providing a set of stored word models derived from utterances of many users and for use by all users, and providing for further use by certain users second sets of stored word models, each set derived from the utterances of one of the certain users and for use only in association with audio signal from that one of the certain users. A portion of incoming audio signal matching a stored model from either set is reported as the corresponding word.
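The two-set matching summarized above may be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: word models are assumed to be short numeric vectors, matching is assumed to be a Euclidean distance compared against a threshold, and the names `recognize`, `shared_models`, `user_model_sets`, and `threshold` are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence

# Hypothetical stand-in for a word model: a short feature vector.
Model = Sequence[float]


def recognize(
    features: Model,
    user_id: str,
    shared_models: Dict[str, Model],
    user_model_sets: Dict[str, Dict[str, Model]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Report the word whose stored model best matches the features.

    The shared set is used for all users. A second set, if one exists for
    this user, is consulted in addition to the shared set (not instead of
    it). A match from either set is reported as the corresponding word.
    """
    candidates = list(shared_models.items())
    # The per-user set is used only with audio from that user.
    candidates += list(user_model_sets.get(user_id, {}).items())

    best_word, best_dist = None, float("inf")
    for word, model in candidates:
        d = math.dist(features, model)  # assumed matching measure
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None


# Illustrative data only.
shared_models = {"play": (1.0, 0.0), "erase": (0.0, 1.0)}
user_model_sets = {"user_a": {"erase": (0.6, 0.6)}}

utterance = (0.55, 0.65)
with_user_set = recognize(utterance, "user_a", shared_models, user_model_sets)
without_user_set = recognize(utterance, "user_b", shared_models, user_model_sets)
```

In this sketch the same utterance is reported as "erase" for the user who has a second set and is not recognized for a user who has none, while both users continue to benefit from the shared models for other words.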