Speech recognizers, and speech recognition systems and methods, are known and have been used in a variety of applications. One application is a state-of-the-art telephone network that offers services based upon speech interaction between a telephone subscriber and the telephone network.
In such a telephone network, speech responses by the subscriber are used to directly invoke system operations which previously required key or dial entry. An example of such a service is speech activated auto-dialing.
In this type of dialing, a subscriber is able to access a speech server coupled to a central office switch of the telephone network. The speech server is in turn able to recognize a voice message spoken by the subscriber. A telephone number associated with the recognized message is then transmitted by the speech server to the central office switch. The central office switch then proceeds to interconnect the subscriber with the location corresponding to that telephone number, as if the telephone number had been keyed or dialed in conventional fashion by the subscriber.
An integral part of the speech server is a speech recognizer used to recognize a voice message of the subscriber. In a particular application of this invention, the speech recognizer operates on PCM (Pulse Code Modulation) digital signals, a standard digital telephony representation of a voice message, which are formed from voice samples derived from the voice message. As used herein, the term "voice message" refers to a single isolated word or a short utterance (e.g., two or three words).
In this type of application, a speech recognizer is usually required to recognize only a limited number of words or voice messages from each subscriber. The recognizer is in most cases initially trained based on repeated entries by the subscriber of the voice messages which are desired to be later recognized. The PCM digital signals representing the samples of a voice message are converted to a linear digital representation of speech and are processed by certain known DSP (Digital Signal Processing) algorithms. As a result of DSP, a "template" or model is developed which is indicative of the corresponding voice message. As used herein, the term "template" refers to a model formed from any features obtained through any known DSP algorithms. Usually, two or more templates for each voice message to be recognized are stored by the speech recognizer.
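The conversion of PCM signals to a linear representation mentioned above can be illustrated with a minimal sketch. The text does not say which companding law is used, so this example assumes 8-bit G.711 mu-law PCM, which is common in North American digital telephony. The function names ulaw_to_linear and decode_ulaw are hypothetical and chosen only for this illustration.

```python
BIAS = 0x84  # bias added by the mu-law encoder before segment coding


def ulaw_to_linear(code):
    """Expand one 8-bit mu-law code (assumed G.711) to a signed linear sample."""
    code = ~code & 0xFF            # mu-law codes are transmitted bit-inverted
    sign = code & 0x80             # top bit carries the sign
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data):
    """Expand a bytes object of mu-law codes into a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


samples = decode_ulaw(bytes([0xFF, 0x80, 0x00]))
```

The resulting linear samples would then be passed to whatever DSP and feature extraction algorithms the recognizer uses to build a template. That stage is not sketched here because the text leaves the choice of algorithm open.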
During recognition, inputted PCM speech undergoes the same conversion and processing as in the training mode and results in a so-called "token" of the inputted utterance (or voice message). As used herein, the term "token" refers to a multi-dimensional feature vector resulting from any kind of DSP and feature extraction algorithms. The token is then compared with the previously stored templates and when a sufficient match is realized, the voice message is recognized as that indicated by the matched template. This completes the recognition process.
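The comparison of a token with stored templates can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular recognizer: each token and template is reduced to a single fixed-length feature vector, the match score is Euclidean distance, and a "sufficient match" means a distance at or below a threshold. Practical isolated word recognizers typically compare variable-length sequences of feature frames. The names recognize, templates, and threshold are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(token, templates, threshold):
    """Return the voice message whose template is closest to the token,
    or None when no template is within the threshold."""
    best_label, best_dist = None, float("inf")
    for label, vectors in templates.items():
        for vec in vectors:
            d = distance(token, vec)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Two stored templates per voice message, as described in the text.
templates = {
    "call mom": [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5]],
    "call office": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.3]],
}

matched = recognize([0.95, 0.05, 0.5], templates, threshold=0.3)
rejected = recognize([5.0, 5.0, 5.0], templates, threshold=0.3)
```

Here the first token lies close to a "call mom" template and is recognized as that message, while the second token is far from every template and is rejected.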
One issue of concern with respect to conventional speaker dependent speech recognizers is the training process. A speaker dependent speech recognizer of the type described above offers a user only the ability to train it for the subsequent recognition of voice messages selected by the user. A high quality of training is therefore a necessary condition for achieving a desirable recognition accuracy when the recognizer is subsequently used for recognition.
In general, the training of a speech recognizer to recognize a particular voice message is accomplished by a training algorithm incorporated in the speech recognizer. The amount of training data received from a user for a particular voice message depends on how many times the user is required to repeat the voice message before it becomes a trained voice message. This number of repetitions is usually a compromise between the desire to have as many repetitions as necessary for reliable training and the requirement to make training a quick and easy procedure for the user. In general, it is not desirable to leave the training data utilized by the training algorithm, e.g., the content or quality of speech supplied for training voice messages, solely to the discretion of the user. Rather, it is advantageous for the training data required from a user to be determined in a manner that ensures a high quality of training. It is a shortcoming of currently available training algorithms for speaker dependent, isolated word speech recognizers that the training usually takes place in an unsupervised environment, so that only the user has control over the training data used to train the recognizer to recognize a particular voice message.
There are two problems which significantly affect the quality of training, and therefore, affect the subsequent recognition accuracy. One problem is the confusability of similar sounding voice messages that have different meanings. This naturally presents problems during subsequent recognition. A second problem arises when significantly different pronunciations are used by the user for the same voice message. When the training algorithm leaves the training data too much within the user's control, these problems are often not properly accounted for in the training process.
In particular, a training algorithm will often require a user to make more than one (e.g., two) template for each voice message to be recognized. When the user is in control of the training process, the user (who does not fully appreciate how the recognition algorithms work) will often use slightly different wording or different pronunciations for the same voice message. For example, in an application where a phone call is activated by a voice message, the user may make one template corresponding to "call mom" and a second template corresponding to "call mother". When two templates are formed in this manner during training, there will be problems in the subsequent recognition process: each of these utterances will be represented by only one template, and a high level of recognition accuracy cannot be guaranteed.
In view of the foregoing, it is an object of the invention to provide a training process or algorithm for a speech recognizer, in particular a speaker dependent isolated word speech recognizer. It is a further object of the invention to provide a training process or algorithm which introduces a certain level of control over the training so that complete discretion is not left to the user. More specifically, it is an object of the invention to provide a training process for a speaker dependent isolated word speech recognizer which
(a) prevents a user from adding a new voice message to the voice messages which can be recognized when the new voice message is identical or very similar to a voice message which can already be recognized, and
(b) prevents a user from using different wordings when training a voice message with a particular meaning and requires a substantially consistent pronunciation for the formation of different templates corresponding to the same voice message.
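The two objects (a) and (b) can be illustrated with a minimal sketch of a supervised training check. This is an assumption-laden illustration, not the training process of the invention: tokens are treated as fixed-length feature vectors, similarity is Euclidean distance, and the function name check_training and the two thresholds are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def check_training(new_tokens, vocabulary, confusion_threshold,
                   consistency_threshold):
    """Classify a training attempt as "ok", "inconsistent", or "confusable".

    new_tokens: the user's repetitions of the new voice message.
    vocabulary: already trained messages mapped to their templates.
    """
    # (b) every pair of repetitions must be substantially consistent
    for i in range(len(new_tokens)):
        for j in range(i + 1, len(new_tokens)):
            if distance(new_tokens[i], new_tokens[j]) > consistency_threshold:
                return "inconsistent"
    # (a) the new message must not be too close to a trained message
    for vectors in vocabulary.values():
        for vec in vectors:
            for tok in new_tokens:
                if distance(tok, vec) < confusion_threshold:
                    return "confusable"
    return "ok"


vocabulary = {"call mom": [[1.0, 0.0], [0.9, 0.1]]}

accepted = check_training([[0.0, 1.0], [0.1, 0.9]], vocabulary, 0.3, 0.3)
mixed = check_training([[0.0, 1.0], [1.0, 1.0]], vocabulary, 0.3, 0.3)
duplicate = check_training([[0.95, 0.05], [0.92, 0.08]], vocabulary, 0.3, 0.3)
```

In this sketch a training attempt is accepted only when its repetitions agree with each other and none of them falls near an existing template. A "call mom" / "call mother" pair would be expected to fail the consistency check, provided the features separate the two wordings by more than the chosen threshold.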