Speech recognition has been limited, even in the face of substantial progress, by the accuracy of such recognition, the speed of recognition and by the resources needed to perform such recognition. An aspect of the accuracy has been the errors due to the variation of language and pronunciation of the speakers, i.e. the dialects. Another aspect of the accuracy has been the errors due to variability in the apparatus by which speech is captured and transmitted, i.e. the channels (including noise). One solution is to have many models for as many dialects as possible. But with more models the recognition is slower and more resources, e.g. memory, computer speed and power, are needed.
Although at present it is possible to design systems that perform with nearly 100% accuracy in real time for one or a few speakers, a persistent object in the art is to provide high accuracy recognition of utterances made in dissimilar manners, under dissimilar circumstances or surroundings, or with dissimilar electronics (microphones, computers, etc.).
A paper entitled "State of the Art in Continuous Speech Recognition," authored by John Makhoul and Richard Schwartz, was published in the Proceedings of the National Academy of Sciences, USA, Vol. 92, pp. 9956-9963, October 1995. This paper is hereby incorporated by reference herein as if laid out in full. The authors wrote the paper under the auspices of BBN Systems and Technology, Cambridge, Mass., the assignee of the present patent. The paper discloses three major factors in speech recognition: linguistic variability, speaker variability and channel variability. Channel variability includes the effects of background noise and the transmission apparatus, e.g. microphone, telephone, echoes, etc. The paper discusses the modelling of linguistic and speaker variations. An approach to speech recognition is to use a model, a logical finite-state machine where transitions and outputs are probabilistic, to represent each of the groups of three (or two) phonemes found in speech. The models may have the same structure but the parameters in the models are given different values. Each such model is a hidden Markov model (HMM). The HMM is a statistical construct that is discussed in detail in the above paper and the references listed therein, and is not described in depth herein. FIG. 5 of the paper, reproduced herein as prior art FIG. 1, describes an approach to speech recognition. The system is trained by actual speakers articulating words continuously. The audio signal is processed and features are extracted. The signal is often smoothed by filtering, in hardware or in software (if digitized and stored), followed by mathematical operations on the resulting signal to form features, which are computed periodically, say every 10 milliseconds or so. Continuous speech is marked by sounds or phonemes that are connected to each other.
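The periodic computation of features described above can be illustrated with a minimal sketch. It is not taken from the Makhoul and Schwartz paper. The function name `frame_features`, the 8 kHz sample rate, the 20 millisecond analysis window, and the choice of log energy and zero-crossing count as features are all assumptions made for illustration; only the 10 millisecond frame period comes from the text.

```python
import math

def frame_features(samples, sample_rate=8000, hop_ms=10, win_ms=20):
    """Return one (log_energy, zero_crossings) pair per analysis frame.

    Assumed values: 8 kHz sampling and a 20 ms window. The 10 ms hop
    follows the "every 10 milliseconds or so" period in the text.
    """
    hop = sample_rate * hop_ms // 1000   # samples between frame starts
    win = sample_rate * win_ms // 1000   # samples per frame
    features = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        # Mean squared amplitude, a rough stand-in for "volume".
        energy = sum(x * x for x in frame) / win
        log_energy = math.log(energy + 1e-12)  # small offset avoids log(0)
        # Count sign changes between neighbouring samples.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((log_energy, zero_crossings))
    return features

# 100 ms of a 440 Hz tone sampled at the assumed 8 kHz rate.
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
feats = frame_features(signal)
```

Each element of `feats` is one small feature vector. A practical front end would retain more coefficients per frame, but the framing logic would be similar.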
The two adjacent phonemes on either side of a given phoneme have a major effect, referred to as co-articulation, on the articulation of the center phoneme. Triphoneme is the name given to the different articulation of a given phoneme due to the effects of these side phonemes. The continuous speech is divided into discrete transformed segments that facilitate the several mathematical operations. Many types of features have been used, including time and frequency masking, and the taking of inverse Fourier transforms resulting in a mathematical series of which the coefficients are retained as a feature vector. The features are handled mathematically as vectors to simplify the training and recognition computations. Other features may include volume, frequency range, and amplitude dynamic range. Such use of vectors is well known in the art, and reference is found in the Makhoul and Schwartz paper on page 9959 et seq. The spoken words used in the training are listed in a lexicon and a phonetic spelling of each word is formed and stored. Phonetic word models using HMMs are formed from the lexicon and the phonetic spellings. These HMM word models are iteratively compared to the training speech to maximize the likelihood that the training speech was produced by these HMM word models. The iterative comparing is performed by the Baum-Welch algorithm, which is guaranteed to converge to a local optimum. This algorithm is well known in the art as referenced in the Makhoul and Schwartz paper on page 9960. A grammar is established and, with the lexicon, a single probabilistic grammar for the sequences of phonemes is formed. The result of the recognition training is that a particular sequence of words will correspond with a high probability to a recognized sequence of phonemes. Recognition of unknown speech begins with extracting the features as in the training stage.
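The iterative likelihood maximization attributed above to the Baum-Welch algorithm can be sketched as follows. This is a simplified illustration, not the procedure of the Makhoul and Schwartz paper. It assumes a two-state HMM with discrete output symbols (for example, feature vectors already quantized to a small codebook), a single short training sequence, and no numerical scaling. All names and starting values are hypothetical.

```python
def forward(obs, pi, A, B):
    """alpha[t][i]: probability of obs[0..t] with the model in state i at t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation pass; also returns the likelihood under the old model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    like = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, like

# Hypothetical training sequence of codebook symbols and starting parameters.
obs = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

likelihoods = []
for _ in range(5):
    pi, A, B, like = baum_welch_step(obs, pi, A, B)
    likelihoods.append(like)
```

The values collected in `likelihoods` do not decrease from one pass to the next, which is the sense in which the procedure converges to a local optimum. A practical trainer would use scaled or log-domain arithmetic and many training utterances.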
All word HMM model sequences allowed by the grammar are searched to find the word (and therefore the triphoneme) sequence with the highest probability of generating that particular sequence of feature vectors. Prior art improvements have included development of large databases with large vocabularies of speaker independent continuous speech for testing and development. Contextual phonetic models have been developed, and improved recognition algorithms have been and are being developed. Probability estimation techniques have been developed and language models are being improved. In addition, computers with increased speed and power combined with larger, faster memories have improved real time speech recognition. It has been found that increased training data reduces recognition errors, and tailored speaker dependent training can produce very low error rates.
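The search for the sequence with the highest probability of generating a given sequence of feature vectors is commonly carried out by dynamic programming. The sketch below uses the Viterbi algorithm, which is named here as an illustrative choice and is not stated in the text above. The two state labels, the three-symbol codebook, and all probabilities are hypothetical, and the grammar and word models are reduced to a single small state graph.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state preceding s on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        col = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            col[s] = (p * emit_p[s][o], back)
        best.append(col)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, prob

# Hypothetical two-state model over a three-symbol codebook.
states = ("ah", "s")
start_p = {"ah": 0.6, "s": 0.4}
trans_p = {"ah": {"ah": 0.7, "s": 0.3},
           "s": {"ah": 0.4, "s": 0.6}}
emit_p = {"ah": {0: 0.5, 1: 0.4, 2: 0.1},
          "s": {0: 0.1, 1: 0.3, 2: 0.6}}

path, prob = viterbi([0, 1, 2, 2], states, start_p, trans_p, emit_p)
```

Because only the best path into each state is kept at every frame, the cost grows linearly with the number of frames rather than with the number of possible sequences.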
The portion of the Makhoul and Schwartz paper on page 9962, labeled Adaptation, describes incremental improvements in speaker independent systems. The problems and limitations of a new dialect are addressed. The paper states that "incremental adaptation could require hours . . . and . . . patience . . . before the performance becomes adequate." It is suggested, in the next paragraph, that a short training session may be used to transform an existing model for a new speaker. The improvement is not quantified, however. Even with all the advances in speech recognition, limitations on computing resources persist: multiple models still consume large amounts of memory, fast, powerful computers are needed to recognize continuous, real time speech, and different dialects remain a problem. The present invention is directed to these problems.
It is an object of the present invention to provide a speech recognition system with a large number of models for dialects and/or channels (hereinafter "channel" is inclusively defined as the audio path, speech impediments, noise and acoustic surroundings, the electronics, wireless paths, wire paths) without a proportional increase in computing power and memory.
An object of the present invention is to provide models for speech dialects and/or channels which produce high accuracy in real time for a large variety of speakers.
It is yet another object of the present invention to provide an automatic selection of the best model for use with different speakers in real time.
It is another object of the present invention to provide a more accurate speech recognition system for discontinuous and continuous speech.