This invention relates to speech recognition systems in general and more particularly to a speech recognition system employing templates wherein each of said templates is generated by the selective addition of noise to increase the probability of speech recognition.
As one can ascertain, the art of speech recognition has developed greatly in the last few years, and speech recognition systems have been employed in many forms. Speech recognition rests on the premise that the information contained in a spoken sound can be utilized directly to activate a computer or other means. Essentially, the prior art understood that a key element in recognizing information in a spoken sound is the distribution of the energy with respect to frequency. The formant frequencies, those at which the energy peaks, are particularly important. The formant frequencies are the acoustic resonances of the mouth cavity and are controlled by the tongue, jaw and lips. For a human listener the determination of the first two or three formant frequencies is usually enough to characterize vowel sounds. Accordingly, machine recognizers of the prior art included some means of determining the amplitude or power spectrum of the incoming speech signal. This first step of speech recognition is referred to as preprocessing, as it transforms the speech signal into features or parameters that are recognizable and reduces the data flow to manageable proportions. One means of accomplishing this is the measurement of the zero crossing rate of the signal in several broad frequency bands to give an estimate of the formant frequencies in these bands.
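By way of illustration only, the zero crossing measurement mentioned above can be sketched as follows. The sketch assumes a single band that has already been isolated by filtering (the band-splitting filters are omitted), and the function names, the 8000 Hz sample rate and the 500 Hz test tone are hypothetical choices made for the example rather than details of any prior art system.

```python
import math


def zero_crossing_rate(samples, sample_rate):
    """Return the number of sign changes per second in a sampled signal."""
    crossings = 0
    for prev, cur in zip(samples, samples[1:]):
        # Count a crossing whenever consecutive samples lie on opposite sides of zero.
        if (prev < 0 <= cur) or (prev >= 0 > cur):
            crossings += 1
    duration_seconds = len(samples) / sample_rate
    return crossings / duration_seconds


def estimate_frequency(samples, sample_rate):
    """Estimate the dominant frequency in a band from its zero crossing rate."""
    # A sinusoid crosses zero twice per cycle.
    return zero_crossing_rate(samples, sample_rate) / 2.0


# Hypothetical test signal: one second of a 500 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * n / sample_rate + 0.1) for n in range(sample_rate)]
estimated_hz = estimate_frequency(tone, sample_rate)
```

When the band is dominated by a single resonance, the estimate lies close to that resonance. This is why the measurement is taken separately in several broad bands, each expected to contain one formant.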
Another means is representing the speech signal in terms of the parameters of the filter whose spectrum best fits that of the input speech signal. This technique is known as linear predictive coding (LPC). LPC has gained popularity because of its efficiency, accuracy and simplicity. The recognition features extracted from speech are typically averaged over 10 to 40 milliseconds and then sampled 50 to 100 times per second.
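The following is a minimal sketch of these two steps, offered as an illustration and not as the method of any particular recognizer. It assumes the autocorrelation method of LPC solved by the Levinson-Durbin recursion, a 20 millisecond frame advanced every 10 milliseconds (100 frames per second), and no windowing. The function names and the synthetic two-pole test signal are hypothetical.

```python
def split_frames(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Cut a signal into overlapping frames of frame_ms, advanced every hop_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    starts = range(0, len(samples) - frame_len + 1, hop)
    return [samples[s:s + frame_len] for s in starts]


def autocorrelation(frame, max_lag):
    """Return the autocorrelation of a frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Return predictor coefficients a[1..order] with x[n] ~ sum(a[k] * x[n-k]),
    together with the residual prediction error energy."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this stage of the Levinson-Durbin recursion.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error


# Hypothetical test signal: impulse response of a two-pole resonator,
# x[n] = 1.3 * x[n-1] - 0.8 * x[n-2] + impulse.
signal = []
for n in range(400):
    x = 1.0 if n == 0 else 0.0
    if n >= 1:
        x += 1.3 * signal[n - 1]
    if n >= 2:
        x -= 0.8 * signal[n - 2]
    signal.append(x)

coeffs, residual = lpc(signal, 2)      # recovers approximately [1.3, -0.8]
frames = split_frames(signal, 8000)    # 160-sample frames every 80 samples
```

In practice each frame would be windowed and analyzed separately, yielding one coefficient vector per frame. The whole signal is analyzed at once here only so that the recovered coefficients can be compared with the known filter.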
The parameters used to represent speech for recognition purposes may be directly or indirectly related to the amplitude or power spectrum. Formant frequencies and linear predictor filter coefficients are examples of parameters indirectly related to the speech spectrum. Other examples are cepstral parameters and log-area ratio parameters. In these and most other cases the speech parameters used in recognition are, or can be, derived from spectral parameters. This invention is related to the selective addition of noise to spectral parameters in generating speech recognition parameters. It applies to all forms of speech recognition that use speech parameters which are, or can be, derived from spectral parameters.
In any event, one of the most popular approaches to speech recognition in the past has been the use of templates to provide matching. In this approach words are typically represented in the form of parameter sequences. Recognition is achieved by using a predefined similarity measure to compare the unknown token against stored templates. In many cases, time alignment algorithms are used to account for variability in the rate of production of words. Template matching systems can thus achieve high performance with a small set of acoustically distinct words. Some researchers, however, have questioned the ability of such systems ultimately to make fine phonetic distinctions across a wide range of talkers. See for example the article entitled "Performing Fine Phonetic Distinctions: Templates versus Features" by R. A. Cole, R. M. Stern and M. J. Lasry, in "Variability and Invariance in Speech Processes", J. S. Perkel and D. H. Klatt, editors, Hillsdale, N.J.: Lawrence Erlbaum Associates, 1985.
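A minimal sketch of template matching with time alignment follows. It assumes dynamic time warping with a Euclidean frame distance as the similarity measure, which is one common choice and is not asserted to be the measure used by any particular prior art system. The one-dimensional parameter frames and the word labels are hypothetical.

```python
def frame_distance(a, b):
    """Euclidean distance between two parameter frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_distance(token, template):
    """Accumulated distance along the best time alignment of token and template."""
    inf = float("inf")
    n, m = len(token), len(template)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(token[i - 1], template[j - 1])
            # Allow a frame to be repeated on either side, or both to advance.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(token, templates):
    """Return the label of the stored template closest to the unknown token."""
    return min(templates, key=lambda word: dtw_distance(token, templates[word]))


# Hypothetical stored templates, one parameter per frame for brevity.
templates = {
    "yes": [[1.0], [3.0], [1.0]],
    "no": [[2.0], [2.0], [0.0]],
}
# The same pattern as "yes", spoken more slowly.
slow_token = [[1.0], [1.0], [3.0], [3.0], [1.0]]
best_word = recognize(slow_token, templates)
```

Because the alignment may repeat frames, the slower token matches the "yes" template with zero accumulated distance even though the two sequences differ in length.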
Thus, as an alternative, many have proposed a feature-based approach to speech recognition in which one must first identify a set of acoustic features that capture the phonetically relevant information in the speech signal. With this knowledge, algorithms can be developed to extract the features from the speech signal. A classifier is then used to combine the features and arrive at a recognition decision. It is argued that a feature-based system is better able to perform fine phonetic distinctions than a template matching scheme and thus is inherently superior. Template matching, however, is a general technique often used in pattern recognition, whereby an unknown is compared to prototypes in order to determine which one it most closely resembles.
By this definition, feature-based speech recognizers that use multivariate Gaussian models for classification also perform template matching. In this case, the statistical classifier merely uses a feature vector as a pattern. Similarly, if one regards spectrum amplitude and LPC coefficients as features, then spectrum-based techniques are feature-based as well.
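The point can be made concrete with a short sketch. It assumes, as a simplification, Gaussian models with diagonal covariance, so that each class is described by a mean vector and a variance vector. The two-formant feature vectors, the class labels and the numerical values are hypothetical and serve only to illustrate the comparison of an unknown against stored prototypes.

```python
import math


def log_likelihood(x, mean, variance):
    """Log density of feature vector x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        (xi - mi) ** 2 / vi + math.log(2 * math.pi * vi)
        for xi, mi, vi in zip(x, mean, variance)
    )


def classify(x, models):
    """Return the label whose Gaussian prototype best explains x."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))


# Hypothetical prototypes: (mean, variance) of the first two formants in Hz.
models = {
    "ee": ([270.0, 2290.0], [50.0 ** 2, 200.0 ** 2]),
    "ah": ([730.0, 1090.0], [80.0 ** 2, 150.0 ** 2]),
}
label = classify([300.0, 2200.0], models)
```

The mean vector plays the role of the stored template, and the variance-weighted squared difference plays the role of the distance measure.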
In practice, template matching and feature-based systems represent different points along a continuum. One of the most serious problems with the template matching approach is the difficulty of defining distance measures that are sensitive enough for fine phonetic distinctions but insensitive to irrelevant spectral changes.
One manifestation of this problem is the excessive weight given to unimportant frame-to-frame variations in the spectrum of a long steady-state vowel. The prior art, aware of such problems, has proposed a number of distance metrics that are intended to be sensitive to phonetic distances yet insensitive to irrelevant acoustic differences. See for example an article entitled "Prediction of Perceived Phonetic Distance from Critical Band Spectra" by D. H. Klatt, published in the Proceedings of ICASSP-82, IEEE Catalog No. CH1746-7, pages 1278-1281, 1982.
In any event, in order to gain a better understanding of speech communication systems, reference is made to Proceedings of the IEEE, November 1985, Volume 73, No. 11, pages 1537-1696. This issue of the Proceedings presents various papers regarding man-machine speech communication systems and gives good insight into the particular problems involved. As one will understand, a major aspect of any speech recognition system is the ability of the system to perform its allocated task, namely, to recognize speech in all types of environments.
As indicated, many speech recognition systems utilize templates. Essentially, such systems convert utterances into parameter sequences which are stored in the computer. Sound waves travel from a speaker's mouth through a microphone to an analog-to-digital converter, where they are filtered and digitized along with any background noise that may be present. The digitized signal is then further filtered and converted to recognition parameters, in which form it is compared with stored speech templates to determine the most likely choice for the spoken word. For further examples of such techniques, reference is made to the IEEE Spectrum, Vol. 24 No. 4, published April 1977. See an article entitled "Putting Speech Recognizers to Work", pages 55-57, by T. Wallich.
As one can ascertain from that article, the applications of speech recognition systems are constantly expanding, and many models are already available for the various applications indicated there. The formation of templates is also quite well known in the prior art. Such templates are employed with many different types of speech recognition systems. One particular type of system is known as a "key word recognition system", as described in the publication entitled "An Efficient Elastic-Template Method for Determining Given Words in Running Speech" by J. S. Bridle, "British Acoustical Society Spring Meeting", pages 1-4, April 1973. In this article the author discusses the derivation of elastic templates from a parametric representation of spoken examples of the key words to be detected. A similar parametric representation of the incoming speech is continuously compared with these templates to measure the similarity between the speech and the key words from which the templates were derived.
A word is determined by the recognizer to have been spoken when a segment of the incoming speech is sufficiently similar to the corresponding template. The word templates are termed "elastic" because they can be expanded and compressed in time to account for variations in the talking speed and local variations in the rate of word pronunciation.
Key word recognition is similar to conventional speech recognition. In the former, templates are stored only for "key" words to be recognized within a context of arbitrary words or sounds, whereas in the latter templates are stored for all the speech anticipated to be spoken. All such systems that employ templates, whether key word recognition systems or conventional speech recognition systems, encounter the same problem: the inability of the system to recognize the spoken word as uttered, for example, by different individuals or by the same individual under different conditions.
It is therefore an object of the present invention to provide apparatus and methods for an improved automatic speech recognition system.
It is a further object of the present invention to provide a speech recognition system which automatically adapts to a noisy environment.
As will be further understood from the appended specification, most speech recognition systems suffer from degraded operation in the presence of noise. This degradation is particularly severe when the templates have been derived from speech with little or no noise, or with noise of a different quality from that present when recognition is attempted. Previous methods of reducing this difficulty require the production of new templates in the presence of the new noise, which necessitates the collection of new speech and noise. In the present system, noise is added analytically to the templates, which improves the recognition probability and thereby substantially improves system performance, without requiring the collection of new speech for template generation.
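As a rough illustration of what an analytical addition of noise to a template might look like, the following sketch makes several assumptions. The template is stored as per-band power spectrum frames, the noise is uncorrelated with the speech so that powers add band by band, and the recognition parameters are log band powers in decibels. The rule by which noise is added selectively is not described in this section and is not modeled here. All names and values are hypothetical.

```python
import math


def add_noise_to_template(template_power, noise_power):
    """Add an estimated noise power spectrum to every frame of a clean template.

    Assumes speech and noise are uncorrelated, so band powers simply add.
    """
    return [[s + n for s, n in zip(frame, noise_power)] for frame in template_power]


def to_log_parameters(power_frames, floor=1e-10):
    """Convert band powers to decibels, as one possible recognition parameter."""
    return [[10.0 * math.log10(max(p, floor)) for p in frame] for frame in power_frames]


# Hypothetical clean template: two frames of three band powers each.
clean_template = [[100.0, 10.0, 1.0], [50.0, 20.0, 2.0]]
# Hypothetical noise estimate measured in the operating environment.
noise_estimate = [1.0, 1.0, 9.0]

noisy_template = add_noise_to_template(clean_template, noise_estimate)
noisy_log_parameters = to_log_parameters(noisy_template)
```

In this example the weak third band of the first frame rises from 1.0 to 10.0 in power, while the strong first band changes only from 100.0 to 101.0. Noise thus reshapes a template most where the speech energy is low, and no new speech is needed to produce the adapted template.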