1. Field of the Invention
The invention relates to systems and methods for speech recognition, and more particularly to systems for recognition of specific sounds corresponding to phonemes and transitions therebetween in ordinary spoken speech.
2. Description of the Prior Art
In recent years there has been a great deal of research in the area of voice recognition and speech recognition because there are numerous potential applications for a reliable, low cost voice recognition system or speech recognition system. A few types of voice recognition units are presently commercially available, costing in the range from $10,000 to $100,000 and having the capability of recognizing a limited number of isolated spoken words. A few systems have the capability of recognizing small groups of words spoken without pauses between words, as mentioned in the article "Voice-Recognition Unit For Data Processing Can Handle 120 Words," Electronics, Page 69, Apr. 13, 1978.
The present state of the art in this area is reviewed in "Speech Recognition by Machine: A Review," by D. Raj Reddy, Proceedings of the IEEE, Apr. 1, 1976, Pages 501-531. More detailed information on particularly relevant areas of speech recognition is provided in the following articles: "Algorithm for Pitch Extraction Using Zero-Crossing Interval Sequence," by Nezih C. Geckinli and Davras Yavuz, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-25, Number 6, December, 1977; "Continuous Speech Recognition by Statistical Methods," by Frederick Jelinek, Proceedings of the IEEE, Volume 64, Number 4, April, 1976; "Pseudo-Maximum-Likelihood Speech Extraction," by David H. Friedman, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-25, Number 3, June, 1977; "Practical Applications of Voice Input to Machines," by Thomas B. Martin, Proceedings of the IEEE, Volume 64, Number 4, April, 1976; "On the Use of Autocorrelation Analysis for Pitch Detection," by Lawrence R. Rabiner, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-25, Number 1, February, 1977; and "Communication Aids for People with Impaired Speech and Hearing," by A. F. Newell, Electronics and Power, October, 1977.
Prior systems and methods for speech recognition have been extremely complex and expensive because of the complexity of the processes involved in understanding human speech. Workers in the art have utilized various sources of knowledge that all people subconsciously use, including knowledge of a particular language, a particular environment, and the context of a particular communication in order to understand a sentence. These sources of knowledge include characteristics of speech sounds (phonetics), variability in pronunciations (phonology), the stress and intonation patterns of speech (prosodics), the sound patterns of words (lexicon), the grammatical structure of language (syntax), the meaning of words and sentences (semantics), and the context of conversation (pragmatics). Although the "programmed" computer-like mind of a mature human being is capable of processing all of these various sources of knowledge in order to recognize speech, the present state of the art requires tremendously expensive computer hardware, including large amounts of memory and software to store the data and algorithms necessary to achieve even limited understanding of isolated words and short groups of "connected" words.
The main problems involved in speech recognition include normalization of speech signals to compensate for amplitude and pitch variations in human speech, obtaining reliable and efficient parametric representations of speech signals for processing by digital computers, ensuring that the system can adapt to different speakers and/or new vocabularies, and determining the similarity of computed parameters of received speech with stored speech parameters. Known systems involve digitizing and analyzing incoming speech signals to obtain a parametric representation thereof. Various complex schemes have been devised for detecting the beginnings and ends of various sounds, words, etc. Techniques for normalizing with respect to amplitude and frequency to obtain a normalized pattern are known. In most known speech recognition systems, reference patterns are "learned," stored in computing systems, and compared to the normalized unknown signal patterns. When a match is found between such unknown and stored signal patterns, output signals are produced, which signals cause printing, display or other electromechanical action representing the incoming speech.
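By way of illustration only, the comparison of a normalized unknown pattern against stored reference patterns may be sketched as follows. The sketch assumes that each pattern is a short list of numeric parameters, that amplitude normalization consists of scaling by the peak magnitude, and that similarity is measured by Euclidean distance against a fixed threshold. The function names, the reference values, and the threshold are hypothetical and are not taken from any of the prior systems discussed above.

```python
import math


def normalize_amplitude(pattern):
    """Scale a parameter vector so that its largest magnitude is 1.0."""
    peak = max(abs(x) for x in pattern)
    if peak == 0:
        return list(pattern)  # an all-zero pattern cannot be scaled
    return [x / peak for x in pattern]


def distance(a, b):
    """Euclidean distance between two equal-length patterns."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(unknown, references, threshold=0.5):
    """Return the label of the closest stored pattern, or None if
    no stored pattern lies within the (assumed) threshold."""
    normalized = normalize_amplitude(unknown)
    best_label, best_dist = None, float("inf")
    for label, ref in references.items():
        d = distance(normalized, normalize_amplitude(ref))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Hypothetical "learned" reference patterns, for illustration only.
REFERENCES = {
    "yes": [0.2, 0.9, 0.4, 0.1],
    "no": [0.8, 0.3, 0.1, 0.6],
}

# The same shape as "yes" spoken at twice the amplitude still matches,
# because both patterns are normalized before comparison.
result = recognize([0.4, 1.8, 0.8, 0.2], REFERENCES)
```

In this sketch the output signal is simply the returned label; in the known systems described above, a match instead triggers printing, display or other electromechanical action.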
The most common method of digitizing speech has been by means of pulse code modulation techniques, which divide an analog signal into a predetermined number of "segments." Previous systems typically filter the speech input into a relatively large number of channels to isolate the various frequency components, each of which is pulse code modulated. Each increment of each channel waveform requires a digital word to be stored, so large amounts of temporary memory storage and digital processing have been required. Specialized algorithms have been developed to recognize "formants" (which are spectral regions of high intensity sound) from the digital data obtained from the various frequency channels. These algorithms have been developed to recognize consonants, vowels, liquid consonants, and sharp transient sounds represented by such data. Statistical techniques have also been utilized to analyze the data obtained from the spectral filtering and pulse code modulation of the incoming speech signals.
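The storage burden of multi-channel pulse code modulation may be illustrated by the following sketch, which uniformly quantizes samples into digital words and estimates the number of bits needed when every increment of every channel is stored. The channel count, sampling rate, and word length used in the example are assumed figures chosen only for illustration; they do not describe any particular prior system.

```python
def pcm_encode(samples, bits=8):
    """Uniformly quantize samples in [-1.0, 1.0] to unsigned integer codes."""
    levels = 2 ** bits
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        # Map [-1.0, 1.0] onto [0, levels - 1] and round to the nearest code.
        codes.append(int((s + 1.0) / 2.0 * (levels - 1) + 0.5))
    return codes


def storage_bits(channels, sample_rate_hz, bits, seconds):
    """Bits needed to store one digital word per sample per channel."""
    return channels * sample_rate_hz * bits * seconds


# Three samples spanning the full input range, encoded as 8-bit words.
codes = pcm_encode([-1.0, 0.0, 1.0], bits=8)

# Assumed figures: 16 filter channels, 8000 samples per second, 8-bit words.
bits_per_second = storage_bits(channels=16, sample_rate_hz=8000, bits=8, seconds=1)
```

Under these assumed figures, one second of speech occupies 1,024,000 bits of temporary storage before any recognition processing begins, which indicates why such approaches have demanded large memories.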
The previous speech recognition systems and methods involve limited vocabularies, since the amount of computer hardware and software involved for recognition of large numbers of words and connections of words is prohibitive. This limitation requires substantially differently programmed machines for different applications, since the most commonly used words vary widely among different trades and professions.
In short, there is a great and presently unmet need for a reliable, flexible, and low-cost system and method for speech recognition.