The invention relates to systems and methods for speaker-independent continuous and connected speech recognition and characteristic sound recognition, and more particularly to systems and methods for dealing with both rapid and slow transitions between phonemes and characteristic sounds, for dealing with silence and distinguishing between certain closely related phonemes and characteristic sounds, and for processing the phonemic recognition in real time.
In recent years there has been a great deal of research in the area of voice recognition because there are numerous potential applications for a reliable, low-cost voice recognition system. The following references are illustrative of the state of the art:
______________________________________
USP 3278685   Harper      1966   IBM
USP 3416080   Wright      1966   SEC (UK)
USP 3479460   Clapper     1969   IBM
USP 3485951   Hooper      1969   Private
USP 3488446   Miller      1970   Bell
USP 3499989   Cotterman   1970   IBM
USP 3499990   Clapper     1970   IBM
USP 3573612   Scarr       1968   STC
USP 3603738   Fecht       1971   Philco
USP 3617636   Ogihara     1971   NEC (Japan)
USP 3646576   Griggs      1970   Private
USP 374214    Newman      1970   NRDC (UK)
USP 3770892   Clapper     1972   IBM
USP 3916105   McCray      1975   IBM
USP 3946157   Dreyfus     1976   Private
USP 4343969   Kellet      1982   Trans-Data Assoc
______________________________________
As mentioned in my prior U.S. Pat. No. 4,284,846, many problems still remain to be solved in speech recognition, including normalizing speech signals to compensate for amplitude and pitch variations in speech by different persons, obtaining a reliable and efficient parametric representation of speech signals for processing by digital computers, identifying and utilizing the demarcation points between adjacent phonemes, identifying the onset of each voiced pitch cycle, identifying very short duration phonemes, and ensuring that the speech recognition system can adapt to different speakers or new vocabularies.
The system described as a preferred embodiment in my previous U.S. Pat. No. 4,284,846 represents a major step forward in the evolution of speech and sound recognition systems, in that it shows that a system with very little hardware, including a single, low-cost integrated circuit microprocessor, can achieve real time recognition of spoken phonemes. Furthermore, the system described in that patent is relatively speaker-independent.
However, my subsequent research has shown that the system described in U.S. Pat. No. 4,284,846 requires more software than was originally expected for dealing with gaps of "silence" between phonemes in some ordinary speech. My subsequent research has also shown that more clues are necessary to reliably distinguish between certain closely related phonemes than is indicated in my U.S. Pat. No. 4,284,846. Furthermore, my subsequent research has shown that the considerable variation in the pitch of any typical person's normal speaking voice, and the effect upon the speech waveform of the configuration and position of the speaker's various "articulators", such as the size and shape of the mouth cavity, the size and shape of the nasal cavity, the size, shape, and position of the tongue, the size and position of the teeth, and the size and position of the lips, cause, in some cases, inaccuracies in the "characteristic ratios" described in my U.S. Pat. No. 4,284,846. This makes it more difficult to achieve a comprehensive, completely inclusive, speaker-independent phoneme recognition system than I previously thought to be the case.
Accordingly, there still remains an unmet need for a less expensive, more accurate, more reliable, more speaker-independent, and more pitch-independent voice recognition system than can be achieved by any device, system, or method disclosed in the prior art, other than my own prior patents, presently known to me.
In the area of speaker-dependent voice command recognition systems there are a number of devices presently available. They are capable of receiving, for example, simple word commands and producing corresponding digital command codes which are transmitted to a computer. Typically, such voice command systems must be "trained" to recognize particular command words spoken by a particular speaker. It should be appreciated that an average person cannot speak the same word in exactly the same way twice. In fact, there is a great variation in the speech waveforms produced when an average person tries to speak the same word a number of times. Present speaker-dependent voice command recognition systems are not capable of storing digitized speech waveform data for only one utterance of a particular command, and then later reliably recognizing the same word spoken by the same speaker. Therefore, the presently available systems are "trained" by instructing the speaker to speak the desired word into the system's microphone a number of times. The microphone signal for each repetition of the word is amplified and digitized, typically by using zero-crossing techniques, and sometimes by using analog-to-digital converters and processing the resulting digital output. Some of the available systems compare each stored version of that word with the digitized version of a later spoken utterance of that command word to try to match the spoken command with one of the stored versions of it. Various auto-correlation operations are performed to determine if there is a match. Other systems use various techniques to average the digitized data of the numerous utterances of the same command word received during the "training" session, and then compare a later spoken command with the stored averaged data in attempting to recognize the spoken command. Such prior systems are slow, expensive, and unreliable, and are not yet widely used.
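The "training" and matching scheme described above can be illustrated with a minimal sketch. It does not reproduce any particular prior system: the frame length, the mean-absolute-difference distance (used here in place of the auto-correlation operations mentioned above), the acceptance threshold, and all function names are assumptions chosen only for illustration. Each utterance is reduced to per-frame zero-crossing counts, the counts from several training utterances are averaged into one template per command, and a later utterance is matched against the stored templates.

```python
import math

def zero_crossing_counts(samples, frame_len):
    """Count sign changes within each consecutive frame of frame_len samples."""
    counts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        counts.append(crossings)
    return counts

def average_template(utterances):
    """Average equal-length feature vectors collected during training."""
    n = len(utterances)
    return [sum(column) / n for column in zip(*utterances)]

def distance(features, template):
    """Mean absolute difference (an assumed, simple similarity measure)."""
    return sum(abs(f - t) for f, t in zip(features, template)) / len(template)

def recognize(features, templates, threshold):
    """Return the closest command name, or None if no template is close enough."""
    best = min(templates, key=lambda name: distance(features, templates[name]))
    return best if distance(features, templates[best]) <= threshold else None

def tone(freq_hz, n=800, rate=8000):
    """Synthetic stand-in for a digitized utterance (hypothetical test signal)."""
    return [math.sin(2 * math.pi * freq_hz * i / rate + 0.1) for i in range(n)]

# "Train" two hypothetical commands from three slightly different repetitions each.
templates = {
    "low": average_template(
        [zero_crossing_counts(tone(f), 200) for f in (190, 200, 210)]
    ),
    "high": average_template(
        [zero_crossing_counts(tone(f), 200) for f in (980, 1000, 1020)]
    ),
}

# A later utterance is reduced to the same features and matched.
print(recognize(zero_crossing_counts(tone(205), 200), templates, threshold=5.0))
```

Even in this toy form, the sketch makes the limitation noted above visible: recognition depends on how closely a later utterance resembles the averaged training data for the same speaker.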