For many years there has been interest in the possibility of having a machine recognize speech. Discrete-speech recognition devices have been proposed which can recognize individual words separated by distinct periods of silence. These devices have the disadvantage of not being able to recognize naturally spoken continuous speech. Connected-speech recognition devices have been proposed which are intended to decipher words in a short phrase or sentence composed of a selected set of words. Most of these devices require that samples of the individual's speech be provided to the device in advance. All such devices are limited in the allowable number of words in the vocabulary if they are to operate in real time.
Heretofore, there has not been a device which could recognize the continuous speech of an arbitrary person speaking with virtually any accent and which is not limited as to the allowable number of words in the vocabulary. An essential feature of such a device is that it be able to recognize accurately and reliably the phonemes in continuous speech. Several methods have been proposed for identifying the phonemes in speech. Such methods have involved comparing a segment of the acoustic signal with reference templates or models of the phonemes in the language being spoken in order to determine a best match. Reference templates that yield highly accurate phoneme identifications have not been achieved because, on the one hand, it has been burdensome to implement the large number of individual templates required to represent the range of speech which occurs in a population; and, on the other hand, approximating that range with a manageable number of templates leads to unacceptably high rates of error in identification.
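By way of illustration only, the template-comparison approach described above may be sketched as follows. The sketch assumes that each segment of the acoustic signal has already been reduced to a short numeric feature vector and that the "best match" is the reference template at the smallest Euclidean distance; neither assumption is taken from any particular prior-art device. The names `best_match` and `TEMPLATES`, the phoneme labels, and the numeric values are hypothetical.

```python
import math

# Hypothetical reference templates: each phoneme label maps to one or more
# feature vectors. The values are invented for illustration only.
TEMPLATES = {
    "AA": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "IY": [[0.1, 0.9, 0.2]],
    "S": [[0.0, 0.2, 0.9]],
}


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(segment, templates):
    """Return (label, distance) for the template closest to the segment."""
    best_label, best_dist = None, math.inf
    for label, references in templates.items():
        for reference in references:
            dist = euclidean(segment, reference)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist


if __name__ == "__main__":
    label, dist = best_match([0.85, 0.15, 0.05], TEMPLATES)
    print(label, round(dist, 3))
```

In such a scheme every segment is compared against every stored template, so the cost of identification grows with the number of templates, while reducing the number of templates per phoneme leaves more speakers poorly represented.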
Vocabulary size has been another barrier to automatically transcribing continuous speech into text. In the prior art, the list of words and the order in which they are allowed to be spoken have been prescribed, because uncertainty in phoneme identification has been compensated for by the application of grammatical, semantic, and/or syntactic rules to assist in word identification. This approach has led to the employment of very cumbersome and time-consuming network calculations which jeopardize the ability to respond within 0.3 seconds, the generally accepted definition of real-time response.
Reference is made to David T. Griggs, U.S. Pat. No. 4,435,617, granted Mar. 6, 1984. This patent is similar to some of the prior efforts discussed above, in that the phonemes are intended to be recognized by analog circuits. In addition, the Griggs patent proposes the use of "syllabits," each of which consists of a consonant followed by a vowel, with 377 of these syllabits being employed to make the most likely words. However, this system suffers from the complications and problems mentioned hereinabove.
Accordingly, a principal object of the present invention is to provide real-time translation from speech to a natural language without the difficulties and problems outlined hereinabove.