It is generally recognized that man-machine interaction can be enhanced by the ability to communicate audibly, or orally. A variety of interfaces have been developed, including input devices which identify spoken words and output devices which produce synthesized speech. While significant advances have been made with regard to output devices, which respond to well-defined signals, input devices have posed more difficult problems.
Such input devices must convert spoken utterances, i.e. letters, words, or phrases, into the form of electrical signals, and must then process the electrical signals to identify the spoken utterances. By way of example: acoustic signals constituting spoken utterances may be sampled at fixed intervals; the pattern formed by a succession of sampled values may then be compared with stored patterns representing known spoken utterances; and the known spoken utterance represented by the stored pattern which matches the pattern of sampled values most closely is assumed to be the actual spoken utterance. The input devices which have already been proposed could, in theory, function with a high degree of reliability. However, in the present state of the art, they are operated by programs which entail long processing times that prevent useful results from being achieved in acceptably short time periods.
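By way of illustration only, the sample-and-compare procedure described above can be sketched as a nearest-pattern search. The sketch assumes that every pattern has already been reduced to the same number of sampled values and that closeness is measured by Euclidean distance; the function names, the distance measure, and the stored values are all hypothetical and are not taken from any particular input device.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length patterns of sampled values."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(samples, templates):
    """Return the label of the stored pattern that matches `samples` most closely."""
    return min(templates, key=lambda label: distance(samples, templates[label]))


# Hypothetical stored patterns representing known spoken utterances.
templates = {
    "yes": [0.1, 0.8, 0.6, 0.2],
    "no": [0.7, 0.3, 0.1, 0.0],
}

# Pattern formed by a succession of sampled values of an unknown utterance.
result = recognize([0.2, 0.7, 0.5, 0.2], templates)
```

Because every stored pattern is examined for every utterance, the processing time in such a scheme grows in proportion to the number of stored patterns.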
One commercially available program for recognizing spoken utterances is marketed by Lernout and Hauspie Speech Products U.S.A., Inc., of Woburn, Mass. under the product name CSR-1000 Algorithm. This company also offers a key word spotting algorithm under the product name KWS-1000 and a text-to-speech conversion algorithm under the product name TTS-1000. These algorithms are usable on conventional PCs having at least a high-performance 16-bit fixed- or floating-point DSP and 128 KB of RAM.
The CSR-1000 algorithm is supplied with a basic vocabulary of, apparently, several hundred words each stored as a sequence of phonemes. A spoken word is sampled in order to derive a sequence of phonemes. The exact manner in which such sequence of phonemes is processed to identify the spoken word has not been disclosed by the publisher of the program, but it is believed that this is achieved by comparing the sequence of phonemes derived from the spoken word with the sequence of phonemes of each stored word. This processing procedure is time consuming, which probably explains why the algorithm employs a vocabulary of only several hundred words.
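Since the actual comparison procedure has not been disclosed, the following is only a sketch of one plausible way in which a derived sequence of phonemes could be compared with the sequence stored for each word. It assumes a Levenshtein (edit) distance as the measure of similarity; the phoneme symbols, the three-word vocabulary, and the function names are hypothetical and do not represent the CSR-1000 algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # phoneme missing from b
                            curr[j - 1] + 1,      # extra phoneme in b
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def best_match(phonemes, vocabulary):
    """Compare `phonemes` with every stored word and return the closest word."""
    return min(vocabulary, key=lambda word: edit_distance(phonemes, vocabulary[word]))


# Hypothetical vocabulary: each word stored as a sequence of phonemes.
vocabulary = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "dog": ["D", "AO", "G"],
}

match = best_match(["K", "AE", "T"], vocabulary)
```

In this sketch every stored word is visited once per utterance, so the work grows with the size of the vocabulary, which is consistent with the observation above that such processing is time consuming.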
It would appear that the CSR-1000 algorithm could be readily configured to recognize individual spoken letters.
Speech recognition of large isolated word vocabularies of 30,000 words or more requires that the utterances be broken into phonemes or other articulatory events or, alternatively, that the user verbally spell the word, in which case the user's utterances of the letters of the alphabet are recognized by their phoneme content and the sequence of letters is then used to identify the unknown word. In any large vocabulary system, both methods are needed to ensure accuracy. The user would first attempt to have the word recognized by simply saying the word. If this were unsuccessful, the user would then have the option of spelling the word.
A problem occurs with spelling, however, because the letters of the English alphabet are not easily recognized by speech recognition systems. For example, almost all recognizers have trouble distinguishing the letter "B" from the letter "P", the letter "J" from the letter "K", the letter "S" from the letter "F" and so on. In fact, most of the alphabet consists of single syllable utterances which rhyme with some other utterance. Similarly, many phonemes which sound alike can be misrecognized. Clearly, it is necessary for a speech recognition system to deal with the errors caused by rhyming letters or phonemes.
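To make the nature of the problem concrete, the following sketch shows one simple way, offered purely as an assumption and not as a method described here, in which a spelled word could be matched while tolerating confusion between similar-sounding letters. Only the three letter pairs named above are used; the word list and the function names are hypothetical.

```python
# Letter pairs named in the text as difficult to distinguish.
CONFUSABLE_PAIRS = [("B", "P"), ("J", "K"), ("S", "F")]


def build_classes(pairs):
    """Map each letter to the set of letters it may be confused with."""
    classes = {}
    for a, b in pairs:
        group = classes.get(a, {a}) | classes.get(b, {b})
        for letter in group:
            classes[letter] = group
    return classes


CLASSES = build_classes(CONFUSABLE_PAIRS)


def could_be(recognized, actual):
    """True if hearing `recognized` is consistent with `actual` having been spoken."""
    return recognized == actual or actual in CLASSES.get(recognized, ())


def candidates(recognized_letters, vocabulary):
    """Return the words whose spelling is consistent with the recognized letters."""
    return [
        word for word in vocabulary
        if len(word) == len(recognized_letters)
        and all(could_be(r, a) for r, a in zip(recognized_letters, word))
    ]


# Hypothetical word list.
words = ["BAT", "PAT", "SAT", "FAT", "CAT"]
matches = candidates("PAT", words)
```

With this word list, the recognized spelling "PAT" remains consistent with both "BAT" and "PAT", which illustrates why a recognizer cannot rely on the letter sequence alone and needs some further means of choosing among the surviving candidates.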