This invention relates to speech recognition and, in particular, to speech recognition in the context of low-cost applications.
Generally, a speech recognition device analyzes an unknown audio signal to generate a pattern that contains the acoustically significant information in the utterance. This information typically includes the audio signal power in several frequency bands and the important frequencies in the waveform, each as a function of time. The power may be obtained by use of bandpass filters or fast Fourier transforms (FFTs). The frequency information may be obtained from the FFTs or by counting zero crossings in the filtered input waveform.
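The two measurements named above, band power and zero-crossing frequency, can be illustrated with a minimal sketch. It is not taken from any particular prior art device: the function names, the non-overlapping framing, and the use of a direct discrete Fourier transform in place of an FFT are all assumptions made for brevity.

```python
import cmath
import math


def frame_signal(samples, frame_len):
    """Split samples into consecutive, non-overlapping frames (assumed framing)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def band_power(frame, sample_rate, f_lo, f_hi):
    """Sum squared DFT magnitudes over bins whose frequency lies in [f_lo, f_hi).

    A direct DFT is used here for clarity; an FFT yields the same bins faster.
    """
    n = len(frame)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            power += abs(coeff) ** 2
    return power / n


def zero_crossings(frame):
    """Count sign changes between adjacent samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def zero_crossing_frequency(frame, sample_rate):
    """Estimate the dominant frequency: a sinusoid crosses zero twice per cycle."""
    duration = len(frame) / sample_rate
    return zero_crossings(frame) / (2.0 * duration)
```

Applying `band_power` and `zero_crossing_frequency` to each frame returned by `frame_signal` gives both quantities as a function of time, which is the form of pattern described above. The zero-crossing estimate is only meaningful when the frame has first been bandpass filtered so that one frequency component dominates.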
There are several dimensions along which speech recognition devices can be classified according to their mode of operation. One dimension divides recognizers into those that attempt to recognize unknown phonemes or words and those that attempt sentence recognition. To recognize sentences, a typical prior art technique first analyzes the patterns generated from the input speech waveform to produce a string of words or phonemes. This data is combined with linguistic information--contextual, lexical, syntactic, semantic, etc.--to generate the most likely sentence.
Another dimension distinguishes speech recognizers according to whether they are speaker dependent or speaker independent. In the former case, the recognizer is trained on the user's voice, while in the latter, no such training is required. Although speaker dependent recognition generally produces better results, this improvement is paid for in the cost of the device and the complexity of its use. A major component of the increased cost of a speaker dependent recognition system is the random access memory required to store the user's training output.
Yet another dimension along which speech recognizers can be classified is the recognition algorithm. Typical known algorithms include template matching, in which the unknown pattern is compared with stored reference patterns; neural networks; and hidden Markov models. These approaches may be used alone or in combination.
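Template matching, the first of the algorithms listed above, can be sketched as a nearest-reference search. The sketch is illustrative only. It assumes that each pattern has already been reduced to a fixed-length feature vector and that Euclidean distance is the similarity measure; neither assumption comes from the description above, and practical recognizers often align patterns in time before comparing them. The names `distance` and `match_template` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_template(unknown, templates):
    """Return (label, distance) for the reference pattern closest to `unknown`.

    `templates` maps a word label to its stored reference pattern.
    """
    best_label, best_dist = None, math.inf
    for label, reference in templates.items():
        d = distance(unknown, reference)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```

For example, with reference patterns stored for "yes" and "no", `match_template(pattern, templates)` returns the label whose reference lies nearest the unknown pattern. A rejection threshold on the returned distance could be added so that utterances far from every reference are not forced into a match.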
None of the speech recognizers described above produces reliable, high-accuracy recognition at a cost sufficiently low for widespread incorporation in toys, educational learning aids, inexpensive consumer electronic products, and the like. This is because these devices generally require a digital signal processor, large amounts of random access memory (RAM), or both, the cost of either of which largely excludes the product from these markets.
An additional source of cost in present-day technology is that many speech recognition applications also require speech synthesis. In the prior art, separate electronics are provided to implement speech synthesis and speech recognition. Furthermore, particularly in consumer devices, all of the speech synthesis and recognition electronics are separate from the electronics used to control the remaining functions of the device.
What is needed is an inexpensive, reliable speech recognition device suitable for consumer applications. The device should also incorporate speech synthesis capabilities and control of other functions without the addition of substantial extra hardware.