For many years, scientists have been trying to find a means to simplify the interface between man and machine. Input devices such as the keyboard, mouse, touch screen, and pen are currently the most commonly used tools for implementing a man/machine interface. However, a simpler and more natural interface between man and machine may be human speech. A device which automatically recognizes speech would provide such an interface.
Potential applications for an automated speech-recognition device include a database query technique using voice commands, voice input for quality control in a manufacturing process, a voice-dial cellular phone which would allow a driver to focus on the road while dialing, and a voice-operated prosthetic device for the physically disabled.
Unfortunately, automated speech recognition is not a trivial task. One reason is that speech tends to vary considerably from one person to another. For instance, the same word uttered by several persons may sound significantly different due to differences in accent, speaking speed, gender, or age. In addition to speaker variability, co-articulation effects, speaking modes (shout/whisper), and background noise present enormous problems to speech-recognition devices.
Since the late 1960s, various methodologies have been introduced for automated speech recognition. While some methods are based on extended knowledge with corresponding heuristic strategies, others rely on speech databases and learning methodologies. The latter methods include dynamic time-warping (DTW) and hidden-Markov modeling (HMM). Both of these methods, as well as the use of time-delay neural networks (TDNN), are discussed below.
Dynamic time-warping is a technique which uses an optimization principle to minimize the errors between an unknown spoken word and a stored template of a known word. Reported data show that the DTW technique is very robust and produces good recognition. However, the DTW technique is computationally intensive, which makes it impractical to implement for real-world applications.
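The template comparison described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each word has already been reduced to a short sequence of scalar feature values; the names dtw_distance, recognize, and templates are hypothetical and do not come from any particular system. The nested loop over both sequences also suggests why the computational cost grows quickly with utterance length and vocabulary size.

```python
def dtw_distance(a, b):
    """Accumulated alignment error between two 1-D feature sequences.

    Illustrative assumption: features are plain numbers and the local
    error is their absolute difference.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated error aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local_error = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = local_error + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(unknown, templates):
    """Return the label of the stored template with the smallest DTW error."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))
```

In this sketch every unknown word is compared against every stored template, and each comparison fills a table whose size is the product of the two sequence lengths.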
Instead of directly comparing an unknown spoken word to a template of a known word, the hidden-Markov modeling technique uses a stochastic model for each known word and computes, for each model, the probability that the model generated the unknown word. When an unknown word is uttered, the HMM technique checks the sequence (or state) of the word and finds the model that provides the best match. The HMM technique has been used successfully in many commercial applications; however, it has many drawbacks. These drawbacks include an inability to differentiate acoustically similar words, a susceptibility to noise, and computational intensiveness.
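The model-scoring step described above can be sketched with the standard forward algorithm. The sketch assumes, for illustration only, that the utterance has been quantized into a sequence of discrete symbol indices and that each word model is a tuple of start, transition, and emission probabilities; the names forward_likelihood, best_model, and models are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability that a discrete HMM generated the observation sequence.

    start[s]    : probability of beginning in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def best_model(obs, models):
    """Return the word whose model assigns the highest probability to obs."""
    return max(models, key=lambda word: forward_likelihood(obs, *models[word]))
```

A practical system would typically work with log probabilities to avoid numerical underflow on long utterances; that refinement is omitted here for brevity.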
Recently, neural networks have been used for problems that are highly unstructured and otherwise intractable, such as speech recognition. A time-delay neural network is a type of neural network which addresses the temporal effects of speech by adopting limited neuron connections. For limited word recognition, a TDNN shows slightly better results than the HMM method. However, a TDNN suffers from some serious drawbacks.
First, the training time for a TDNN is very lengthy, on the order of several weeks. Second, the training algorithm for a TDNN often converges to a local minimum, which is not the optimum solution. The optimum solution would be a global minimum.
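The distinction between a local and a global minimum can be illustrated with a toy example. The sketch assumes a gradient-descent style of training, which is an assumption made here for illustration, and uses an arbitrary one-dimensional error function in place of a real network's error surface; loss, grad, and descend are hypothetical names.

```python
def loss(x):
    """Toy error surface with two valleys of different depth."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Derivative of the toy error surface."""
    return 4 * x**3 - 6 * x + 1


def descend(x, rate=0.01, steps=2000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= rate * grad(x)
    return x


# Starting on the right settles in the shallower valley (a local minimum
# near x = 1.13); starting on the left reaches the deeper valley (the
# global minimum near x = -1.30).
stuck = descend(2.0)
best = descend(-2.0)
```

Both runs stop where the gradient vanishes, but only one of them reaches the lowest error; which valley is found depends entirely on the starting point.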
In summary, the drawbacks of known methods of automated speech recognition (e.g., algorithms requiring impractical amounts of computation, limited tolerance to speaker variability and background noise, and excessive training time) severely limit the acceptance and proliferation of speech-recognition devices in many potential areas of utility.
There is thus a significant need for an automated speech-recognition system which provides a high level of accuracy, is immune to background noise, does not require repetitive training or complex computations, converges to a global minimum rather than a local one, and is insensitive to differences among speakers.