This invention relates to a system for pattern recognition that is particularly useful for recognizing speech and handwriting.
Speech recognition may be defined as the extraction of a sequence of symbols from a continuous stream of auditory data. In handwriting recognition, such a sequence of symbols is extracted from visual data. By using this invention, a machine can recognize speech or handwriting as competently as humans do.
Personal experience and linguistic theory both give us the strong impression that certain identifiable sounds, which we call phonemes, form the basis of the symbolic structure that is language. The use of phonetic transcription of speech is not restricted to academic purposes. It is pervasively practised in stenographic reporting in courts of law and congressional hearings.
However, machine recognition of phonemes is notoriously difficult. Contextual effects known as coarticulation, together with variability within one speaker and from speaker to speaker, lead to multiple manifestations of the same underlying phoneme. Nevertheless, humans and, for that matter, animals such as dogs and cats have little difficulty in recognizing spoken sounds. Furthermore, humans are able to do so against a background of noise or distraction, an ability commonly referred to as the cocktail party effect.
In the past few years, one approach, referred to as Hidden Markov Models (HMM), has been successful in dealing with the variability in speech. This approach has achieved high performance in terms of accuracy, but requires extensive modeling of the particular language being used. In essence, the approach compensates for an inadequate representation of individual speech events by astutely guessing the best match over a sequence of such events. It captures the statistical properties of sequences of events in the parameters of the models. It achieves its high performance through a thorough statistical accounting of the particular body of speech data used to train the models and an efficient method of searching through a large number of hypotheses for the best match. Because the models are tied so closely to their training data, performance degrades rapidly in the face of noise, novel speech or non-speech sounds, that is, in real-life environments. The problem of novel speech may be addressed by a more comprehensive set of models, but this entails an ever greater demand for searching.
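By way of illustration only, the search for the best match over a sequence of events described above can be sketched with the Viterbi algorithm, a standard dynamic-programming search used with Hidden Markov Models. The two-state model, its probabilities, the symbol names, and the function name `viterbi` are hypothetical, chosen for this example and not taken from any particular recognizer.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, path) of the most likely hidden-state sequence.

    Assumes every probability used is non-zero, so math.log is defined.
    """
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that gives the highest score into s.
            score, path = max(
                ((best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states),
                key=lambda t: t[0],
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

# Hypothetical toy model: two phoneme-like states emitting two acoustic symbols.
states = ["a", "b"]
start_p = {"a": 0.6, "b": 0.4}
trans_p = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
emit_p = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}

log_p, path = viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_p))
```

In this sketch the cost of the search grows with the number of states squared for every observed event, which is one way to see why a more comprehensive set of models entails a greater demand for searching.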
In contrast, the neural network approach, the only other significant approach, is tolerant of noise, generalizes well, and requires a relatively modest amount of computational power in actual performance. This approach has achieved results superior to those of HMM in the recognition of isolated phonemes from one speaker, but has difficulty with fluent speech and true speaker independence.
It is therefore desirable to provide a pattern recognition system in which the difficulties of both approaches described above are overcome.