1. Field of the Invention
This invention relates to a recognition system and in particular to systems for the recognition of waveforms. Particular applications of such systems include the recognition of speech waveforms or waveforms arising from any other physical process. Throughout this specification particular reference will be made to speech recognition systems. However, the present invention is equally applicable to other recognition problems.
2. Description of the Related Art
In conventional waveform recognition systems the waveform is first segmented into time frames, and various pre-processing steps are carried out before actual recognition takes place. This is shown, for example, in U.S. Pat. No. 4,400,828, in which feature signals are segmented, normalized and warped to a standard duration before recognition. The aim of these stages is to reduce some of the redundancy of the waveform and to work with a less repetitive pattern.
In speech recognition the original waveform to be recognized is one of sound pressure varying with time. This variation in amplitude may be represented electronically as, for example, a voltage level varying with time. However, the characteristic commonly studied in known speech recognizers is that of the variation of energy with frequency for successive short time segments of the waveform. Such a system is shown, for example, in European Patent Application 0086589, where the speech patterns to be recognized are a time series of frequency spectrum envelopes. Such spectrum transformation from a time domain to a frequency domain representation is used to derive spectrograms of unknown words, which can then be correlated with the spectrograms of known words. Recognition is performed by choosing the reference spectrogram which is most similar to the unknown spectrogram.
Such spectrograms can be obtained, for example, from a set of tuned filters whose outputs are sampled periodically, thus producing a spectrogram of a particular time window of speech. To compensate for the low spectral magnitudes at high frequencies of some distinctive features, it is also common to pre-emphasize the spectral content of the waveform by amplifying the signal by a factor which increases with frequency.
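By way of illustration only, pre-emphasis of the kind described above is often realized as a first-order difference filter. The following is a minimal sketch under that assumption; the function names and the coefficient value 0.95 are hypothetical choices and are not taken from the cited references.

```python
import math


def pre_emphasize(samples, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1] (assumed first-order pre-emphasis)."""
    if not samples:
        return []
    out = [samples[0]]  # first sample has no predecessor and is passed through
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def pre_emphasis_gain(alpha, omega):
    """Magnitude response |1 - alpha * exp(-j*omega)| at angular frequency omega
    (radians per sample, from 0 to pi)."""
    return math.sqrt(1.0 + alpha * alpha - 2.0 * alpha * math.cos(omega))
```

For a coefficient between 0 and 1, the gain of this filter rises steadily from its lowest value at zero frequency to its highest value at half the sampling rate, which is the behaviour the passage describes.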
The aim of such signal transformations is to improve the recognition performance of the overall system. However, although much signal redundancy is removed, information is also lost. For instance, the time ordering of events separated by periods less than the width of the transform window or the filter bank time constant is lost. The loss of such information has a detrimental effect on the recognition performance on waveforms which are only distinguishable by short transient events.
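The loss of time ordering within a transform window can be illustrated with a small sketch. It assumes that the transform is a discrete Fourier transform of a single frame and that only the spectral magnitudes are retained; the frame contents and function name are hypothetical.

```python
import cmath


def dft_magnitudes(frame):
    """Return the magnitude of each discrete Fourier transform bin of a frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


# Two transient events in one order, and the same frame reversed in time.
frame_a = [1.0, 0.0, 0.0, 0.0, 0.0, -0.5, 0.0, 0.0]
frame_b = list(reversed(frame_a))

mags_a = dft_magnitudes(frame_a)
mags_b = dft_magnitudes(frame_b)

# The frames differ, yet their magnitude spectra agree to rounding error.
same_spectrum = all(abs(a - b) < 1e-9 for a, b in zip(mags_a, mags_b))
```

Under these assumptions the two frames are indistinguishable once reduced to magnitude spectra, although the order of the two events within the window is reversed.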
Such spectrogram correlation methods are conventionally extended by detecting the peaks in energy, called formants, which can be observed in spectrograms. Spoken words are characterized by the pattern of energy peaks in the frequency-time domain but, as with phonemes, there is no definition of formants which is independent of word context or speaker. Moreover, formants are extremely difficult to locate reliably in real speech.
In addition to the above problems, speech signals suffer from considerable variation between repetitions of the same utterance, and between utterances from different speakers of the same words. Such variations can occur in a variety of characteristics, one example being the time duration of a word. This hampers conventional recognition systems, which are unable to act independently of such variability.
Non-linear variations in the duration of words are conventionally handled by allowing the spectrograms being correlated to stretch in time or frequency by a process known as Dynamic Time Warping (DTW). However, such methods have a large processing requirement, and the consequently less specific matching process increases the likelihood of mismatches between similar sounding words such as "pin" and "bin".
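A minimal sketch of a textbook form of Dynamic Time Warping follows, given only to illustrate the two drawbacks noted above. It assumes that each frame is reduced to a single number and that the absolute difference is used as the local distance; a practical system would compare spectral vectors, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated cost of the cheapest monotonic alignment of sequences a and b."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    # cost[i][j] holds the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b holds (b is stretched)
                cost[i][j - 1],      # b advances while a holds (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[rows][cols]
```

Every cell of the table must be filled, so the work grows with the product of the two sequence lengths for each reference pattern. The stretching also makes the match less specific: in this sketch `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the sequences differ in length.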
The preliminary segmentation of speech into words that the above systems require is generally achieved by assuming that the energy of the acoustic signal drops beneath a threshold for a sufficient period of time between words to be detectable. However, with connected speech, where words are run together, such an assumption is incorrect. Furthermore, if a DTW technique is being used, the word category decision must be made in parallel with the word segmentation decision, even though this imposes an even greater computational burden.
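The energy threshold assumption can be sketched as follows. This is an illustrative sketch only; the frame length, threshold, minimum gap and function name are hypothetical parameters and are not drawn from any particular prior system.

```python
def segment_by_energy(samples, frame_len, threshold, min_gap_frames):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of frames
    whose energy stays at or above threshold, split wherever at least
    min_gap_frames consecutive frames fall below it."""
    energies = [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples), frame_len)
    ]
    segments = []
    start = None  # index of the first frame of the word in progress
    quiet = 0     # consecutive low-energy frames seen inside that word
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap_frames:
                segments.append((start, idx - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        segments.append((start, len(energies) - quiet))
    return segments


loud, pause = [1.0] * 4, [0.0] * 4
isolated = segment_by_energy(loud + pause + loud, 2, 0.5, 2)
connected = segment_by_energy(loud + pause[:2] + loud, 2, 0.5, 2)
```

In this sketch the two bursts separated by a long pause are returned as two segments, whereas the same bursts separated by a pause shorter than the minimum gap are returned as a single segment, which corresponds to the failure on connected speech described above.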
In contrast to the above recognizers, very few known waveform recognizers work directly from the speech waveform, and thus in the time domain, because of the seeming impracticability of matching sample waveforms directly with reference waveforms. There are some systems which use zero crossing detection as an alternative to the above frequency spectrum analysis. Zero crossings, however, give only a crude measure of the original waveform, and much of the essential information for recognition contained in the waveform is lost.
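A minimal sketch of zero crossing counting is given below to show why the measure is crude. The treatment of samples that are exactly zero (they are skipped) and the function name are assumptions made for the illustration.

```python
def zero_crossing_count(samples):
    """Count sign changes between successive non-zero samples."""
    count = 0
    previous_positive = None  # sign of the last non-zero sample seen
    for s in samples:
        if s == 0:
            continue  # assumed convention: exact zeros carry no sign
        positive = s > 0
        if previous_positive is not None and positive != previous_positive:
            count += 1
        previous_positive = positive
    return count


waveform_a = [1.0, -1.0, 1.0, -1.0]
waveform_b = [5.0, -0.1, 0.2, -3.0]
```

Both example waveforms yield the same count although their amplitudes and shapes differ considerably, so the measure alone cannot distinguish them.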
Some investigations of the time domain signal for speech (as opposed to the more common frequency domain spectrograms) have been made, for example that disclosed in the PhD Thesis of J. M. Baker (Carnegie Mellon University 1975). However, such studies have been limited to the observation of distinctive phonetic events and their characterization by five measures: cycle period, cycle frequency, cycle amplitude, and two measures of high frequency components within each cycle. This is an extension of the zero crossing method, but it still does not take account of important relationships between successive cycles in the signal. It also cannot cope with any within-cycle structure other than through the maximum amplitude measure and two rough estimates of high frequency content.