The present invention relates to speech processing apparatus and methods. More particularly, the present invention relates to apparatus and methods for use in automatic speech recognition applications and research.
Speech, as it is perceived, can be thought of as being made up of segments of speech sounds. These are the phonetic elements, the phonemes, of a spoken language and they can be represented by a set of symbols, such as International Phonetic Association symbols.
These segments are linguistic units and have their bases in speech as it is perceived and spoken. All of the syllables and words of a language are made up of a relatively small number of phonetic elements. For example, in the case of English, textbooks in phonetics may list as few as 25 consonants and 12 vowels for a total of 37 phonemes. If the finer phonetic distinctions are included, then the list of distinguishable speech sounds or phones may lengthen to as many as 50 or 60.
It has been proposed that the phonemes of a spoken language can be understood in terms of a small set of distinctive features numbering about 12. These features have their bases in articulatory, perceptual, and linguistic analyses. A feature approach is often used in textbooks on phonetics as the phones and phonemes are described in terms of place of articulation and manner of articulation.
There are several theories of how the human listener processes an incoming acoustic waveform of speech and translates that waveform into a series of linguistic elements such as phonemes or words. The exact mechanisms and processes involved in the perception of speech are not yet fully understood. Finding simple and reliable acoustic-auditory correlates of the phones, phonemes and presumed features has proved elusive.
Research on speech perception has led to complicated, highly conditioned statements of relations between acoustic-auditory patterns and perception of phonemes, and even these statements are often of narrowly circumscribed generality. For example, the problem of how the listener can divide the acoustic input into segments relevant to linguistic perception is not understood. Even if a solution of this segmentation problem were available, the auditory-acoustic expression of a phoneme or feature seems to depend on the phonetic context, the particular talker, and the rate of speaking.
As a result of these problems there are several viable theories of speech perception. All of the current theories can be cast into a generic three-stage model, with the acoustic input undergoing three stages of processing in a bottom-up sequence. Stage 1 is an auditory-sensory analysis of the incoming acoustic waveform whereby representation of the signal is achieved in auditory-sensory terms. Stage 2 is an auditory-perceptual transformation whereby the spectral output of stage 1 is transformed into a perceptual form relevant to phonetic recognition. Here the spectral descriptions are transformed into dimensions more directly relevant to perception. For example, in various theories the perceptual form may be related to articulatory correlates of speech production or auditory features or pattern sequences. Finally, there is stage 3 in which the perceptual dimensions of stage 2 are transformed by a phonetic-linguistic transformation into strings of phonemes, syllables, or words. Stages 2 and 3 also are influenced by top-down processing wherein stored knowledge of language and events and recent inputs, including those from other senses as well as language, are brought into play.
Some work in automatic speech recognition has involved a narrow-band spectral analysis performed on a time-windowed speech waveform. In one system described in "Recognizing continuous speech remains an elusive goal" by R. Reddy et al., IEEE Spectrum, Nov., 1983, pp. 84-87, incoming digitized signals are broken into centisecond slices and spectrally analyzed. Each slice is compared with a collection of sound prototypes and the prototype closest to each slice is entered into a sequence. The prototype sequence is then used to roughly categorize the initial sound of the word, which in turn is used to produce word hypotheses. Each word is then tested by creating a probability matrix and a cycle of operation repeats for the next word until an entire sentence is identified.
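The slice-and-prototype step described above can be illustrated with a minimal sketch. It is not the implementation of the cited system: the 10 ms non-overlapping frames, the naive discrete Fourier transform, the Euclidean distance measure, and every name used (frame_signal, magnitude_spectrum, nearest_prototype, label_slices, tone, and the "low" and "high" prototype labels) are assumptions chosen only for illustration. The later stages, in which the prototype sequence produces word hypotheses that are tested with a probability matrix, are not modeled.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into non-overlapping slices of frame_ms milliseconds."""
    n = sample_rate * frame_ms // 1000
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short slices)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) / n)
    return spectrum


def nearest_prototype(spectrum, prototypes):
    """Return the label of the prototype closest in Euclidean distance."""
    return min(prototypes,
               key=lambda label: math.dist(spectrum, prototypes[label]))


def label_slices(samples, sample_rate, prototypes):
    """Map each slice of the signal to its closest sound prototype."""
    return [nearest_prototype(magnitude_spectrum(f), prototypes)
            for f in frame_signal(samples, sample_rate)]


def tone(freq_hz, sample_rate, n_samples):
    """Synthetic sine tone standing in for a recorded speech sound."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


if __name__ == "__main__":
    rate = 8000  # assumed sampling rate in Hz; one slice is 80 samples
    # Hypothetical prototypes: one slice each of a 500 Hz and a 1500 Hz tone.
    prototypes = {
        "low": magnitude_spectrum(tone(500, rate, 80)),
        "high": magnitude_spectrum(tone(1500, rate, 80)),
    }
    signal = tone(500, rate, 160) + tone(1500, rate, 160)
    print(label_slices(signal, rate, prototypes))
```

With the synthetic input shown, the four slices are labeled low, low, high, high. A practical recognizer would more likely use overlapping, windowed frames and a fast Fourier transform; the naive transform is used here only to keep the sketch free of external dependencies.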
Diphthongs, glides, and r-colored vowels are speech sounds that are all generically referred to as glides herein. Analysis of these sounds continues to pose difficult problems among the many faced in the field of automatic speech recognition. A paper which discusses some of these types of speech sounds is "Transitions, Glides, and Diphthongs" by I. Lehiste et al., J. Acoust. Soc. Am., Vol. 33, No. 3, March, 1961, pp. 268-277.