The present invention relates to speech processing apparatus and methods. More particularly, the present invention relates to improved apparatus and methods for use in automatic speech recognition technology to process a class of speech sounds called burst-friction sounds.
The present patent application is also directed to improvements in speech processing apparatus and methods over those described in coassigned J. D. Miller U.S. patent application Ser. No. 060,397 filed June 9, 1987, which provides an extensive description of a new speech processing technology and is incorporated herein by reference.
Speech, as it is perceived, can be thought of as being made up of segments or speech sounds. These are the phonetic elements, the phonemes, of a spoken language and they can be represented by a set of symbols, such as International Phonetic Association symbols.
These segments are linguistic units and have their bases in speech as it is perceived and spoken. All of the syllables and words of a language are made up of a relatively small number of phonetic elements. For example, in the case of English, textbooks in phonetics may list as few as 25 consonants and 12 vowels for a total of 37 phonemes. If the finer phonetic distinctions are included, then the list of distinguishable speech sounds or phones may lengthen to as many as 50 or 60.
Burst-friction spectra are involved in the perception of voiced plosives (e.g. /g/, /d/, and /b/) and voiceless aspirated and unaspirated stops or plosives (e.g. sounds of k, t or p), voiceless fricatives (e.g. s, h, sh, th in "both", f and wh) and voiced fricatives (e.g. z, zh, j, v and th in "the"). Thus, burst-friction spectra participate in a large part of the speech sound inventory of most natural languages. Other types of speech sounds include the nasal consonants, the approximants, and the vowels.
It has been proposed that the phonemes of a spoken language can be understood in terms of a small set of distinctive features numbering about 12. These features have their bases in articulatory, perceptual, and linguistic analyses. A feature approach is often used in textbooks on phonetics as the phones and phonemes are described in terms of place of articulation and manner of articulation.
There are several viable theories of speech perception attempting to explain how the human listener processes an incoming acoustic waveform of speech and translates that waveform into a series of linguistic elements such as phonemes or words. All of the current theories can be cast into a generic three-stage model, with the acoustic input undergoing three stages of processing in a bottom-up sequence. Stage 1 is an auditory-sensory analysis of the incoming acoustic waveform whereby representation of the signal is achieved in auditory-sensory terms. Stage 2 is an auditory-perceptual transformation whereby the spectral output of stage 1 is transformed into a perceptual form relevant to phonetic recognition. Here the spectral descriptions are transformed into dimensions more directly relevant to perception. For example, in various theories the perceptual form may be related to articulatory correlates of speech production or auditory features or pattern sequences. Finally, there is stage 3 in which the perceptual dimensions of stage 2 are transformed by a phonetic-linguistic transformation into strings of phonemes, syllables, or words. Stages 2 and 3 also are influenced by top-down processing wherein stored knowledge of language and events and recent inputs, including those from other senses in addition to language as heard, are brought into play.
Some work in automatic speech recognition has involved a narrow-band spectral analysis performed on a time-windowed speech waveform. In one system described in "Recognizing continuous speech remains an elusive goal" by R. Reddy et al., IEEE Spectrum, Nov. 1983, pp. 84-87, incoming digitized signals are broken into centisecond slices and spectrally analyzed. Each slice is compared with a collection of sound prototypes and the prototype closest to each slice is entered into a sequence. The prototype sequence is then used to roughly categorize the initial sound of the word, which in turn is used to produce word hypotheses. Each word is then tested by creating a probability matrix, and a cycle of operation repeats for the next word until an entire sentence is identified.
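The slice-and-match step described above can be illustrated with a minimal sketch. It is not the system of Reddy et al.; it only assumes that each centisecond slice is reduced to a magnitude spectrum and matched to the nearest stored prototype. The naive DFT, the Euclidean distance measure, the 8 kHz sample rate, and the prototype labels "low" and "high" are all assumptions chosen for illustration.

```python
import cmath
import math


def slice_signal(samples, sample_rate, slice_seconds=0.01):
    """Split samples into consecutive, non-overlapping slices (10 ms by default)."""
    size = int(round(sample_rate * slice_seconds))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def magnitude_spectrum(frame):
    """Magnitudes of a naive DFT for bins 0..N/2 (adequate for short slices)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def nearest_prototype(spectrum, prototypes):
    """Return the label of the prototype closest in Euclidean distance (an assumed metric)."""
    return min(prototypes, key=lambda label: math.dist(spectrum, prototypes[label]))


def prototype_sequence(samples, sample_rate, prototypes):
    """Map each slice of the signal to the label of its nearest prototype."""
    return [
        nearest_prototype(magnitude_spectrum(frame), prototypes)
        for frame in slice_signal(samples, sample_rate)
    ]


def tone(freq_hz, sample_rate, n_samples):
    """Synthetic sine tone standing in for digitized speech."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)]


SAMPLE_RATE = 8000  # assumed rate; one centisecond slice is then 80 samples

# Hypothetical prototypes: one 10 ms spectrum per label.
PROTOTYPES = {
    "low": magnitude_spectrum(tone(500, SAMPLE_RATE, 80)),
    "high": magnitude_spectrum(tone(2000, SAMPLE_RATE, 80)),
}

# Two slices of a 500 Hz tone followed by one slice of a 2000 Hz tone.
signal = tone(500, SAMPLE_RATE, 160) + tone(2000, SAMPLE_RATE, 80)
labels = prototype_sequence(signal, SAMPLE_RATE, PROTOTYPES)
print(labels)
```

The sketch stops at the prototype sequence. The later steps described above, namely categorizing the initial sound, producing word hypotheses, and testing them with a probability matrix, are not modeled here.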
U.S. Pat. No. 4,667,341 discusses a continuous speech recognition system directed to the problem of reducing the probability of false recognition.
The exact mechanisms and processes involved in the perception of speech are still not fully understood in the art. However, the theoretical and technological framework for speech processing described in the incorporated, coassigned J. D. Miller patent application has opened up a new direction in automatic speech processing.
Still further improvements in recognition of speech sounds are needed in the art, and one of the areas where improvements are particularly desirable is in the processing of burst-friction speech sounds to make them more accurately distinguishable by automatic speech recognition apparatus. A paper by Stevens, K. N. et al., "Crosslanguage Study of Vowel Perception", Lang. and Speech, Vol. 12, pp. 1-23 (1969, FIGS. 9 and 12 dealing with stop consonants) discusses which peaks are perceptually significant in spectra generated for consonants that are known beforehand. However, the reverse determinations now sought, which would accurately characterize unknown burst-friction speech sounds from their spectra, are seemingly contradictory and unpredictable.