Human speech, as well as other animal vocalizations, consists primarily of vowels, non-stop consonants, and pauses. Vowels typically represent about seventy percent of the speech signal, consonants about fifteen percent, pauses about three percent, and transition zones between vowels and consonants the remaining twelve percent or so. Because the vowel sound components form the largest part of speech, any form of processing intended to maintain high fidelity to the original (unprocessed) speech should reproduce the vowel sounds or vowel signals as correctly as possible. Naturally, the non-stop consonants, pauses, and other sound or signal components should also be reproduced with an adequate degree of fidelity so that nuances of the speaker's voice are rendered with appropriate clarity, color, and recognizability.
In this description, the term speech "signal" refers to the time-varying acoustic (air pressure) wave emanating from the speaker's mouth, to an acoustic signal reproduced from a prior recording of the speaker, such as may be generated by a loudspeaker or other sound transducer, to an electrical signal generated from such an acoustic wave, or to a digital representation of any of the above acoustic or electrical representations.
A graph of signal amplitude versus time for an electrical signal representing an approximately 0.2 second portion of speech (the syllable "ta") is depicted in FIG. 4, which includes the consonant "t", the transition zone "t-a", and the vowel "a". The vowel and transition signal components comprise a sequence of pitches. Each pitch represents the acoustic response of the articulator volume and geometry (that is, the part of the respiratory tract generally located between and including the lips and the larynx) to an impulse of air pressure produced by the copula.
The frequency of copula contractions for normal speech is typically between about 80 and 200 contractions per second. The geometry of the articulator changes much more slowly than the copula contracts, changing at a frequency of between about four and seven times per second, and more typically between about five and six times per second. Therefore, in general, the articulator geometry changes very little between two consecutive copula contractions. As a result, the pitch duration and the waveform change very little between two consecutive pitches. Somewhat more change may occur between every third or fourth pitch, but such changes may still be relatively small.
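The rates quoted above imply a rough pitch duration and a rough count of pitches produced during each articulator geometry change. The following is a minimal arithmetic sketch using only the ranges stated in the text; the function and variable names are illustrative assumptions and are not part of the described system.

```python
# Ranges quoted in the text; all names below are illustrative assumptions.
CONTRACTIONS_PER_SEC = (80.0, 200.0)    # copula contractions (pitches) per second
GEOMETRY_CHANGES_PER_SEC = (4.0, 7.0)   # articulator geometry changes per second


def pitch_period_ms(rate_hz: float) -> float:
    """Duration of one pitch in milliseconds at a given contraction rate."""
    return 1000.0 / rate_hz


def pitches_per_geometry_change(pitch_rate_hz: float, geometry_rate_hz: float) -> float:
    """Number of pitches produced during one articulator geometry change."""
    return pitch_rate_hz / geometry_rate_hz


low_rate, high_rate = CONTRACTIONS_PER_SEC
slow_geom, fast_geom = GEOMETRY_CHANGES_PER_SEC

longest_pitch_ms = pitch_period_ms(low_rate)     # 12.5 ms at 80 contractions/s
shortest_pitch_ms = pitch_period_ms(high_rate)   # 5.0 ms at 200 contractions/s

# Slowest pitch rate paired with fastest geometry change gives the fewest pitches.
fewest_pitches = pitches_per_geometry_change(low_rate, fast_geom)   # about 11.4
# Fastest pitch rate paired with slowest geometry change gives the most pitches.
most_pitches = pitches_per_geometry_change(high_rate, slow_geom)    # 50.0
```

Under these figures, each pitch lasts roughly 5 to 12.5 milliseconds, and roughly 11 to 50 pitches occur during a single geometry change, which is consistent with the observation that adjacent pitches differ only slightly.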
Conventional systems and methods for reducing speech information storage have typically relied on frequency domain processing to reduce the amount of data that is stored or transmitted. In one conventional approach to speech compression that relies on a form of time domain processing, periods of silence, voiced sound, and unvoiced sound within an utterance are detected, and each voiced sound is approximated by repeating a single representative voiced sound frame for the duration of that voiced sound. The spectral content of each unvoiced sound portion of the utterance and its variations in amplitude are also determined. A compressed data representation of the utterance is then generated which includes an encoded representation of the periods of silence, a duration and a single representative data frame for each voiced sound, and a spectral content and amplitude variations for each unvoiced sound. U.S. Pat. No. 5,448,679 to McKiel, Jr., for example, describes speech compression of this type. Unfortunately, even this approach does not take into account the nature of human speech, where the pattern of the vowel sound is not constant but rather changes significantly between pitches. As a result, the quality of the reproduced speech suffers significant degradation as compared to the original speech.
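The compressed representation used by this kind of conventional approach can be pictured as a list of segment records. The following is only a rough sketch under stated assumptions: the class names, field names, and the decode_voiced helper are hypothetical, are not taken from the cited patent, and decoding of unvoiced sound is omitted.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Silence:
    duration_samples: int              # only the length of the pause is kept


@dataclass
class Voiced:
    representative_frame: List[float]  # one frame stands in for the whole sound
    duration_samples: int


@dataclass
class Unvoiced:
    spectral_content: List[float]      # placeholder for a spectral description
    amplitude_variations: List[float]  # placeholder for an amplitude envelope


# A compressed utterance would be a list of these hypothetical records.
Segment = Union[Silence, Voiced, Unvoiced]


def decode_voiced(segment: Voiced) -> List[float]:
    """Rebuild a voiced sound by tiling its single representative frame."""
    frame = segment.representative_frame
    if not frame:
        return [0.0] * segment.duration_samples
    repeats = -(-segment.duration_samples // len(frame))  # ceiling division
    return (frame * repeats)[: segment.duration_samples]


# Example: a 3-sample frame stretched to cover 7 samples.
decoded = decode_voiced(Voiced([1.0, -1.0, 0.5], 7))
```

Because every repetition in the decoded output is an identical copy of the one stored frame, any variation between pitches in the original voiced sound is lost, which is the limitation noted above.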
Therefore, there remains a need for a system, apparatus, and method for reducing information or data transmission and storage requirements while retaining accurate, high-fidelity speech.