In order to store or transmit voice at low bit rates, it is known to digitize human speech and then to encode the speech so as to minimize the number of digital bits per second required to represent the speech. The analog speech samples are customarily partitioned into frames or segments of discrete length on the order of 20 milliseconds in duration. Sampling is typically performed at a rate of 8 kilohertz (kHz), and each sample is encoded into a multi-bit digital number. Successive coded samples are further processed in a linear predictive coder (LPC) that determines appropriate filter coefficients or parameters that model the human vocal tract. The filter parameters can be used to estimate the present value of each signal sample efficiently on the basis of the weighted sum of a preselected number of prior sampled values. The filter parameters model the formant structure of the vocal tract transfer function. The speech signal is regarded analytically as being composed of an excitation signal and a formant transfer function. The excitation component arises in the larynx or voice box, and the formant component results from the operation of the remainder of the vocal tract on the excitation component. The excitation component is further classified as voiced or unvoiced, depending upon whether or not there is a fundamental frequency imparted to the air stream by the vocal cords. If there is such a fundamental frequency, then the excitation component is classified as voiced. If the excitation is unvoiced, then the excitation component is simply treated as white noise in the prior art. To encode speech for low bit rate transmission, it is necessary to determine the LPC coefficients for the segments of speech and transfer these coefficients to the decoding circuit that is to reproduce the speech.
In addition, it is necessary to determine the excitation component and to transfer this component to the decoding circuit or, as it is also commonly called, the synthesizer.
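By way of illustration only, the framing and linear prediction steps described above can be sketched as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion and a predictor order of 10; neither is specified in the text, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000                              # 8 kHz sampling, as stated above
FRAME_MS = 20                                      # frame duration, as stated above
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000      # 160 samples per frame


def split_frames(samples, frame_len=FRAME_LEN):
    """Split samples into consecutive full frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], with r[lag] = sum over n of frame[n] * frame[n - lag]."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order=10):
    """Levinson-Durbin recursion (assumed method, assumed order).

    Returns a[1..order] such that s[n] is approximated by
    a[1]*s[n-1] + ... + a[order]*s[n-order].
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)        # a[0] is unused
    err = r[0]                     # prediction error energy so far
    if err == 0.0:                 # silent frame: nothing to predict
        return a[1:]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / err
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        err *= 1.0 - reflection * reflection
        if err <= 0.0:             # numerically exhausted; stop early
            break
    return a[1:]


def predict(history, coeffs):
    """Weighted sum of prior samples; history[-1] is the most recent sample."""
    return sum(c * history[-k] for k, c in enumerate(coeffs, start=1))
```

For a frame containing a pure 200 Hz tone, an order-2 call such as `lpc_coefficients(frame, order=2)` yields coefficients close to those of the exact two-term recursion for a sinusoid, which is one way to sanity-check the sketch.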
One method for determining the excitation to be utilized in the synthesizer is the multi-pulse excitation model that is described in U.S. Pat. No. 4,472,832, issued on Sept. 18, 1984, to B. S. Atal, et al. This method functions by determining a number of pulses for each frame, which are then used by the synthesizer to excite the formant filter. These pulses are determined by an analysis-by-synthesis method, as is described in the previously cited patent. Whereas the multi-pulse excitation model performs well at bit rates of 9.6 kb/s and above, the quality of speech synthesis starts to degrade at lower bit rates. In addition, during the voiced regions of the speech, the synthesized speech can be slightly rough and not true to the original speech. Another problem that exists with the multi-pulse excitation model is the large amount of computation required to determine the pulses for each frame, since the calculation of the pulses requires a number of complex mathematical operations.
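The analysis-by-synthesis idea behind multi-pulse excitation can be illustrated with a simplified sequential pulse search. The following is a minimal sketch and not the procedure of the cited patent: it assumes a greedy one-pulse-at-a-time search, minimizes a plain squared error, and omits the perceptual weighting of the error that practical coders typically apply. All function names are hypothetical.

```python
def impulse_response(coeffs, length):
    """Impulse response of the all-pole formant filter s[n] = e[n] + sum_k a[k]*s[n-k]."""
    h = []
    for n in range(length):
        fed_back = sum(c * h[n - k]
                       for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        h.append((1.0 if n == 0 else 0.0) + fed_back)
    return h


def multipulse_search(target, coeffs, num_pulses):
    """Greedy pulse search for one frame (simplified, unweighted error).

    Each pass tries a pulse at every position, keeps the one that removes
    the most energy from the remaining error, and subtracts its filtered
    contribution. Returns ([(position, amplitude), ...], final_error).
    """
    n_samples = len(target)
    if n_samples == 0:
        return [], []
    h = impulse_response(coeffs, n_samples)
    # Energy of the (frame-truncated) filter response for each pulse position.
    energies = [sum(h[i] * h[i] for i in range(n_samples - pos))
                for pos in range(n_samples)]
    error = list(target)
    pulses = []
    for _ in range(num_pulses):
        best_score, best_pos, best_amp = -1.0, 0, 0.0
        for pos in range(n_samples):
            corr = sum(error[pos + i] * h[i] for i in range(n_samples - pos))
            score = corr * corr / energies[pos]    # error energy removed
            if score > best_score:
                best_score, best_pos, best_amp = score, pos, corr / energies[pos]
        pulses.append((best_pos, best_amp))
        for i in range(n_samples - best_pos):
            error[best_pos + i] -= best_amp * h[i]
    return pulses, error
```

Even in this reduced form, every pulse requires a correlation at every candidate position in the frame, which gives a sense of the computational burden noted above.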
Another method utilized for determining the excitation for LPC synthesized speech is to determine the pitch or fundamental frequency being generated by the larynx during the voiced regions. The synthesizer, upon receiving the pitch, then generates the corresponding frequency to excite the formant filter. During the periods when the speech is considered to be unvoiced, this fact is transmitted to the synthesizer, and the synthesizer utilizes a white noise generator to excite the formant filter. A problem with this method is that the white noise excitation is an inadequate excitation for plosive consonants, transitions between voiced and unvoiced speech frame sequences, and voiced frames which are erroneously declared unvoiced. This problem results in the synthesized speech not sounding the same as the original speech.
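The pitch-excited synthesizer described above can be sketched as an excitation generator followed by an all-pole formant filter. This is an illustrative sketch under stated assumptions: voiced excitation is modeled as a unit impulse train at the pitch period, unvoiced excitation as Gaussian white noise, and the gain handling, state passing, and function names are hypothetical rather than taken from the text.

```python
import random


def make_excitation(frame_len, voiced, pitch_period=0, phase=0, gain=1.0, rng=None):
    """Build one frame of excitation.

    Voiced: impulses every `pitch_period` samples, starting at `phase`.
    Unvoiced: white (Gaussian) noise.
    Returns (excitation, next_phase), where next_phase keeps the pulse
    spacing continuous across consecutive voiced frames.
    """
    if voiced:
        if pitch_period <= 0:
            raise ValueError("voiced frames need a positive pitch period")
        excitation = [0.0] * frame_len
        n = phase
        while n < frame_len:
            excitation[n] = gain
            n += pitch_period
        return excitation, n - frame_len
    rng = rng or random.Random()
    return [gain * rng.gauss(0.0, 1.0) for _ in range(frame_len)], 0


def synthesize(excitation, coeffs, state=None):
    """All-pole formant filter: s[n] = e[n] + sum_k coeffs[k-1] * s[n-k].

    `state` holds the last len(coeffs) output samples, most recent last.
    Returns (output, new_state) so frames can be filtered back to back.
    """
    history = list(state) if state else [0.0] * len(coeffs)
    output = []
    for e in excitation:
        s = e + sum(c * history[-k] for k, c in enumerate(coeffs, start=1))
        output.append(s)
        history.append(s)
        history.pop(0)
    return output, history
```

At 8 kHz, a pitch of 200 Hz corresponds to `pitch_period=40`. A frame that is voiced but declared unvoiced receives only the noise branch in this scheme, which is the weakness noted above.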
In view of the above, there exists a need for an excitation model that can accurately model both the voiced and unvoiced regions of speech and properly handle the transitional areas between unvoiced and voiced frame sequences as well as reproduce the plosive consonants.