The present invention relates to speech processing and more specifically to a method and system for low bit rate digital encoding and decoding of speech using separate processing of voiced and unvoiced components of speech signal segments on the basis of a voicing probability determination.
Digital encoding of voiceband speech has been the subject of intensive research for at least three decades, as a result of which various techniques have been developed targeting different speech processing applications at bit rates ranging from about 64 kb/s down to about 2.4 kb/s. Two of the main factors which influence the choice of a particular speech processing algorithm are the desired speech quality and the bit rate. Generally, the lower the bit rate of the speech coder, i.e., the higher the signal compression, the more the speech quality suffers. In each specific application, the choice is thus a matter of compromise between the desired speech quality, which in many instances is strictly specified, and the information capacity of the transmission channel and/or the speech processing system, which determines the bit rate. The present invention is specifically directed to a low bit rate system and method for speech and voiceband coding to be used in speech processing and modern multimedia systems which require large volumes of data to be processed and stored, often in real time, and acceptable quality speech to be delivered over narrowband communication channels.
For practical low bit rate digital speech signal transformation, communication and storage purposes, it is necessary to reduce the amount of data to be transmitted and stored by eliminating redundant information without significant degradation of the output speech quality. There are some well known prior art speech signal compression and coding techniques which exploit signal redundancies to reduce the required bit rate. Generally, these techniques can be classified as speech processing using analysis-and-synthesis (AAS) and analysis-by-synthesis (ABS) methods. Although AAS methods, such as residual excited linear predictive coding (RELP), adaptive predictive coding (APC) and subband coding (SBC), have been successful at rates in the range of about 9.6-16 kb/s, below that range they can no longer produce good quality speech. The reasons for this are generally related to the fact that: (a) there is no feedback mechanism to control the distortions in the reconstructed speech; and (b) errors in one speech frame generally propagate into subsequent frames without correction. In ABS schemes, on the other hand, both of these factors are taken into account, which enables them to operate much more successfully in the low bit rate range.
Specifically, in ABS coding systems it is assumed that the signal can be observed and represented in some form. Then, a theoretical signal production model is assumed which has a number of adjustable parameters to model different ranges of the input signal. By varying parameters of the model in a systematic way it is thus possible to find a set of parameters that can produce a synthetic speech signal which matches the real signal with minimum error. In practical applications synthetic speech is most often generated as the output of a linear predictive coding (LPC) filter. Next, a residual, "excitation" signal is obtained by subtracting the synthetic model speech signal from the actual input signal. Generally, the dynamic range of the residual signal is much more limited, so that fewer bits are required for its transmission and storage. Finally, perceptually based minimization procedures can be employed to reduce the speech distortions at the synthesis end even further.
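The closed-loop parameter search described above can be illustrated by the following minimal sketch. It is not taken from any particular coder: the one-pole synthesis filter, the candidate grids for gain and pitch period, and the names synthesize, squared_error and abs_search are all assumptions made for illustration. Each candidate parameter set is used to synthesize a signal, the squared error against the target is measured, and the best candidate and its residual are kept.

```python
# Toy analysis-by-synthesis search. The one-pole filter, the parameter
# grids and every function name here are illustrative assumptions.

def synthesize(gain, pitch_period, n_samples, pole=0.9):
    """Pass an impulse train (given gain and period) through a one-pole filter."""
    out = []
    prev = 0.0
    for n in range(n_samples):
        excitation = gain if n % pitch_period == 0 else 0.0
        prev = excitation + pole * prev
        out.append(prev)
    return out


def squared_error(reference, candidate):
    """Sum of squared sample differences between two equal-length signals."""
    return sum((r - c) ** 2 for r, c in zip(reference, candidate))


def abs_search(target, gains, periods):
    """Try every (gain, period) pair and keep the one with minimum error."""
    best = None
    for gain in gains:
        for period in periods:
            candidate = synthesize(gain, period, len(target))
            err = squared_error(target, candidate)
            if best is None or err < best[0]:
                best = (err, gain, period, candidate)
    err, gain, period, candidate = best
    # Residual: what the best model output fails to explain in the target.
    residual = [t - c for t, c in zip(target, candidate)]
    return gain, period, err, residual


if __name__ == "__main__":
    target = synthesize(0.5, 40, 160)
    gain, period, err, _ = abs_search(target, [0.25, 0.5, 1.0], [20, 40, 80])
    print(gain, period, err)
```

A practical coder would search a far richer parameter space and would typically weight the error perceptually; the exhaustive grid is used here only to keep the feedback loop visible.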
Various techniques have been used in the past to design the speech model filter, to form an appropriate excitation signal and minimize the error between the original signal and the synthesized output in some meaningful way. There appears to be a consensus, however, that no single technique is likely to succeed in all applications. The reason for this is that the performance of digital compression and coding systems for voice signals is highly dependent on the speaker and the selection of speech frames. The success of a technique selected in a particular application thus frequently depends on the accuracy of the underlying signal model and the flexibility in adjusting the model parameters. As known in the art, various speech signal models have been proposed in the past.
Most frequently, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for the unvoiced sounds. For mathematical convenience, it is assumed that the speech signal is stationary within a given short time segment, so that the continuous speech is represented as an ordered sequence of distinct voiced and unvoiced speech segments.
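The short-time model described above can be sketched as follows. The filter coefficients, segment length, pitch period and function names are hypothetical choices for illustration only: a voiced segment is produced by exciting an all-pole filter with a periodic impulse train, and an unvoiced segment by exciting the same filter with random noise.

```python
import random

# Illustrative source-filter sketch. Coefficients, segment length and
# pitch period are assumed values, not taken from the text.

def impulse_train(n_samples, pitch_period):
    """Unit impulses spaced pitch_period samples apart (voiced excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Gaussian noise from a seeded generator (unvoiced excitation)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(excitation, a):
    """Compute y[n] = x[n] - sum_k a[k-1] * y[n-k] for k = 1..len(a)."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def synthesize_segment(voiced, n_samples=160, pitch_period=80, a=(-0.9,)):
    """Synthesize one stationary segment as voiced or unvoiced."""
    if voiced:
        excitation = impulse_train(n_samples, pitch_period)
    else:
        excitation = white_noise(n_samples)
    return all_pole_filter(excitation, a)
```

Calling synthesize_segment(True) and synthesize_segment(False) in sequence gives the ordered succession of voiced and unvoiced segments that the model assumes.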
Voiced speech segments, which correspond to vowels in a speech signal, typically contribute most to the intelligibility of the speech, which is why it is important to represent these segments accurately. However, for a low-pitched voice, a set of more than 80 harmonic frequencies ("harmonics") may be measured in a voiced speech segment within a 4 kHz bandwidth. Clearly, encoding information about all harmonics of such a segment is only possible if a large number of bits is used. Therefore, in applications where it is important to keep the bit rate low, more sophisticated speech models need to be employed.
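The harmonic count quoted above follows from simple arithmetic: the harmonics of a fundamental frequency f0 lie at integer multiples of f0, so the number that fit within a bandwidth B is the integer part of B/f0. The sketch below, in which the function name and the example pitch values are assumptions, shows that a fundamental below 50 Hz yields more than 80 harmonics in 4 kHz.

```python
import math

def harmonic_count(f0_hz, bandwidth_hz=4000.0):
    """Number of harmonics k * f0 (k >= 1) at or below the bandwidth."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    return int(math.floor(bandwidth_hz / f0_hz))


if __name__ == "__main__":
    # Example pitch values are assumed for illustration.
    for f0 in (48.0, 50.0, 100.0, 200.0):
        print(f0, harmonic_count(f0))
```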
One typical approach is to separate the speech signal into its voiced and unvoiced components. The two components are then synthesized separately and finally combined to produce the complete speech signal. For example, U.S. Pat. No. 4,771,465 describes a speech analyzer and synthesizer system using a sinusoidal encoding and decoding technique for voiced speech segments and noise excitation or multipulse excitation for unvoiced speech segments. In the process of encoding the voiced segments a fundamental subset of harmonic frequencies is determined by a speech analyzer and is used to derive the parameters of the remaining harmonic frequencies. The harmonic amplitudes are determined from linear predictive coding (LPC) coefficients. The method of synthesizing the harmonic spectral amplitudes from a set of LPC coefficients, however, requires extensive computations and yields relatively poor quality speech.
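One common way to obtain harmonic amplitudes from LPC coefficients, shown here only as a sketch and not necessarily the procedure of the cited patent, is to sample the all-pole spectral envelope 1/|A(e^jw)| at each harmonic frequency. The sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, the 8 kHz sampling rate and the function names are assumptions. The sketch also makes the computational cost visible: every harmonic requires a separate evaluation of the predictor polynomial.

```python
import cmath
import math

# Sketch: sample an all-pole (LPC) envelope at the pitch harmonics.
# Assumed convention: A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p.

def lpc_envelope(a, freq_hz, fs_hz=8000.0, gain=1.0):
    """Magnitude of gain / A(e^{jw}) at the given frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    denom = 1.0 + sum(ak * cmath.exp(-1j * w * k)
                      for k, ak in enumerate(a, start=1))
    return gain / abs(denom)


def harmonic_amplitudes(a, f0_hz, fs_hz=8000.0, bandwidth_hz=4000.0):
    """Envelope values at k * f0 for every harmonic within the bandwidth."""
    n_harmonics = int(bandwidth_hz // f0_hz)
    return [lpc_envelope(a, k * f0_hz, fs_hz)
            for k in range(1, n_harmonics + 1)]
```

For a predictor of order p and a frame with N harmonics, this direct evaluation costs on the order of N times p complex operations per frame.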
Other techniques focus on more accurate modeling of the excitation signal. The excitation signal in a speech coding system is very important because it reflects residual information which is not covered by the theoretical model of the signal. This includes the pitch, long term and random patterns, and other factors which are critical for the intelligibility of the reconstructed speech. One of the most important tasks in this respect is the accurate determination of the pitch. Studies have shown that the human ear is more sensitive, by an order of magnitude, to changes in the pitch than to changes in other speech signal parameters, which is why a number of techniques to accurately estimate the pitch have been proposed in the past. For example, U.S. Pat. Nos. 5,226,108 and 5,216,747 to Hardwick et al. describe an improved pitch estimation method providing sub-integer resolution. The quality of the output speech according to the proposed method is improved by increasing the accuracy of the decision as to whether a given speech segment is voiced or unvoiced. This decision is made by comparing the energy of the current speech segment to the energy of the preceding segments. The proposed methods, however, generally do not allow accurate estimation of the amplitude information for all harmonics.
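To indicate what sub-integer pitch resolution means in practice, the following sketch refines an integer autocorrelation peak by parabolic interpolation. This is a generic textbook technique substituted for illustration; it is not the method of the patents cited above, and the function names, lag range and test signal are assumptions.

```python
import math

# Generic fractional pitch estimate: autocorrelation peak refined by a
# parabola through the peak and its two neighbors. Not the cited method.

def autocorrelation(x, lag):
    """Sum of x[n] * x[n + lag] over the overlapping samples."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def estimate_pitch_period(x, lag_min, lag_max):
    """Pitch period in samples, with fractional (sub-sample) resolution.

    Requires lag_min >= 1 and lag_max + 1 < len(x).
    """
    r = {lag: autocorrelation(x, lag)
         for lag in range(lag_min - 1, lag_max + 2)}
    best = max(range(lag_min, lag_max + 1), key=lambda lag: r[lag])
    curvature = r[best - 1] - 2.0 * r[best] + r[best + 1]
    if curvature >= 0.0:
        return float(best)  # no local maximum to interpolate around
    offset = 0.5 * (r[best - 1] - r[best + 1]) / curvature
    offset = max(-0.5, min(0.5, offset))  # stay within half a sample
    return best + offset


if __name__ == "__main__":
    fs, f0 = 8000.0, 210.0  # assumed test tone; true period is fs / f0
    x = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(400)]
    print(estimate_pitch_period(x, 20, 60), fs / f0)
```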
In an approach related to the harmonic signal coding techniques discussed above, it has been proposed to increase the accuracy of the signal reconstruction by using a series of binary voiced/unvoiced decisions corresponding to each speech frame in what is known in the art as multiband excitation (MBE) coders. The MBE speech coders provide more flexibility in the selection of speech voicing compared with traditional vocoders, and can be used to generate good quality speech. In fact, an improved version of the MBE (IMBE) vocoder operating at 4.15 kb/s, with forward error correction (FEC) making it up to 6.4 kb/s, has been chosen for use in INMARSAT-M. In these speech coders, however, typically the number of harmonic magnitudes in the 4 kHz bandwidth varies with the fundamental frequency, requiring variable bit allocation for each harmonic magnitude from one frame to another, which can result in variable speech quality for different speakers. Another limitation of the IMBE coder is that the bit allocation for the model parameters depends on the fundamental frequency, which reduces the robustness of the system to channel errors. In addition, errors in the voiced/unvoiced decisions, especially when made in the low frequency bands, result in perceptually objectionable degradation in the quality of the output speech.
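The dependence of the bit allocation on the fundamental frequency can be illustrated with rough arithmetic. The sketch below assumes a 20 ms frame, which is not stated in the text above, and makes the simplifying assumption that the whole frame budget is spent on harmonic magnitudes, whereas a real coder also spends bits on pitch and voicing information. The function names are hypothetical. Under these assumptions the 4.15 kb/s rate gives 83 bits per frame, and the average number of bits available per harmonic magnitude varies widely with pitch.

```python
# Rough bit-budget arithmetic. The 20 ms frame and the assumption that
# all bits go to harmonic magnitudes are simplifications for illustration.

def frame_bit_budget(bit_rate_bps, frame_ms):
    """Whole bits available per frame at the given rate."""
    return bit_rate_bps * frame_ms // 1000


def bits_per_harmonic(f0_hz, budget_bits, bandwidth_hz=4000):
    """Harmonic count and average bits per harmonic magnitude."""
    n_harmonics = int(bandwidth_hz // f0_hz)
    if n_harmonics < 1:
        raise ValueError("fundamental frequency exceeds the bandwidth")
    return n_harmonics, budget_bits / n_harmonics


if __name__ == "__main__":
    budget = frame_bit_budget(4150, 20)
    for f0 in (100, 200, 400):  # assumed example pitch values
        print(f0, bits_per_harmonic(f0, budget))
```

A low-pitched speaker thus receives far fewer bits per harmonic than a high-pitched speaker, which is consistent with the speaker-dependent quality noted above.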
Therefore, it is perceived that there exists a need for more flexible methods for encoding and decoding of speech, which can be used in low bit rate applications. Accordingly, there is a present need to develop a modular system in which optimized processing of different speech segments, or speech spectrum bands, is performed in specialized processing blocks to achieve best results for different types of speech and other acoustic signal processing applications. Furthermore, there is a need to more accurately classify each speech segment in terms of its voiced/unvoiced content in order to apply optimum signal compression for each type of signal. In addition, there is a need to obtain accurate estimates of the amplitudes of the spectral harmonics in voiced speech segments in a computationally efficient way and to develop a method and system to synthesize such voiced speech segments without the requirement to store or transmit separate phase information.