The present invention relates to a communication system and, more particularly, to a speech compression method for a communication system.
Many speech compression systems are known. Generally, these systems may be divided into three types: time domain, frequency domain and hybrid codecs. For low bit-rate coding, however, the multi-band excitation (MBE) compression technique provides the best quality of decoded speech.
MBE vocoders encode the obtained speech signal by first dividing the input speech into constrained frames. These frames are transformed from the time domain to the frequency domain: a frequency spectrum of the framed and windowed signal is calculated, and an analysis of the frequency spectrum is performed. Speech model parameters such as a pitch value, a set of voiced/unvoiced decisions for the frequency bands, a set of spectral magnitudes and the corresponding phase values are needed for speech synthesis in MBE vocoders. Usually, the phase values are not transmitted for low bit-rate coding.
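The framing, windowing and spectrum steps described above can be sketched as follows. This is a minimal illustration rather than the analysis stage of any actual vocoder: the function names, the frame length, the hop size and the use of a direct DFT are all assumptions chosen for readability.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Cut a signal into fixed-length frames (hop < frame_len gives overlap)."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Window one frame and return DFT magnitudes for bins 0..n/2."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical input: a tone that falls exactly on DFT bin 8 of a 64-sample frame.
tone = [math.cos(2 * math.pi * 8 * t / 64) for t in range(128)]
frames = split_frames(tone, frame_len=64, hop=32)
mags = magnitude_spectrum(frames[0])
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

The pitch value, voiced/unvoiced decisions and spectral magnitudes would then be estimated from spectra such as `mags`; those estimators are not shown here.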
There are numerous ways of approximating the spectrum, all of which are based on approximating the frequency bands by some excitation function. The most traditional excitation function is the frequency response of the Hamming window. However, the Hamming window yields a good approximation of the original spectrum only for stationary speech signals. For non-stationary speech signals, a predetermined excitation function does not match the real shape of the spectrum well enough for an accurate approximation. For example, a pitch frequency change during the analysis period may widen the peaks in the spectral magnitude envelope, so that the width of the peaks of the predetermined excitation function no longer corresponds to the width of the real peaks. Moreover, if the analyzed speech frame is a blend of two different processes, the spectrum has a very complex shape, which is difficult to approximate accurately by means of a simple predetermined excitation function.
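To make the mismatch concrete, a band can be fitted with a fixed excitation shape by least squares: the amplitude is the projection of the observed band onto the shape, and the residual measures how poorly the shape matches. The sketch below works on real magnitudes only and uses made-up sample values; both are simplifying assumptions, not the procedure of any particular MBE coder.

```python
def fit_band(observed, excitation):
    """Least-squares amplitude of a fixed excitation shape, plus the fit error."""
    num = sum(s * e for s, e in zip(observed, excitation))
    den = sum(e * e for e in excitation)
    amplitude = num / den
    error = sum((s - amplitude * e) ** 2 for s, e in zip(observed, excitation))
    return amplitude, error


# Hypothetical samples of a window main lobe across one harmonic band.
excitation = [0.1, 0.5, 1.0, 0.5, 0.1]

# Stationary case: the observed peak is a scaled copy of the excitation shape.
narrow_peak = [2.0 * e for e in excitation]
# Non-stationary case: a pitch change has widened the observed peak.
wide_peak = [1.2, 1.8, 2.0, 1.8, 1.2]

amp_narrow, err_narrow = fit_band(narrow_peak, excitation)
amp_wide, err_wide = fit_band(wide_peak, excitation)
```

For the scaled copy the residual vanishes, while the widened peak leaves a clearly nonzero residual even at the best-fitting amplitude, which is the effect described above.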
There are also many techniques for encoding the MBE parameters. Typically, simple scalar quantization is used for encoding the pitch value, and a band grouping method is used for encoding the voiced/unvoiced decisions. The most difficult task is encoding the spectral magnitudes, for which Vector Quantization (VQ), Linear Prediction and the like are used. Numerous high-efficiency compression methods based on VQ have been proposed, one of which uses a hierarchically structured codebook for encoding the spectral magnitudes.
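As an illustration of the simplest of these techniques, a uniform scalar quantizer for the pitch value might look like the sketch below. The pitch range (20 to 120 samples) and the 7-bit budget are hypothetical values picked for the example, not figures taken from any coder.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer index using a uniform step."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(index, lo, hi, bits):
    """Recover the reconstruction value for an index."""
    levels = (1 << bits) - 1
    return lo + index * (hi - lo) / levels


# Hypothetical configuration: pitch period in samples, 7 bits per frame.
PITCH_LO, PITCH_HI, PITCH_BITS = 20.0, 120.0, 7

index = quantize(57.3, PITCH_LO, PITCH_HI, PITCH_BITS)
decoded = dequantize(index, PITCH_LO, PITCH_HI, PITCH_BITS)
step = (PITCH_HI - PITCH_LO) / ((1 << PITCH_BITS) - 1)
```

The reconstruction error of an in-range value is bounded by half of `step`, so the bit budget directly sets the pitch resolution.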
Although the VQ technique allows accurate quantization in a given problem area, it is generally effective only for data close to those included in the "learning sequences". Other effective methods for encoding spectral magnitudes are intra-frame and inter-frame linear prediction. The intra-frame method allows adequate encoding of the spectral magnitudes, but its effectiveness deteriorates substantially at low bit rates. The inter-frame prediction method is also fairly good, but its use is reasonable only for stationary speech signals.
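The dependence of inter-frame prediction on stationarity can be seen in a small sketch. It assumes a first-order predictor with a fixed coefficient `rho`, equal band counts in both frames, and invented magnitude values; none of these details come from a specific coder.

```python
def prediction_residual(current, previous, rho=0.9):
    """Residual left after predicting each magnitude from the previous frame."""
    return [c - rho * p for c, p in zip(current, previous)]


def reconstruct(residual, previous, rho=0.9):
    """Decoder side: add the prediction back to the residual."""
    return [r + rho * p for r, p in zip(residual, previous)]


def energy(values):
    return sum(v * v for v in values)


# Hypothetical spectral magnitudes of consecutive frames.
previous = [10.0, 8.0, 6.0, 4.0]
stationary = [10.2, 7.9, 6.1, 3.8]   # close to the previous frame
changed = [3.0, 9.0, 1.0, 8.0]       # spectrum has changed abruptly

res_stationary = prediction_residual(stationary, previous)
res_changed = prediction_residual(changed, previous)
```

Only the residual has to be quantized. For the stationary frame its energy is a small fraction of the frame's own energy, whereas for the changed frame the prediction removes far less, so it saves few bits.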
The speech synthesis in the related art is carried out according to an accepted speech model. Generally, the two components of the MBE vocoders, the voiced and unvoiced parts of speech, are synthesized separately and combined later to produce a complete speech signal.
The unvoiced component of the speech is generated for the frequency bands that are determined to be unvoiced. For each speech frame, a block of random noise is windowed and transformed to the frequency domain, where the regions of the spectrum corresponding to the voiced harmonics are set to zero. The remaining spectral components, corresponding to the unvoiced parts of speech, are normalized to the unvoiced harmonic magnitudes.
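A minimal sketch of this unvoiced synthesis is given below. The band layout, the choice to normalize each band so that its root-mean-square bin magnitude equals the target magnitude, and the direct DFT are assumptions made for illustration; an actual decoder would also overlap-add successive frames.

```python
import cmath
import math
import random


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT of a Hermitian-symmetric spectrum (real time signal)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def synth_unvoiced(n, bands, rng):
    """bands: (first_bin, end_bin_exclusive, is_voiced, magnitude), bins in 1..n/2-1."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    noise = [rng.gauss(0.0, 1.0) * w for w in window]
    noise_spec = dft(noise)
    shaped = [0j] * n  # voiced bins, and bins outside every band, stay zero
    for first, end, is_voiced, magnitude in bands:
        if is_voiced:
            continue
        power = sum(abs(noise_spec[k]) ** 2 for k in range(first, end))
        rms = math.sqrt(power / (end - first))
        gain = magnitude / rms if rms > 0 else 0.0
        for k in range(first, end):
            shaped[k] = noise_spec[k] * gain
            shaped[n - k] = shaped[k].conjugate()  # keep the time signal real
    return idft_real(shaped), shaped


# Hypothetical frame: one voiced band followed by two unvoiced bands.
bands = [(1, 8, True, 1.0), (8, 16, False, 2.0), (16, 32, False, 0.5)]
samples, shaped = synth_unvoiced(64, bands, random.Random(0))
```

The returned `samples` contain noise only in the unvoiced bands; the voiced band is left empty so that the separately generated voiced component can fill it.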
A different technique is used for generating the voiced component of the speech in the MBE approach. Since the voiced speech is modeled by its individual harmonics in the frequency domain, it can be implemented at the decoder as a bank of tuned oscillators. An oscillator is defined by its amplitude, frequency and phase, and is assigned to each harmonic in the voiced regions of a frame.
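The oscillator bank can be sketched as a sum of cosines, one per voiced harmonic. The parameter names, the 8 kHz sampling rate and the example values below are assumptions, and the sketch holds the parameters constant over the frame, ignoring any smoothing between frames.

```python
import math


def synth_voiced(f0, amplitudes, phases, voiced, fs, n):
    """Sum one cosine oscillator per voiced harmonic k at frequency k * f0."""
    out = []
    for t in range(n):
        sample = 0.0
        for k, (amp, phase, is_voiced) in enumerate(
                zip(amplitudes, phases, voiced), start=1):
            if is_voiced:
                sample += amp * math.cos(2 * math.pi * k * f0 * t / fs + phase)
        out.append(sample)
    return out


# Hypothetical frame: 200 Hz pitch, three harmonics, the second one unvoiced.
frame = synth_voiced(
    f0=200.0,
    amplitudes=[1.0, 0.5, 0.25],
    phases=[0.0, 0.0, 0.0],
    voiced=[True, False, True],
    fs=8000.0,
    n=160,
)
```

With zero phases the first sample is simply the sum of the voiced amplitudes, which makes the sketch easy to check by hand.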
However, variations in the estimated parameters of adjacent frames may cause discontinuities at the edges of the frames, resulting in a significant degradation of speech quality. Thus, during synthesis, the parameters of both the current and previous frames are interpolated to ensure a smooth transition, so that the voiced speech remains continuous at the frame boundaries.
Different implementations of interpolation schemes (for amplitude, frequency and phase) are possible. However, these schemes are generally satisfactory only under steady pitch. When the pitch changes sharply, the processing rules do not lead to satisfactory results, because harmonics are traditionally laced between neighboring speech frames by frequency band number. When the pitch frequency changes, a frequency difference appears between the laced harmonics, and under this conventional correspondence of harmonic bands the difference grows with the band number and with the degree of pitch change. As a result, annoying artifacts appear in the decoded speech.
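The growth of the mismatch with band number follows directly from the harmonic frequencies: harmonic k lies at k times the pitch frequency, so lacing harmonic k of one frame to harmonic k of the next leaves a gap of k times the pitch change. The short sketch below computes these gaps; the function name and the pitch values are illustrative only.

```python
def laced_frequency_gaps(f0_prev, f0_cur, n_harmonics):
    """Frequency gap (Hz) between same-numbered harmonics of adjacent frames."""
    return [k * abs(f0_cur - f0_prev) for k in range(1, n_harmonics + 1)]


# Hypothetical pitch change from 100 Hz to 110 Hz between two frames.
gaps = laced_frequency_gaps(100.0, 110.0, 5)
```

For this 10 Hz pitch change the first harmonic is off by 10 Hz, but the fifth is already off by 50 Hz, and at band 30 the gap would reach 300 Hz, several times the pitch frequency itself.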
Accordingly, an object of the present invention is to solve at least the problems and disadvantages of the related art.
Another object of the present invention is to provide a method that improves the quality of the speech spectrum approximation for both voiced and unvoiced bands.
Another object of the present invention is to improve the encoding efficiency of the spectral magnitude set, regardless of the bit-rate for encoding.
A further object of the present invention is to improve the quality of speech synthesis.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and advantages of the invention may be realized and attained as particularly pointed out in the appended claims.
To achieve the objects and in accordance with the purposes of the invention, as embodied and broadly described herein, the speech spectrum approximation is performed on the spectrum divided into plural bands according to the pitch frequency of the speech frame. The pitch frequency of the speech signal is determined, the frequency bands are built, and a voiced/unvoiced discrimination of the frequency bands is performed. Thereafter, an Analysis by Synthesis method of the speech spectrum approximation is used for calculating the magnitudes.
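One common way to build pitch-dependent bands, assumed here purely for illustration, is to center band k on the k-th harmonic and let it extend half a pitch interval to either side, then convert the band edges to DFT bin indices. The function name and the sampling parameters below are hypothetical.

```python
import math


def harmonic_bands(f0, fs, n_fft):
    """Return (first_bin, end_bin_exclusive) for each harmonic band below fs/2."""
    bands = []
    k = 1
    while (k + 0.5) * f0 <= fs / 2:
        first = math.ceil((k - 0.5) * f0 * n_fft / fs)
        end = math.ceil((k + 0.5) * f0 * n_fft / fs)
        bands.append((first, end))
        k += 1
    return bands


# Hypothetical analysis setup: 200 Hz pitch, 8 kHz sampling, 256-point DFT.
bands = harmonic_bands(f0=200.0, fs=8000.0, n_fft=256)
```

Each band then receives its own voiced/unvoiced decision and its own magnitude. Because the band edges depend on the pitch, both the number of bands and their positions change from frame to frame.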
A more precise evaluation of the harmonic magnitudes at the encoder side results in an increase of quality for the voiced part of the signal reconstruction at the decoder side. Also, a more precise calculation of magnitudes for the unvoiced bands of spectrum results in a quality increase for the noise part of the reconstructed signal. The usage of the Analysis by Synthesis method both for the voiced and unvoiced bands provides a correct correspondence between the voiced and unvoiced parts of the reconstructed signal.
Also, the present invention improves the encoding efficiency of the spectral magnitude set. In low bit-rate encoding, the problem is to represent the spectral magnitude data by a fixed number of bits. With respect to spectral magnitude encoding, the present invention is divided into two main tasks: reducing the original quantity of spectral magnitudes to a fixed number, and encoding the reduced set. The present method solves the first task effectively by using a Wavelet Transform (WT). Applying an inter-frame prediction effectively solves the second task if the speech signal is stationary.
However, at time intervals containing non-stationary signals, prediction is not effective. In this case, applying the Wavelet Transform technique effectively solves the encoding task. The increase in encoding efficiency allows either an improved quality of the reconstructed speech signal at the same bit rate or a reduced bit rate for the same quality level.
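The idea of reducing a variable number of magnitudes to a fixed number with a wavelet transform can be sketched as follows. The choice of the Haar wavelet, the padding rule, and the lengths 32 and 8 are all assumptions made for this example; the text above does not specify them.

```python
import math


def haar_step(values):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail


def reduce_magnitudes(magnitudes, padded_len=32, target_len=8):
    """Pad a non-empty list by repeating its last value, then keep only the
    coarse Haar coefficients. padded_len / target_len must be a power of two."""
    padded = list(magnitudes[:padded_len])
    padded += [padded[-1]] * (padded_len - len(padded))
    coeffs = padded
    while len(coeffs) > target_len:
        coeffs, _ = haar_step(coeffs)  # detail coefficients are discarded
    return coeffs


# Frames with different pitch have different numbers of magnitudes.
short_frame = reduce_magnitudes([1.0] * 19)
long_frame = reduce_magnitudes([float(k) for k in range(27)])
```

Both frames end up with eight coefficients regardless of how many harmonics they started with, so a fixed bit allocation can be applied. Discarding the detail coefficients makes this particular sketch lossy.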
Furthermore, the present invention improves the quality of speech synthesis. The speech synthesis is carried out sequentially for every frame. Because the fundamental frequency is the basis of the whole band division of the spectrum to be approximated, a frequency difference appears between the laced harmonics when the pitch changes. The present invention uses a frequency correspondence between the laced bands of the current and previous frames. This provides a correct and reliable speech synthesis process under pitch frequency changes and pitch frequency jumps. Even gross errors of pitch determination do not lead to dramatic consequences, as they do in conventional schemes.
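One possible reading of a frequency correspondence, offered only as a sketch, is to lace each harmonic of the current frame to the harmonic of the previous frame that lies nearest in frequency rather than to the one with the same number. The nearest-neighbor rule and all names below are assumptions, not the method as claimed.

```python
def match_by_frequency(f0_prev, n_prev, f0_cur, n_cur):
    """For each current harmonic k, find the previous harmonic j nearest in
    frequency. Returns (k, j, gap_in_hz) triples."""
    pairs = []
    for k in range(1, n_cur + 1):
        target = k * f0_cur
        j = min(max(round(target / f0_prev), 1), n_prev)
        pairs.append((k, j, abs(target - j * f0_prev)))
    return pairs


# Hypothetical pitch jump: the estimate doubles from 100 Hz to 200 Hz.
pairs = match_by_frequency(f0_prev=100.0, n_prev=40, f0_cur=200.0, n_cur=20)
```

In this example every current harmonic k is paired with previous harmonic 2k at exactly the same frequency, so the gap is zero in every band, whereas lacing by band number would leave a gap of k times 100 Hz.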