The present invention relates generally to digital voice decoding and, more particularly, to a method and apparatus for using harmonic modeling in an improved speech decoder.
A general diagram of a CELP encoder 100 is shown in FIG. 1A. A CELP encoder uses a model of the human vocal tract in order to reproduce a speech input signal. The parameters for the model are extracted from the speech signal being reproduced, and it is these parameters that are sent to a decoder 112, which is illustrated in FIG. 1B. Decoder 112 uses the parameters to reproduce the speech signal. Referring to FIG. 1A, synthesis filter 104 is a linear predictive filter and serves as the vocal tract model for CELP encoder 100. Synthesis filter 104 takes an input excitation signal μ(n) and synthesizes a speech signal s(n) by modeling the correlations introduced into speech by the vocal tract and applying them to the excitation signal μ(n).
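The action of a linear predictive synthesis filter such as filter 104 can be sketched as a direct-form all-pole recursion. The coefficient naming and sign convention below are assumptions for illustration, not taken from this document:

```python
import numpy as np

def synthesis_filter(excitation, lpc_coeffs):
    """All-pole LPC synthesis: s(n) = mu(n) + sum_k a_k * s(n-1-k).

    `lpc_coeffs` holds hypothetical predictor coefficients a_1..a_p for
    one frame; the sign convention is an assumption for illustration.
    """
    p = len(lpc_coeffs)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        # Predict the current sample from previously synthesized samples,
        # then add the excitation sample for this instant.
        predicted = sum(lpc_coeffs[k] * s[n - 1 - k] for k in range(min(p, n)))
        s[n] = excitation[n] + predicted
    return s
```

For example, a single impulse excitation with one coefficient a_1 = 0.5 yields the exponentially decaying output 1, 0.5, 0.25, …, illustrating how the filter imposes correlation on an uncorrelated excitation.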
In CELP encoder 100, speech is broken up into frames, usually 20 ms each, and parameters for synthesis filter 104 are determined for each frame. Once the parameters are determined, an excitation signal μ(n) is chosen for that frame. The excitation signal is then passed through synthesis filter 104, producing a synthesized speech signal s′(n). The synthesized frame s′(n) is then compared to the actual speech input frame s(n), and a difference or error signal e(n) is generated by subtractor 106. The subtraction function is typically accomplished via an adder or similar functional component, as those skilled in the art will be aware. Excitation signal μ(n) is generated from a predetermined set of possible signals by excitation generator 102. In CELP encoder 100, every signal in the predetermined set is tried in order to find the one that produces the smallest error signal e(n). Once this particular excitation signal μ(n) is found, the signal and the corresponding filter parameters are sent to decoder 112 (FIG. 1B), which reproduces the synthesized speech signal s′(n). Signal s′(n) is reproduced in decoder 112 by taking an excitation signal μ(n), as generated by decoder excitation generator 114, and synthesizing it using decoder synthesis filter 116.
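This exhaustive analysis-by-synthesis search might be sketched as follows. The codebook contents and the plain squared-error criterion are illustrative simplifications; a practical CELP encoder measures a perceptually weighted error:

```python
import numpy as np

def synthesize(excitation, coeffs):
    """All-pole synthesis of one frame (same recursion as a filter
    like synthesis filter 104; coefficient convention is assumed)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        s[n] = excitation[n] + sum(
            coeffs[k] * s[n - 1 - k] for k in range(min(len(coeffs), n)))
    return s

def search_codebook(target_frame, codebook, coeffs):
    """Try every candidate excitation in the codebook, synthesize it,
    and return the index whose squared error ||s(n) - s'(n)||^2
    against the input frame is smallest."""
    best_index, best_error = 0, np.inf
    for i, candidate in enumerate(codebook):
        error = np.sum((target_frame - synthesize(candidate, coeffs)) ** 2)
        if error < best_error:
            best_index, best_error = i, error
    return best_index
```

Only `best_index` (and the frame's filter parameters) need to be transmitted; the decoder regenerates the same candidate from its own copy of the codebook.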
By choosing the excitation signal that produces the smallest error signal e(n), a very good approximation of speech input s(n) can be reproduced in decoder 112. The spectrum of error signal e(n), however, will be very flat, as illustrated by curve 204 in FIG. 2. This flatness can create problems in that the signal-to-noise ratio (SNR), with regard to synthesized speech signal s′(n) (curve 202), may become too small for effective reproduction of speech signal s(n). This problem is especially prevalent at higher frequencies where, as illustrated in FIG. 2, there is typically less energy in the spectrum of s′(n). To combat this problem, CELP encoder 100 includes a feedback path that incorporates error weighting filter 108. The function of error weighting filter 108 is to shape the spectrum of error signal e(n) so that the noise spectrum is concentrated in areas of high voice content. In effect, the shape of the noise spectrum associated with the weighted error signal ew(n) tracks the spectrum of the synthesized speech signal s′(n), as illustrated in FIG. 2 by curve 206. In this manner, the SNR is improved and the quality of the reproduced speech is increased.
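A common realization of such an error weighting filter (not necessarily the one used by filter 108) is W(z) = A(z/γ1)/A(z/γ2) with 0 < γ2 < γ1 ≤ 1, where A(z) is the prediction-error polynomial of the synthesis filter. A minimal sketch, with illustrative γ values:

```python
import numpy as np

def weighting_filter_coeffs(a, gamma1=0.9, gamma2=0.6):
    """Return numerator/denominator coefficients of W(z) = A(z/g1)/A(z/g2).

    `a` is [1, a_1, ..., a_p], the prediction-error polynomial.
    Scaling a_k by gamma**k expands the formant bandwidths, so the
    weighted error is de-emphasized near spectral peaks (where noise
    is masked by the speech) and emphasized in the spectral valleys.
    The gamma values here are typical illustrations, not taken from
    this document.
    """
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    return a * gamma1 ** k, a * gamma2 ** k
```

Because both γ exponents start at k = 0, the leading coefficient remains 1 and only the pole/zero radii are shrunk, which is exactly the bandwidth-expansion effect that lets the noise spectrum track the speech spectrum.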
In encoder 100 and decoder 112, the vocal tract model works by assuming that speech signal s(n) remains constant for short periods of time. Speech signal s(n) is not constant, however; because speech signal s(n) (curve 302 in FIG. 3) is actually changing all the time, noise is induced in the quantized excitation signal μ(n). As a result, the spectrum (curve 304 in FIG. 3) of quantized signal μ(n) is not as smooth or periodic as the spectrum of speech signal s(n). The result is that synthesized speech signal s′(n) (curve 306 in FIG. 3), in decoder 112, produces noisy speech that does not sound as good as the actual speech signal s(n). Ideally, the synthesized speech would sound very close to the actual speech and thus provide a good listening experience.
There is provided a speech decoder comprising a means for generating an excitation signal and a means for performing harmonic analysis and synthesis on the excitation signal in order to generate a smooth, periodic speech signal. The speech decoder further comprises a mixing means for mixing the excitation signal with the smooth, periodic signal to produce a modified excitation signal, and a synthesizing means for synthesizing the modified excitation signal into a speech signal that can be played to a user through a listening means.
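As one illustration of this structure, the decoder's flow might be sketched as below. The particular harmonic analysis (projecting the excitation onto a few pitch harmonics) and the mixing weight `alpha` are assumptions for illustration only, not details taken from this document:

```python
import numpy as np

def harmonic_smooth(excitation, pitch_period, num_harmonics=8):
    """Toy harmonic analysis/synthesis: project the excitation onto the
    first few harmonics of the pitch frequency and resynthesize,
    yielding a smooth, periodic version of the excitation.
    (Illustrative only; the harmonic count is an assumption.)"""
    n = np.arange(len(excitation))
    w0 = 2 * np.pi / pitch_period
    smooth = np.zeros(len(excitation))
    for h in range(1, num_harmonics + 1):
        c = np.exp(1j * h * w0 * n)
        amp = np.vdot(c, excitation) / len(excitation)  # complex amplitude
        smooth += 2 * np.real(amp * c)
    return smooth

def decode_frame(excitation, pitch_period, coeffs, alpha=0.5):
    """Mix the raw excitation with its harmonically smoothed version,
    then run the all-pole synthesis filter. `alpha` is a hypothetical
    mixing weight."""
    mixed = (1 - alpha) * excitation \
        + alpha * harmonic_smooth(excitation, pitch_period)
    s = np.zeros(len(mixed))
    for i in range(len(mixed)):
        s[i] = mixed[i] + sum(
            coeffs[k] * s[i - 1 - k] for k in range(min(len(coeffs), i)))
    return s
```

The mixing step is what distinguishes this decoder: the smooth, periodic component restores the harmonic regularity that quantization removed from μ(n), while retaining part of the original excitation so that unvoiced detail is not lost.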
There is also provided a receiver that incorporates a speech decoder such as the decoder described above as well as a method for speech decoding. These and other embodiments as well as further features and advantages of the invention are described in detail below.