A source-filter model of speech is illustrated schematically in FIG. 1a. As shown, speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104. The source signal represents the immediate vibration of the vocal cords, and the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue. The effect of the filter is to alter the frequency profile of the source signal so as to emphasise or diminish certain frequencies. Instead of trying to directly represent an actual waveform, speech encoding works by representing the speech using parameters of a source-filter model.
As illustrated schematically in FIG. 1b, the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108. For example, speech may be sampled at 16 kHz and processed in frames of 20 ms, with some of the processing done in subframes of 5 ms (four subframes per frame). Each frame comprises a flag 107 by which it is classed according to its respective type. Each frame is thus classed at least as either “voiced” or “unvoiced”, and unvoiced frames are encoded differently than voiced frames. Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
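The frame arithmetic in the example above can be checked with a short sketch. The function names `samples_for` and `split_into_frames` are hypothetical and introduced only for illustration; the sketch covers the division into frames and subframes only, not the classification flag 107 or the per-subframe parameters.

```python
SAMPLE_RATE_HZ = 16000  # example sampling rate from the text
FRAME_MS = 20           # example frame duration
SUBFRAME_MS = 5         # example subframe duration


def samples_for(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples spanned by duration_ms at the given sample rate."""
    return sample_rate_hz * duration_ms // 1000


def split_into_frames(signal, frame_len, subframe_len):
    """Split a list of samples into whole frames, each a list of subframes.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        subframes = [frame[i:i + subframe_len]
                     for i in range(0, frame_len, subframe_len)]
        frames.append(subframes)
    return frames


FRAME_LEN = samples_for(FRAME_MS)
SUBFRAME_LEN = samples_for(SUBFRAME_MS)

# A dummy signal of 1000 samples stands in for real speech.
frames = split_into_frames(list(range(1000)), FRAME_LEN, SUBFRAME_LEN)
```

At 16 kHz a 20 ms frame therefore spans 320 samples and each 5 ms subframe spans 80 samples.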
For voiced sounds (e.g. vowel sounds), the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice. In that case, the source signal can be modelled as comprising a quasi-periodic signal, with each period corresponding to a respective “pitch pulse” comprising a series of peaks of differing amplitudes. The source signal is said to be “quasi” periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant; but over many subframes or frames the period and form of the signal may change. The approximated period at any given point may be referred to as the pitch lag. An example of a modelled source signal 202 is shown schematically in FIG. 2a with a gradually varying period P1, P2, P3, etc., each comprising a pitch pulse of four peaks which may vary gradually in form and amplitude from one period to the next.
According to many speech coding algorithms such as those using Linear Predictive Coding (LPC), a short-term filter is used to separate out the speech signal into two separate components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal. The signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage. FIG. 2b shows a schematic example of a sequence of spectral envelopes 204₁, 204₂, 204₃, etc. varying over time. Once the varying spectral envelope is removed, the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in FIG. 2a. The short-term filter works by removing short-term correlations (i.e. short term compared to the pitch period), leading to an LPC residual with less energy than the speech signal.
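The separation described above can be illustrated with a minimal sketch of one common way of performing the short-term analysis, namely the autocorrelation method solved by the Levinson-Durbin recursion. The text does not specify which analysis method is used, so this choice is an assumption, as are the function names and the synthetic second-order test signal standing in for speech.

```python
import random


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a[0..order-1] such that x[n] is
    approximated by sum(a[k] * x[n - 1 - k])."""
    a = [0.0] * order
    error = r[0]
    for m in range(order):
        reflection = (r[m + 1] - sum(a[k] * r[m - k] for k in range(m))) / error
        previous = a[:]
        a[m] = reflection
        for k in range(m):
            a[k] = previous[k] - reflection * previous[m - 1 - k]
        error *= 1.0 - reflection * reflection
    return a


def lpc_residual(x, a):
    """Subtract the short-term prediction from each sample."""
    return [x[n] - sum(a[k] * x[n - 1 - k]
                       for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]


def energy(signal):
    return sum(v * v for v in signal)


# Synthetic stand-in for speech (an assumption): white noise shaped by a
# fixed two-coefficient filter, x[n] = 1.3 x[n-1] - 0.6 x[n-2] + noise.
rng = random.Random(0)
speech = []
for n in range(2000):
    past1 = speech[n - 1] if n >= 1 else 0.0
    past2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(1.3 * past1 - 0.6 * past2 + rng.gauss(0.0, 1.0))

coefs = levinson_durbin(autocorrelation(speech, 2), 2)
residual = lpc_residual(speech, coefs)
```

In this sketch the estimated coefficients play the role of one set of LPC parameters, and `residual` plays the role of the LPC residual, whose energy is lower than that of the input because the short-term correlations have been removed.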
The spectral envelope signal and the source signal are each encoded separately for transmission. In the illustrated example, each subframe 108 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) an LPC residual signal representing the source signal 202 with the effect of the short-term correlations removed.
To improve the encoding of the source signal, its periodicity may be exploited. To do this, a long-term prediction (LTP) analysis is used to determine the correlation of the LPC residual signal with itself from one period to the next, i.e. the correlation between the LPC residual signal at the current time and the LPC residual signal after one period at the current pitch lag (correlation being a statistical measure of a degree of relationship between groups of data, in this case the degree of repetition between portions of a signal). In this context the source signal can be said to be “quasi” periodic in that on a timescale of at least one correlation calculation it can be taken to have a meaningful period which is approximately (but not exactly) constant; but over many such calculations the period and form of the source signal may change more significantly. A set of parameters derived from this correlation is determined to at least partially represent the source signal for each subframe. The set of parameters for each subframe is typically a set of coefficients C of a series, which form a respective vector C_LTP = (C_1, C_2, ..., C_i).
The effect of this inter-period correlation is then removed from the LPC residual, leaving an LTP residual signal representing the source signal with the effect of the correlation between pitch periods removed. To represent the source signal, the LTP vectors and LTP residual signal are encoded separately for transmission.
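A minimal sketch of the long-term prediction analysis follows. It makes two simplifying assumptions that are not taken from the text: the pitch lag is found by an exhaustive search for the maximum normalised correlation, and the LTP vector is reduced to a single coefficient (one tap) rather than a vector of several coefficients. The function names and the synthetic pulse train are hypothetical.

```python
import math


def find_pitch_lag(signal, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalised
    correlation between the signal and its delayed copy."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        current = signal[lag:]
        delayed = signal[:len(signal) - lag]
        cross = sum(c * d for c, d in zip(current, delayed))
        norm = math.sqrt(sum(c * c for c in current) *
                         sum(d * d for d in delayed))
        score = cross / norm if norm > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def ltp_gain(signal, lag):
    """Least-squares single-tap coefficient for predicting signal[n]
    from signal[n - lag]."""
    current = signal[lag:]
    delayed = signal[:len(signal) - lag]
    denom = sum(d * d for d in delayed)
    return sum(c * d for c, d in zip(current, delayed)) / denom if denom > 0.0 else 0.0


def ltp_residual(signal, lag, gain):
    """Subtract the long-term prediction; the first `lag` samples have no
    history and are passed through unchanged."""
    return [s - gain * signal[n - lag] if n >= lag else s
            for n, s in enumerate(signal)]


def energy(signal):
    return sum(v * v for v in signal)


# Synthetic LPC residual (an assumption): a four-peak pitch pulse repeating
# every 50 samples, with amplitude growing gradually from period to period.
PERIOD = 50
PULSE = [1.0, -0.6, 0.3, 0.1]
lpc_res = [0.0] * 400
for k in range(400 // PERIOD):
    amplitude = 1.0 + 0.05 * k
    for j, p in enumerate(PULSE):
        lpc_res[k * PERIOD + j] = amplitude * p

lag = find_pitch_lag(lpc_res, 20, 80)
gain = ltp_gain(lpc_res, lag)
ltp_res = ltp_residual(lpc_res, lag, gain)
```

Because the gain is chosen by least squares, the LTP residual over the predicted samples cannot have more energy than the LPC residual over the same samples, and for a strongly periodic signal such as this one it has far less.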
The sets of LPC parameters, the LTP vectors and the LTP residual signal are each quantised prior to transmission (quantisation being the process of converting a continuous range of values into a set of discrete values, or a larger approximately continuous set of discrete values into a smaller set of discrete values). The advantage of separating out the LPC residual signal into the LTP vectors and LTP residual signal is that the LTP residual typically has a lower energy than the LPC residual, and so requires fewer bits to quantize.
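The relationship between signal energy and bit cost can be illustrated with a uniform scalar quantiser, sketched below. The bit count used here is that of a fixed-length code covering the observed range of indices, which is an illustrative assumption only; a practical codec might instead use entropy coding or another quantiser type. The sample values and function names are hypothetical.

```python
import math


def quantise(samples, step):
    """Map each sample to the index of the nearest multiple of step."""
    return [round(s / step) for s in samples]


def dequantise(indices, step):
    """Reconstruct the discrete values from their indices."""
    return [i * step for i in indices]


def bits_per_sample(indices):
    """Bits for a fixed-length code covering the observed index range."""
    levels = max(indices) - min(indices) + 1
    return max(1, math.ceil(math.log2(levels)))


STEP = 0.25

# Hypothetical values: a higher-energy signal standing in for the LPC
# residual and a lower-energy one standing in for the LTP residual.
lpc_like = [-7.9, 3.2, 6.5, -1.1, 0.4, 7.7]
ltp_like = [v / 8 for v in lpc_like]

lpc_indices = quantise(lpc_like, STEP)
ltp_indices = quantise(ltp_like, STEP)

lpc_bits = bits_per_sample(lpc_indices)
ltp_bits = bits_per_sample(ltp_indices)

# The reconstruction error of each sample is at most half a step.
max_error = max(abs(s - q)
                for s, q in zip(ltp_like, dequantise(ltp_indices, STEP)))
```

For the same step size, and hence the same bound on quantisation error, the lower-energy signal spans fewer quantisation levels and so needs fewer bits per sample.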
So in the illustrated example, each subframe 108 would comprise: (i) a quantised set of LPC parameters representing the spectral envelope, (ii)(a) a quantised LTP vector related to the correlation between pitch periods in the source signal, and (ii)(b) a quantised LTP residual signal representative of the source signal with the effects of this inter-period correlation removed.
In contrast with voiced sounds, for unvoiced sounds such as plosives (e.g. “T” or “P” sounds) the modelled source signal has no substantial degree of periodicity. In that case, long-term prediction (LTP) cannot be used and the LPC residual signal representing the modelled source signal is instead encoded differently, e.g. by being quantized directly.
FIG. 3a shows a diagram of a linear predictive speech encoder 300 comprising an LPC synthesis filter 306 having a short-term predictor 308 and an LTP synthesis filter 304 having a long-term predictor 310. The output of the short-term predictor 308 is subtracted from the speech input signal to produce an LPC residual signal. The output of the long-term predictor 310 is subtracted from the LPC residual signal to create an LTP residual signal. The LTP residual signal is quantized by a quantizer 302 to produce an excitation signal, and to produce corresponding quantisation indices for transmission to a decoder to allow it to recreate the excitation signal. The quantizer 302 can be a scalar quantizer, a trellis quantizer, a vector quantizer, an algebraic codebook quantizer, or any other suitable quantizer. The output of the long-term predictor 310 in the LTP synthesis filter 304 is added to the excitation signal, which creates the LPC excitation signal. The LPC excitation signal is input to the long-term predictor 310, which is a strictly causal moving average (MA) filter controlled by the pitch lag and quantized LTP coefficients. The output of the short-term predictor 308 in the LPC synthesis filter 306 is added to the LPC excitation signal, which creates the quantized output signal that is fed back for subtraction from the input. The quantized output signal is input to the short-term predictor 308, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
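The signal flow of the encoder 300 can be sketched sample by sample as follows. This is a minimal sketch under stated assumptions: the quantizer 302 is taken to be a uniform scalar quantizer with a fixed step, the pitch lag and coefficients are fixed for the whole signal rather than updated per subframe, the LTP taps are centred on the pitch lag, and the function name `encode` and all numeric values are hypothetical.

```python
import math


def encode(speech, pitch_lag, ltp_coefs, lpc_coefs, step):
    """Closed-loop encoder sketch. Returns (indices, quantised_output)."""
    indices = []
    lpc_excitation = []  # memory of the long-term predictor
    output = []          # memory of the short-term predictor
    half = len(ltp_coefs) // 2
    for n, x in enumerate(speech):
        # Short-term prediction from past quantised output samples only.
        short_pred = sum(a * output[n - 1 - k]
                         for k, a in enumerate(lpc_coefs) if n - 1 - k >= 0)
        lpc_residual = x - short_pred
        # Long-term prediction from the LPC excitation roughly one pitch
        # lag in the past.
        long_pred = 0.0
        for j, c in enumerate(ltp_coefs):
            idx = n - (pitch_lag - half + j)
            if 0 <= idx < n:
                long_pred += c * lpc_excitation[idx]
        ltp_residual = lpc_residual - long_pred
        # Quantise the LTP residual (assumed uniform scalar quantiser).
        index = round(ltp_residual / step)
        excitation = index * step
        indices.append(index)
        # Rebuild the signals exactly as a decoder would.
        lpc_excitation.append(excitation + long_pred)
        output.append(lpc_excitation[n] + short_pred)
    return indices, output


# Hypothetical input and parameters for illustration only.
STEP = 0.05
speech = [math.sin(2.0 * math.pi * n / 40.0) for n in range(200)]
indices, quantised_output = encode(speech, pitch_lag=40,
                                   ltp_coefs=[0.1, 0.6, 0.1],
                                   lpc_coefs=[1.3, -0.6], step=STEP)
max_error = max(abs(x - y) for x, y in zip(speech, quantised_output))
```

Because both predictors operate on reconstructed signals, the difference between each input sample and the corresponding quantised output sample equals the quantisation error of the LTP residual for that sample, which for this assumed quantiser is at most half a step.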
FIG. 3b shows a linear predictive speech decoder 350. Quantization indices are input to an excitation generator 352 which generates an excitation signal. The output of a long-term predictor 360 in an LTP synthesis filter 354 is added to the excitation signal, which creates the LPC excitation signal. The LPC excitation signal is input to the long-term predictor 360, which is a strictly causal MA filter controlled by the pitch lag and quantized LTP coefficients. The output of a short-term predictor 358 in a short-term synthesis filter 356 is added to the LPC excitation signal, which creates the quantized output signal. The quantized output signal is input to the short-term predictor 358, which is a strictly causal MA filter controlled by the quantized LPC coefficients.
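The decoder signal flow described above can be sketched as follows. The sketch assumes that the excitation generator 352 simply scales each quantization index by a fixed step, that the LTP taps are centred on the pitch lag, and that the pitch lag and coefficients are fixed for the whole signal; the function names and numeric values are hypothetical.

```python
def generate_excitation(indices, step):
    """Assumed excitation generator: scale each index by the step size."""
    return [i * step for i in indices]


def decode(excitation, pitch_lag, ltp_coefs, lpc_coefs):
    """LTP synthesis followed by short-term synthesis, sample by sample."""
    lpc_excitation = []  # memory of the long-term predictor
    output = []          # memory of the short-term predictor
    half = len(ltp_coefs) // 2
    for n, e in enumerate(excitation):
        # Long-term predictor: strictly causal, uses past LPC excitation.
        long_pred = 0.0
        for j, c in enumerate(ltp_coefs):
            idx = n - (pitch_lag - half + j)
            if 0 <= idx < n:
                long_pred += c * lpc_excitation[idx]
        lpc_excitation.append(e + long_pred)
        # Short-term predictor: strictly causal, uses past output samples.
        short_pred = sum(a * output[n - 1 - k]
                         for k, a in enumerate(lpc_coefs) if n - 1 - k >= 0)
        output.append(lpc_excitation[n] + short_pred)
    return output


# A unit impulse shows each synthesis filter in isolation.
pitch_only = decode([1.0] + [0.0] * 8, pitch_lag=4,
                    ltp_coefs=[0.5], lpc_coefs=[])
envelope_only = decode([1.0, 0.0, 0.0, 0.0], pitch_lag=4,
                       ltp_coefs=[0.0], lpc_coefs=[0.5])
```

With only the long-term predictor active, the impulse recurs every pitch lag at half the previous amplitude; with only the short-term predictor active, it decays from one sample to the next.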
The encoder 300 works by using an LPC analysis (not shown) to determine a short-term correlation in recently received samples of the speech signal, then passing coefficients of that correlation to the LPC synthesis filter 306 to predict following samples. The predicted samples are fed back to the input where they are subtracted from the speech signal, thus removing the effect of the spectral envelope and thereby deriving an LPC residual signal representing the modelled source of the speech. In the case of voiced frames, the encoder 300 also uses an LTP analysis (not shown) to determine a correlation between successive received pitch pulses in the LPC residual signal, then passes coefficients of that correlation to the LTP synthesis filter 304 where they are used to generate a predicted version of the later of those pitch pulses from the last stored one of the preceding pitch pulses. The predicted pitch pulse is fed back to the input where it is subtracted from the corresponding portion of the actual LPC residual signal, thus removing the effect of the periodicity and thereby deriving an LTP residual signal. Put another way, the LTP synthesis filter uses a long-term prediction to effectively remove or reduce the pitch pulses from the LPC residual signal, leaving an LTP residual signal having lower energy than the LPC residual.
An aim of the above techniques is to recreate more natural sounding speech without incurring the bitrate that would be required to directly represent the waveform of the immediate speech signal. However, a certain perceived coarseness in the sound quality of the speech can still be caused due to the quantization, e.g. of the quantised LTP residual in the case of voiced sounds or the quantized LPC residual in the case of unvoiced sounds. It would be desirable to find a way of reducing this quantization distortion without incurring undue bitrate in the encoded signal, i.e. to improve the rate-distortion performance.