A source-filter model of speech is illustrated schematically in FIG. 1a. As shown, speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104. The source signal represents the immediate vibration of the vocal chords, and the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue. The effect of the filter is to alter the frequency profile of the source signal so as to emphasise or diminish certain frequencies. Instead of trying to directly represent an actual waveform, speech encoding works by representing the speech using parameters of a source-filter model.
As illustrated schematically in FIG. 1b, the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108. For example, speech may be sampled at 16 kHz and processed in frames of 20 ms, with some of the processing done in subframes of 5 ms (four subframes per frame). Each frame comprises a flag 107 by which it is classed according to its respective type. Each frame is thus classed at least as either “voiced” or “unvoiced”, and unvoiced frames are encoded differently than voiced frames. Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
For voiced sounds (e.g. vowel sounds), the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice. In that case, the source signal can be modelled as comprising a quasi-periodic signal, with each period corresponding to a respective “pitch pulse” comprising a series of peaks of differing amplitudes. The source signal is said to be “quasi” periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant; but over many subframes or frames then the period and form of the signal may change. The approximated period at any given point may be referred to as the pitch lag. The pitch lag can be measured in time or as a number of samples. An example of a modelled source signal 202 is shown schematically in FIG. 2a with a gradually varying period P1, P2, P3, etc., each comprising a pitch pulse of four peaks which may vary gradually in form and amplitude from one period to the next.
According to many speech coding algorithms such as those using Linear Predictive Coding (LPC), a short-term filter is used to separate out the speech signal into two separate components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal. The signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage. FIG. 2b shows a schematic example of a sequence of spectral envelopes 2041, 2042, 2043, etc. varying over time. Once the varying spectral envelope is removed, the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in FIG. 2a. The short-term filter works by removing short-term correlations (i.e. short term compared to the pitch period), leading to an LPC residual with less energy than the speech signal.
The spectral envelope signal and the source signal are each encoded separately for transmission. In the illustrated example, each subframe 106 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) an LPC residual signal representing the source signal 202 with the effect of the short-term correlations removed.
To improve the encoding of the source signal, its periodicity may be exploited. To do this, a long-term prediction (LTP) analysis is used to determine the correlation of the LPC residual signal with itself from one period to the next, i.e. the correlation between the LPC residual signal at the current time and the LPC residual signal after one period at the current pitch lag (correlation being a statistical measure of a degree of relationship between groups of data, in this case the degree of repetition between portions of a signal). In this context the source signal can be said to be “quasi” periodic in that on a timescale of at least one correlation calculation it can be taken to have a meaningful period which is approximately (but not exactly) constant; but over many such calculations then the period and form of the source signal may change more significantly. A set of parameters derived from this correlation are determined to at least partially represent the source signal for each subframe. The set of parameters for each subframe is typically a set of coefficients of a series, which form a respective vector.
The effect of this inter-period correlation is then removed from the LPC residual, leaving an LTP residual signal representing the source signal with the effect of the correlation between pitch periods removed. To represent the source signal, the LTP vectors and LTP residual signal are encoded separately for transmission. In the encoder, an LTP analysis filter uses one or more pitch lags with the LTP coefficients to compute the LTP residual signal from the LPC residual.
The pitch lags, the LTP vectors and the LTP residual signal are sent to the decoder together with the coded LTP residual, and used to construct the speech output signal. They are each quantised prior to transmission (quantisation being the process of converting a continuous range of values into a set of discrete values, or a larger approximately continuous set of discrete values into a smaller set of discrete values). The advantage of separating out the LPC residual signal into the LTP vectors and LTP residual signal is that the LTP residual typically has a lower energy than the LPC residual, and so requires fewer bits to quantize.
So in the illustrated example, each subframe 106 would comprise: (i) a quantised set of LPC parameters (including pitch lags) representing the spectral envelope, (ii)(a) a quantised LTP vector related to the correlation between pitch periods in the source signal, and (ii)(b) a quantised LTP residual signal representative of the source signal with the effects of this inter-period correlation removed.
In order to minimise the LTP residual it is advantageous to update the pitch lags frequently. Typically, a new pitch lag is defined every subframe of 5 or 10 ms. However, transmitting pitch lags comes at a cost in bit rate, as it typically takes 6 to 8 bits to encode one pitch lag.
One approach to reduce the cost in bit rate is to specify the pitch lags to some of the subframes relative to the lag of the preceding subframes. By not allowing lag difference to exceed a certain range, the relative lag requires fewer bits for encoding.
The restriction on lag difference however can lead to inaccurate or unnatural pitch lags which then affect speech decoding.