A source-filter model of speech is illustrated schematically in FIG. 1a. As shown, speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104. The source signal represents the immediate vibration of the vocal cords, and the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue. The effect of the filter is to alter the frequency profile of the source signal so as to emphasize or diminish certain frequencies. Instead of trying to directly represent an actual waveform, speech encoding works by representing the speech using parameters of a source-filter model.
As illustrated schematically in FIG. 1b, the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108. For example, speech may be sampled at 16 kHz and processed in frames of 20 ms, with some of the processing done in subframes of 5 ms (four subframes per frame). Each frame comprises a flag 107 by which it is classed according to its respective type. Each frame is thus classed at least as either “voiced” or “unvoiced”, and unvoiced frames are encoded differently than voiced frames. Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
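By way of illustration only, the framing arithmetic of the example above (16 kHz sampling, 20 ms frames, 5 ms subframes) can be sketched as follows. The function name `split_into_frames` and the constants are hypothetical names chosen for this sketch; the frame type flag 107 and the model parameters are not represented.

```python
SAMPLE_RATE_HZ = 16000   # sampling rate from the example above
FRAME_MS = 20            # frame duration
SUBFRAME_MS = 5          # subframe duration

def split_into_frames(samples):
    """Split a list of samples into frames, each a list of subframes.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000        # 320 samples
    subframe_len = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000  # 80 samples
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        subframes = [frame[i:i + subframe_len]
                     for i in range(0, frame_len, subframe_len)]
        frames.append(subframes)
    return frames

one_second = [0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]), len(frames[0][0]))  # 50 4 80
```

One second of speech therefore yields 50 frames of four subframes each, with 80 samples per subframe.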
For voiced sounds (e.g. vowel sounds), the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice. In that case, the source signal can be modelled as comprising a quasi-periodic signal, with each period corresponding to a respective “pitch pulse” comprising a series of peaks of differing amplitudes. The source signal is said to be “quasi” periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant; but over many subframes or frames the period and form of the signal may change. The approximated period at any given point may be referred to as the pitch lag. An example of a modelled source signal 202 is shown schematically in FIG. 2a with a gradually varying period P1, P2, P3, etc., each comprising a pitch pulse of four peaks which may vary gradually in form and amplitude from one period to the next.
According to many speech coding algorithms such as those using Linear Predictive Coding (LPC), a short-term filter is used to separate out the speech signal into two separate components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal. The signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage. FIG. 2b shows a schematic example of a sequence of spectral envelopes 204₁, 204₂, 204₃, etc. varying over time. Once the varying spectral envelope is removed, the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in FIG. 2a. The short-term filter works by removing short-term correlations (i.e. short term compared to the pitch period), leading to an LPC residual with less energy than the speech signal.
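The separation into a spectral envelope and an LPC residual may be sketched, under stated assumptions, as follows. The sketch assumes the autocorrelation method with a Levinson-Durbin recursion, which is one common way of obtaining LPC parameters and is not necessarily the method of any particular codec. The function names, the predictor order of 2 and the toy two-pole "vocal tract" are all hypothetical choices made for illustration.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag]."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients c[0..order-1] with x[n] ~ sum c[i] * x[n-1-i]."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def lpc_residual(x, coeffs):
    """Subtract the short-term prediction from each sample."""
    residual = []
    for n in range(len(x)):
        prediction = sum(c * x[n - 1 - i]
                         for i, c in enumerate(coeffs) if n - 1 - i >= 0)
        residual.append(x[n] - prediction)
    return residual

def energy(x):
    return sum(v * v for v in x)

# Toy signal (assumption): an impulse train (source) passed through a
# fixed two-pole resonator standing in for the filter 104.
signal = []
for n in range(320):
    excitation = 1.0 if n % 80 == 0 else 0.0
    y1 = signal[n - 1] if n >= 1 else 0.0
    y2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(excitation + 1.3 * y1 - 0.6 * y2)

coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
residual = lpc_residual(signal, coeffs)
print(energy(residual) < energy(signal))  # True
```

The final comparison illustrates the statement above that the LPC residual carries less energy than the speech signal it was derived from.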
The spectral envelope signal and the source signal are each encoded separately for transmission. In the illustrated example, each subframe 108 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) an LPC residual signal representing the source signal 202 with the effect of the short-term correlations removed.
To improve the encoding of the source signal, its periodicity may be exploited. To do this, a long-term prediction (LTP) analysis is used to determine the correlation of the LPC residual signal with itself from one period to the next, i.e. the correlation between the LPC residual signal at the current time and the LPC residual signal after one period at the current pitch lag (correlation being a statistical measure of a degree of relationship between groups of data, in this case the degree of repetition between portions of a signal). In this context the source signal can be said to be “quasi” periodic in that on a timescale of at least one correlation calculation it can be taken to have a meaningful period which is approximately (but not exactly) constant; but over many such calculations the period and form of the source signal may change more significantly. A set of parameters derived from this correlation is determined to at least partially represent the source signal for each subframe. The set of parameters for each subframe is typically a set of coefficients of a series, which form a respective vector.
The effect of this inter-period correlation is then removed from the LPC residual, leaving an LTP residual signal representing the source signal with the effect of the correlation between pitch periods removed. To represent the source signal, the LTP vectors and LTP residual signal are encoded separately for transmission.
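A minimal sketch of a long-term prediction step is given below. It is a simplification: it assumes a single prediction coefficient (one tap) rather than a vector of coefficients, and it picks the pitch lag by an exhaustive normalized-correlation search. The function names, the lag search range and the synthetic LPC residual are hypothetical.

```python
def find_pitch_lag(residual, min_lag, max_lag):
    """Return (lag, gain) maximizing the normalized correlation."""
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        cross = sum(residual[n] * residual[n - lag]
                    for n in range(lag, len(residual)))
        past_energy = sum(residual[n - lag] ** 2
                          for n in range(lag, len(residual)))
        if cross <= 0.0 or past_energy <= 0.0:
            continue
        score = cross * cross / past_energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, cross / past_energy, score
    return best_lag, best_gain

def ltp_residual(residual, lag, gain):
    """Remove the part of each sample predicted from one pitch lag earlier."""
    return [residual[n] - (gain * residual[n - lag] if n >= lag else 0.0)
            for n in range(len(residual))]

def energy(x):
    return sum(v * v for v in x)

# Synthetic LPC residual (assumption): a four-peak pitch pulse repeating
# every 50 samples, with its amplitude decaying slowly from period to period.
pulse = [1.0, -0.5, 0.3, 0.1]
lpc_res = [0.0] * 400
for period in range(8):
    for offset, value in enumerate(pulse):
        lpc_res[50 * period + offset] = value * 0.98 ** period

lag, gain = find_pitch_lag(lpc_res, 20, 100)
ltp_res = ltp_residual(lpc_res, lag, gain)
print(lag)                                 # 50
print(energy(ltp_res) < energy(lpc_res))   # True
```

In this toy case the search recovers the 50-sample period, and the LTP residual retains essentially only the first pitch pulse, which could not be predicted from earlier samples.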
The sets of LPC parameters, the LTP vectors and the LTP residual signal are each quantized prior to transmission (quantization being the process of converting a continuous range of values into a set of discrete values, or a larger approximately continuous set of discrete values into a smaller set of discrete values). The advantage of separating out the LPC residual signal into the LTP vectors and LTP residual signal is that the LTP residual typically has a lower energy than the LPC residual, and so requires fewer bits to quantize.
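As a simple illustration of quantization as defined above, the following sketch applies a uniform scalar quantizer to a few continuous values. The step size of 0.5, the sample values and the function names are assumptions made for the example; practical codecs generally use more elaborate quantizers.

```python
import math

def quantize(values, step):
    """Map each value to the index of the nearest multiple of `step`."""
    return [math.floor(v / step + 0.5) for v in values]

def dequantize(indices, step):
    """Reconstruct approximate values from quantization indices."""
    return [q * step for q in indices]

values = [0.12, -0.74, 1.3, 0.26]
indices = quantize(values, 0.5)
print(indices)                    # [0, -1, 3, 1]
print(dequantize(indices, 0.5))   # [0.0, -0.5, 1.5, 0.5]
```

The integer indices are what a subsequent lossless coder would encode; the reconstruction error of each value is at most half the step size.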
So in the illustrated example, each subframe 108 would comprise: (i) a quantized set of LPC parameters representing the spectral envelope, (ii)(a) a quantized LTP vector related to the correlation between pitch periods in the source signal, and (ii)(b) a quantized LTP residual signal representative of the source signal with the effects of this inter-period correlation removed.
Prior to transmission, the quantized values are encoded.
Pyramid vector coding is a lossless enumeration coding technique that provides efficient encoding for integer values with a Laplacian probability distribution, where the probability of an integer value decreases exponentially with its absolute value. Pyramid vector coding is commonly used in transform coding and subband coding of still and moving images and in audio transform coding. For these coding methods, the transform or subband coefficients have approximately a Laplacian probability distribution, making Pyramid vector coding an efficient method.
Pyramid vector coding operates on a block of L quantization indices q(n), typically produced by scalar, lattice or trellis quantizing transform coefficients. In one implementation of Pyramid vector coding, the first step is to convert the block of quantization indices into a block of sign values s(n) and a block of absolute values u(n). The sign values corresponding to nonzero quantization indices are encoded with a simple two-level entropy coder. The absolute values are summed together to produce the radius K
$$K = \sum_{n=1}^{L} u(n),$$
which is indicated to the decoder separately.
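The first step described above may be sketched as follows. The function name and the example block of quantization indices are hypothetical; the entropy coding of the sign values is not shown.

```python
def split_signs_and_magnitudes(q):
    """Split quantization indices q(n) into signs, absolute values and radius.

    Sign values are kept only for nonzero indices, since a zero has no sign
    to transmit.
    """
    signs = [-1 if v < 0 else 1 for v in q if v != 0]
    magnitudes = [abs(v) for v in q]      # u(n)
    radius = sum(magnitudes)              # K
    return signs, magnitudes, radius

q = [0, -2, 1, 0, 0, 3, -1, 0]            # a block of L = 8 indices
signs, u, K = split_signs_and_magnitudes(q)
print(signs)  # [-1, 1, 1, -1]
print(u)      # [0, 2, 1, 0, 0, 3, 1, 0]
print(K)      # 7
```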
Pyramid vector coding represents the block of absolute values u(n) as a distribution of K unit pulses over the L samples. The number of possible such distributions is denoted by N(L,K) and can be computed recursively using
$$N(l,k) = N(l-1,k) + N(l,k-1) \quad \text{where} \quad k = \sum_{n=1}^{l} u(n)$$
with N(l,0)=1 and N(1,k)=1.
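The recursion can be evaluated directly, as in the sketch below. The function name `num_distributions` is hypothetical. The comparison against the binomial coefficient C(l+k−1, k) is not taken from the description above; it is the standard count of ways to place k identical pulses in l positions and is included only as a sanity check. The last line gives the number of bits a fixed-length index over N(L,K) values would need.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_distributions(l, k):
    """N(l, k): number of ways to distribute k unit pulses over l samples."""
    if k == 0 or l == 1:          # boundary conditions N(l,0)=1, N(1,k)=1
        return 1
    return num_distributions(l - 1, k) + num_distributions(l, k - 1)

print(num_distributions(4, 3))                    # 20
print(num_distributions(8, 7))                    # 3432
print(num_distributions(8, 7) == comb(14, 7))     # True
print((num_distributions(8, 7) - 1).bit_length()) # 12
```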
The encoding process computes an index b for one of the N(L,K) distributions, according to the following pseudo code.
Init:
    b=0;
    k=K;
    l=L;
for n=1 . . . L
    b=b+N(l, k)−N(l, k−u(n));
    k=k−u(n);
    l=l−1;
end
The result is an index b, with 0<=b<N(L,K). For efficiency reasons, the N(l,k) values are often stored in a ROM table of size L×Kmax, where Kmax is the maximum radius, so that the recursive computation of N(l,k) is avoided.
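A sketch of the encoding loop using a precomputed table in place of a ROM is given below. The names `build_table` and `pvc_encode` are hypothetical, and the table here includes a column for k=0 and an unused row for l=0 purely to keep the indexing simple.

```python
def build_table(max_l, max_k):
    """table[l][k] = N(l, k) for 1 <= l <= max_l and 0 <= k <= max_k.

    Row 0 is an unused placeholder so that l can index the table directly.
    """
    table = [[0] * (max_k + 1) for _ in range(max_l + 1)]
    for l in range(1, max_l + 1):
        for k in range(max_k + 1):
            if k == 0 or l == 1:
                table[l][k] = 1
            else:
                table[l][k] = table[l - 1][k] + table[l][k - 1]
    return table

def pvc_encode(u, table):
    """Compute the index b of the pulse distribution u(1..L)."""
    l, k, b = len(u), sum(u), 0
    for pulses in u:
        b += table[l][k] - table[l][k - pulses]
        k -= pulses
        l -= 1
    return b

table = build_table(3, 2)
blocks = [[0, 0, 2], [0, 1, 1], [0, 2, 0], [1, 0, 1], [1, 1, 0], [2, 0, 0]]
print([pvc_encode(u, table) for u in blocks])  # [0, 1, 2, 3, 4, 5]
```

For L=3 and K=2 there are N(3,2)=6 distributions, and the loop assigns each of them a distinct index in the range 0 to 5.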
The index b is decoded according to the pseudo code
Init:
    k=K;
    l=L;
for n=1 . . . L
    u(n)={the largest integer value j, with 0<=j<=k, such that N(l, k)−N(l, k−j)<=b};
    b=b−N(l, k)+N(l, k−u(n));
    k=k−u(n);
    l=l−1;
end
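A decoding loop that inverts the encoder step b=b+N(l, k)−N(l, k−u(n)) can be sketched as follows. The sketch takes u(n) to be the largest j, not exceeding the remaining pulse count k, for which N(l, k)−N(l, k−j) does not exceed b; this is the search rule that recovers the encoder's input. The names `N` and `pvc_decode` are hypothetical, and N is computed by memoized recursion rather than read from a table.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def N(l, k):
    """Number of distributions of k unit pulses over l samples."""
    if k == 0 or l == 1:
        return 1
    return N(l - 1, k) + N(l, k - 1)

def pvc_decode(b, length, radius):
    """Recover the absolute values u(1..L) from index b, given L and K."""
    l, k = length, radius
    u = []
    for _ in range(length):
        j = 0
        # Grow j while the number of distributions ordered before
        # "j + 1 pulses at this sample" still does not exceed b.
        while j < k and N(l, k) - N(l, k - (j + 1)) <= b:
            j += 1
        u.append(j)
        b -= N(l, k) - N(l, k - j)
        k -= j
        l -= 1
    return u

print([pvc_decode(b, 2, 2) for b in range(N(2, 2))])
# [[0, 2], [1, 1], [2, 0]]
```

Each decoded block sums to the radius K, which is why K (or a rule fixing it) has to be known to the decoder in addition to the index b.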
Every index corresponds to a unique distribution, and each distribution is coded with the same bitrate. In practice, the signal that is being encoded may not have a Laplacian probability distribution, and therefore the distributions are not all equally likely. It has been observed that for optimum coding efficiency, some distributions should in that case be coded at a lower bitrate than others. In linear predictive speech coding for instance, a residual signal is encoded that has an approximately Gaussian probability distribution. Quantizing and encoding such a residual with an entropy coder designed for a Laplacian probability distribution reduces coding efficiency, leading to a higher bitrate.
In another implementation of pyramid vector coding, the quantization indices themselves, which may be negative, are encoded, without first converting the quantization indices into sign values and absolute values.
Similar enumeration coding techniques exist such as Conditional Product Code, Factorial Packing and Conditional Product-Product Code, which all encode quantization indices efficiently if the quantization indices have a Laplacian probability distribution.
In predictive speech coding, sometimes the number of unit pulses per block is fixed. In that case, the radius K is not transmitted to the decoder. Alternatively, only a few values are allowed for the radius K, which reduces the bitrate for encoding the radius compared to a radius K that is unconstrained up to a maximum value Kmax.
It is desirable to provide an improved encoding technique for encoding quantization values in speech transmission.