The invention relates to electronic devices, and, more particularly, to encoding and decoding with algebraic codebooks and systems employing such algebraic codebooks.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (VoIP) transmission benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a(j), j=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting
$$r(n) = s(n) - \sum_{j=1}^{M} a(j)\, s(n-j) \qquad (1)$$
and minimizing $\sum_n r(n)^2$. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network (PSTN) sampling for digital transmission); and the number of samples {s(n)} in a frame is often 80 or 160 (10 or 20 ms frames). Various windowing operations may be applied to the samples of the input speech frame. The name "linear prediction" arises from the interpretation of $r(n) = s(n) - \sum_{j=1}^{M} a(j)\, s(n-j)$ as the error in predicting s(n) by the linear combination of preceding speech samples $\sum_{j=1}^{M} a(j)\, s(n-j)$. Thus minimizing $\sum_n r(n)^2$ yields the set of coefficients {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage.
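As an illustration of the minimization of equation (1), the following sketch computes LP coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is one common way to solve the least-squares problem and is an assumption here; the text does not state which method is used. The function names `autocorrelation` and `lp_coefficients` are hypothetical, and windowing is omitted.

```python
def autocorrelation(s, max_lag):
    """Return [R(0), ..., R(max_lag)] with R(k) = sum_n s(n) * s(n - k)."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(max_lag + 1)]


def lp_coefficients(s, order):
    """Levinson-Durbin recursion (assumed method, not taken from the text).

    Returns [a(1), ..., a(M)] using the sign convention of equation (1),
    i.e. s(n) is predicted by sum_j a(j) * s(n - j).
    """
    r = autocorrelation(s, order)
    a = [0.0] * (order + 1)      # a[0] is unused so that a[j] matches a(j)
    err = r[0]                   # prediction error energy at order 0
    for i in range(1, order + 1):
        if err <= 0.0:           # degenerate (e.g. all-zero) frame: stop early
            break
        # Reflection coefficient for order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


# Toy frame: a decaying exponential, which a first-order predictor fits well.
frame = [0.5 ** n for n in range(50)]
coeffs = lp_coefficients(frame, 2)
```

For this toy frame the first coefficient comes out very close to 0.5 and the second very close to 0, since each sample is half of the preceding one. A practical coder would use M of about 10-12 on windowed 80- or 160-sample frames, as described above.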
The {r(n)} form the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an LP excitation from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
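The relationship between the residual and the synthesis filter 1/A(z) can be checked numerically. The sketch below is illustrative only: `lp_residual` and `synthesize` are hypothetical names, samples before the start of the frame are assumed to be zero, and no quantization is modeled.

```python
def lp_residual(s, a):
    """r(n) = s(n) - sum_{j=1..M} a(j) * s(n - j), as in equation (1).

    Samples before the start of the frame are taken to be zero (assumption).
    """
    m = len(a)
    return [s[n] - sum(a[j - 1] * s[n - j]
                       for j in range(1, m + 1) if n - j >= 0)
            for n in range(len(s))]


def synthesize(excitation, a):
    """Run an excitation through 1/A(z): s(n) = e(n) + sum_j a(j) * s(n - j)."""
    out = []
    for n, e in enumerate(excitation):
        predicted = sum(a[j - 1] * out[n - j]
                        for j in range(1, len(a) + 1) if n - j >= 0)
        out.append(e + predicted)
    return out


speech = [1.0, 0.5, -0.25, 2.0, 0.75]
a = [0.5, -0.25]                       # arbitrary example coefficients
residual = lp_residual(speech, a)
rebuilt = synthesize(residual, a)      # matches `speech` up to rounding
```

Using the unquantized residual as the excitation reproduces the input frame, which is why the encoder's task reduces to representing the residual compactly.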
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) excitation (waveform or parameters such as pitch), and the (quantized) gain. A receiver regenerates the speech with the same perceptual characteristics as the input speech. FIGS. 5-6 show the high level blocks in an LP system. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bit rates as low as 2-3 kb/s (kilobits per second).
Indeed, the ITU standard G.729 with a bit rate of 8 kb/s uses LP analysis with code excitation (CELP) to compress voiceband speech and has performance essentially equivalent to the 32 kb/s ADPCM of ITU standard G.726. FIG. 2 illustrates CELP synthesis. The excitation in G.729 consists of the sum of an adaptive codebook contribution and a fixed (algebraic) codebook contribution; FIGS. 3-4 show the generic encoder and decoder. The adaptive codebook contribution provides periodicity (pitch) for the excitation, and the algebraic codebook contribution provides the remainder. Each algebraic codebook vector contains four ±1 pulses with one pulse in each of four interleaved tracks of 8 or 16 positions; together the tracks make up the 40-component vector corresponding to a 40-sample subframe excitation. Indeed, the excitation for a subframe will roughly be the sum of a gain times the prior subframe's excitation, time shifted by a pitch delay, plus a gain times the algebraic codebook vector. In more detail, the algebraic codebook vector has 40 positions (labeled 0 through 39) with one ±1 pulse among the eight positions 0, 5, 10, 15, 20, 25, 30, and 35 which make up track 0; one ±1 pulse among the eight positions 1, 6, 11, 16, 21, 26, 31, and 36 which constitute track 1; one ±1 pulse among the eight positions 2, 7, 12, 17, 22, 27, 32, and 37 forming track 2; and one ±1 pulse among the 16 positions 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, and 39 forming track 3. All 36 positions without pulses equal 0. Note that this splitting of the 40 positions into four interleaved tracks with one ±1 pulse in each track somewhat restricts the possible placements of four ±1 pulses among the 40 positions but greatly reduces the number of bits required to encode the pulses. In fact, the location of a pulse among eight positions takes 3 bits, the location of a pulse among 16 positions takes 4 bits, and the sign of each pulse takes 1 bit; thus the total to encode the vector is 3+3+3+4+4 = 17 bits.
In contrast, a pulse position among 40 components takes 6 bits and the sign of a pulse again takes 1 bit; thus encoding four ±1 pulses located anywhere in the 40 positions would take 4×7 = 28 bits.
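The track layout and the 17-bit versus 28-bit comparison can be reproduced with a short sketch. The names `TRACKS`, `codevector`, and `track_bits` are hypothetical, and the mapping from an index within a track to a position is chosen for illustration only; it is not claimed to be the bit layout defined by the G.729 standard.

```python
from math import ceil, log2

# Four interleaved tracks covering positions 0..39, as listed above.
TRACKS = [
    list(range(0, 40, 5)),                                   # track 0
    list(range(1, 40, 5)),                                   # track 1
    list(range(2, 40, 5)),                                   # track 2
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),   # track 3 (16 positions)
]


def codevector(pulses):
    """Build the 40-component vector from one (index_in_track, sign) per track."""
    v = [0] * 40
    for track, (index, sign) in zip(TRACKS, pulses):
        v[track[index]] += sign
    return v


def track_bits(track):
    """Bits for one pulse on a track: position bits plus one sign bit."""
    return ceil(log2(len(track))) + 1


structured_bits = sum(track_bits(t) for t in TRACKS)    # 4 + 4 + 4 + 5
unstructured_bits = 4 * (ceil(log2(40)) + 1)            # 4 * (6 + 1)

# Example: pulses at positions 0 (+1), 6 (-1), 37 (+1), and 39 (-1).
example = codevector([(0, +1), (1, -1), (7, +1), (15, -1)])
```

Here `structured_bits` evaluates to 17 and `unstructured_bits` to 28, matching the totals given in the text.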
Similarly, the GSM Enhanced Full Rate (EFR) standard uses CELP including algebraic codebook vectors having a total of ten pulses in a 40-position vector with two ±1 pulses on each of five interleaved tracks; each track has eight positions for the 40-sample excitation. That is, there are two ±1 pulses located among the eight positions 0, 5, 10, 15, 20, 25, 30, and 35; two ±1 pulses among the eight positions 1, 6, 11, 16, 21, 26, 31, and 36; two ±1 pulses among the eight positions 2, 7, 12, 17, 22, 27, 32, and 37; two ±1 pulses among the eight positions 3, 8, 13, 18, 23, 28, 33, and 38; and two ±1 pulses among the eight positions 4, 9, 14, 19, 24, 29, 34, and 39. The vector equals 0 at the 30 non-pulse positions. This appears to require 40 bits (3 position bits plus 1 sign bit for each of the ten pulses), but the encoding of the sign bits can be reduced from 2 bits for two pulses on the same track to only 1 bit as follows. A single sign bit indicates the sign of the first transmitted pulse position within the track, and the sign of the second transmitted pulse depends upon its position relative to that of the first pulse: if the position of the second pulse is smaller than (precedes) that of the first pulse, then the second pulse has the opposite sign; otherwise it has the same sign. Thus 5 bits are saved, one per track, for a total of 35 bits. Note that two pulses may have the same position (in effect one pulse of twice the amplitude).
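The single-sign-bit rule can be sketched as a pair of functions for one track. This is a minimal illustration under stated assumptions: `encode_pair` and `decode_pair` are hypothetical names, positions are indices 0 through 7 within the track, and the polarity of the sign bit (0 for +1, 1 for -1) is an arbitrary choice not taken from the standard.

```python
def encode_pair(pulse_a, pulse_b):
    """Encode two (position, sign) pulses on one track.

    Returns (first_pos, second_pos, sign_bit). The transmission order of the
    two positions carries the second sign, so only one sign bit is needed.
    """
    (pa, sa), (pb, sb) = pulse_a, pulse_b
    if sa == sb:
        # Same sign: send the smaller position first (second >= first).
        first, second, sign = min(pa, pb), max(pa, pb), sa
    else:
        if pa == pb:
            raise ValueError("opposite-sign pulses at one position cancel")
        # Opposite signs: send the larger position first (second < first).
        if pa > pb:
            first, second, sign = pa, pb, sa
        else:
            first, second, sign = pb, pa, sb
    sign_bit = 0 if sign > 0 else 1     # bit polarity is an assumption
    return first, second, sign_bit


def decode_pair(first, second, sign_bit):
    """Recover both signed pulses from two positions and one sign bit."""
    s1 = 1 if sign_bit == 0 else -1
    s2 = -s1 if second < first else s1
    return (first, s1), (second, s2)


code = encode_pair((2, +1), (5, -1))    # opposite signs: position 5 goes first
pulses = decode_pair(*code)
```

With 3 bits per position, each track then costs 3+3+1 = 7 bits. Two same-sign pulses at one position encode with equal first and second positions, which decodes to the same sign twice, consistent with the double-amplitude case noted above.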
In general, with 2n pulses per track in an algebraic codebook, only n sign bits are needed because the pulses can be paired with the first pulse in a pair having the sign bit and the second pulse in the pair having the opposite or same sign according to relative pulse position.
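The bit cost of this pairing scheme grows quickly with the number of pulses per track, which a small calculation makes concrete. The helper `paired_track_bits` is a hypothetical name, each position is assumed to be coded independently with a fixed-length index, and the 64-sample wideband layout at the end is an invented example rather than a configuration taken from any standard.

```python
from math import ceil, log2


def paired_track_bits(positions, pairs):
    """Bits for 2 * pairs pulses on one track with one sign bit per pair."""
    position_bits = ceil(log2(positions))
    return 2 * pairs * position_bits + pairs


# Five tracks of eight positions with one pair per track (the EFR-style layout).
efr_style_bits = 5 * paired_track_bits(positions=8, pairs=1)

# Hypothetical layout: four tracks of 16 positions with two pairs per track.
wideband_example_bits = 4 * paired_track_bits(positions=16, pairs=2)
```

Under these assumptions `efr_style_bits` is 35, while the hypothetical four-pulses-per-track layout already needs 72 bits per subframe.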
Further, CELP codecs with algebraic codebooks have been proposed for wideband speech and audio coding at rates such as 16 kb/s and 24 kb/s. However, the algebraic codebook vectors still require too many bits for encoding more than two pulses per track.