The invention relates to electronic devices, and, more particularly, to speech coding, transmission, storage, and decoding/synthesis methods and systems.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated-channel and packetized-over-network (VoIP) transmission benefit from compression of speech signals. The widely used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter together with a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a(j), j=1, 2, ..., M, for an input frame of digital speech samples {s(n)} by setting

$$r(n) = s(n) - \sum_{j=1}^{M} a(j)\, s(n-j) \qquad (1)$$

and minimizing $\sum_n r(n)^2$. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network (PSTN) sampling for digital transmission); and the number of samples {s(n)} in a frame is often 80 or 160 (10 or 20 ms frames). Various windowing operations may be applied to the samples of the input speech frame. The name “linear prediction” arises from the interpretation of r(n) in equation (1) as the error in predicting s(n) by the linear combination of preceding speech samples $\sum_{j=1}^{M} a(j)\, s(n-j)$. Thus minimizing $\sum_n r(n)^2$ yields the {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage.
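As an illustration of the LP analysis just described, the following is a minimal sketch, not a definitive implementation. It assumes the autocorrelation method (samples outside the windowed frame are taken as zero) and solves the resulting normal equations with the Levinson-Durbin recursion; other ways of minimizing the squared residual are equally possible. The function names, the Hamming window, and the synthetic test frame produced by `demo_frame` are hypothetical choices made only for this example.

```python
import math
import random


def autocorrelation(s, max_lag):
    """R(k) = sum_n s(n) s(n-k), k = 0..max_lag, with samples outside the frame taken as zero."""
    n = len(s)
    return [sum(s[i] * s[i - k] for i in range(k, n)) for k in range(max_lag + 1)]


def levinson_durbin(R, M):
    """Solve sum_j a(j) R(|i-j|) = R(i), i = 1..M, for the predictor coefficients.

    Returns (a, err) where a[j-1] holds a(j) and err is the minimized
    residual energy under the autocorrelation-method assumption.
    """
    a = [0.0] * (M + 1)  # a[0] is unused so that a[j] matches a(j)
    err = R[0]
    for i in range(1, M + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame: stop early
        acc = R[i] - sum(a[j] * R[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for order i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def residual(s, a):
    """r(n) = s(n) - sum_j a(j) s(n-j) over the frame, with zero history before the frame."""
    M = len(a)
    return [
        s[n] - sum(a[j - 1] * s[n - j] for j in range(1, min(M, n) + 1))
        for n in range(len(s))
    ]


def demo_frame(num_samples=160, fs=8000.0, seed=0):
    """Hypothetical 20 ms test frame at 8 kHz: two tones plus noise, Hamming-windowed."""
    rng = random.Random(seed)
    frame = []
    for n in range(num_samples):
        tone = math.sin(2 * math.pi * 200.0 * n / fs) + 0.5 * math.sin(2 * math.pi * 1300.0 * n / fs)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_samples - 1))
        frame.append(window * (tone + 0.1 * rng.gauss(0.0, 1.0)))
    return frame
```

For a frame of 160 samples and M=10, one would call `R = autocorrelation(frame, 10)` followed by `a, err = levinson_durbin(R, 10)`; `residual(frame, a)` then gives the prediction error r(n) of equation (1) over the frame.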
The {r(n)} form the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z), where A(z) is the transfer function of equation (1), that is, $A(z) = 1 - \sum_{j=1}^{M} a(j)\, z^{-j}$. Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an LP excitation from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
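The relation between the residual and the synthesis filter can be sketched as follows. This is an illustrative example under stated assumptions, not the method of any particular coder: the filters start from zero history, and the function names, the order-2 coefficients, and the pitch period of 40 samples are hypothetical values chosen only for the demonstration. It shows that passing the residual through 1/A(z) recovers the original samples, and it generates the two idealized excitations mentioned above (a pulse train for voiced frames, white noise for unvoiced frames).

```python
import random


def analysis_filter(s, a):
    """A(z): r(n) = s(n) - sum_j a(j) s(n-j), with zero history before the frame."""
    M = len(a)
    return [
        s[n] - sum(a[j - 1] * s[n - j] for j in range(1, min(M, n) + 1))
        for n in range(len(s))
    ]


def synthesis_filter(e, a):
    """1/A(z): s(n) = e(n) + sum_j a(j) s(n-j), with zero history before the frame."""
    M = len(a)
    out = []
    for n in range(len(e)):
        out.append(e[n] + sum(a[j - 1] * out[n - j] for j in range(1, min(M, n) + 1)))
    return out


def pulse_train(num_samples, pitch_period):
    """Idealized voiced excitation: a unit pulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def white_noise(num_samples, seed=0):
    """Idealized unvoiced excitation: zero-mean, unit-variance Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


# Hypothetical stable order-2 predictor (poles inside the unit circle).
EXAMPLE_COEFFS = [1.2, -0.6]

# Voiced-like frame: 160 samples at 8 kHz with a 40-sample (200 Hz) pitch period.
voiced = synthesis_filter(pulse_train(160, 40), EXAMPLE_COEFFS)
# Unvoiced-like frame driven by noise.
unvoiced = synthesis_filter(white_noise(160), EXAMPLE_COEFFS)
```

Applying `analysis_filter` to `voiced` with the same coefficients returns the pulse train up to floating-point rounding, which is the sense in which the residual is the ideal excitation.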
The LP compression approach basically transmits or stores only updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and the (quantized) gain. A receiver regenerates the speech with the same perceptual characteristics as the input speech. FIG. 9 shows the blocks in an LP system. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bit rates as low as 2-3 kb/s (kilobits per second).
Indeed, the ITU standard G.729 Annex E with a bit rate of 11.8 kb/s uses LP analysis with codebook excitation (CELP) to compress voiceband speech and has performance comparable to the 64 kb/s PCM used for PSTN digital transmission.
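The bit rates quoted above can be put on a per-frame footing with a little arithmetic. The sketch below assumes a 10 ms frame (one of the typical frame lengths mentioned earlier) and 8-bit PCM samples at 8 kHz, which is what the 64 kb/s figure implies; the function name `bits_per_frame` is a hypothetical helper used only for illustration.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Bits available for one frame at the given bit rate (integer arithmetic)."""
    return bit_rate_bps * frame_ms // 1000


# 8 kHz sampling with 8 bits per sample gives the 64 kb/s PCM rate.
PCM_RATE_BPS = 8000 * 8
# 11.8 kb/s coder rate from the text.
CELP_RATE_BPS = 11800

# Assumed 10 ms frame, i.e. 80 samples at 8 kHz.
pcm_bits = bits_per_frame(PCM_RATE_BPS, 10)
celp_bits = bits_per_frame(CELP_RATE_BPS, 10)
compression_ratio = PCM_RATE_BPS / CELP_RATE_BPS
```

Under these assumptions a 10 ms frame costs 640 bits as PCM and 118 bits at 11.8 kb/s, a reduction by a factor of roughly 5.4.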
However, the quality of even the G.729 Annex E standard does not meet the demand for high quality speech systems, and various proposals extend the coding to wideband (e.g., 0-7 kHz) speech without too large an increase in transmission bit rate.
The direct approach of applying LP coding to the full 0-8 kHz wideband increases the bit rate too much or degrades the quality. One alternative approach simply extrapolates from the (coded) 0-4 kHz lowband to create a 4-8 kHz highband signal; see Chan et al., Quality Enhancement of Narrowband CELP-Coded Speech via Wideband Harmonic Re-Synthesis, IEEE ICASSP 1997, pp. 1187-1190. Another approach uses split-band CELP or MPLPC, coding a 4-8 kHz highband separately from the 0-4 kHz lowband and with fewer bits allocated to the highband; see Drogo de Jacovo et al., Some Experiments of 7 kHz Audio Coding at 16 kbit/s, IEEE ICASSP 1989, pp. 192-195. Similarly, Tucker, Low Bit-Rate Frequency Extension Coding, IEE Colloquium on Audio and Music Technology 1998, pp. 3/1-3/5, provides standard coding of the 0-4 kHz lowband and codes the 4-8 kHz highband speech only for unvoiced frames (as determined in the lowband), using an LP filter of order 2-4 with noise excitation. However, these approaches suffer from either too high a bit rate or too low a quality.