I. Technical Field
This disclosure generally relates to digital signal processing, and more specifically, to techniques for encoding and decoding audio signals for storage and/or communication.
II. Background
In digital communications, signals are typically coded for transmission and decoded for reception. Coding of signals concerns converting the original signals into a format suitable for propagation over a transmission medium. The objective is to preserve the quality of the original signals while consuming as little of the medium's bandwidth as possible. Decoding of signals involves the reverse of the coding process.
A known coding scheme uses the technique of pulse-code modulation (PCM). FIG. 1 shows a time-varying signal x(t) that can be a segment of a speech signal, for instance. The y-axis and the x-axis represent the signal amplitude and time, respectively. The analog signal x(t) is sampled by a plurality of pulses 20. Each pulse 20 has an amplitude representing the signal x(t) at a particular time. The amplitude of each of the pulses 20 can thereafter be coded in a digital value for later transmission.
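By way of illustration only, the sampling and digital coding of x(t) described above can be sketched as follows. The 8 kHz sampling rate, the 16-bit word length, and the function name sample_and_quantize are assumptions made for this example and are not taken from FIG. 1.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz=8000, bits=16):
    """Sample signal(t) at uniform instants and code each amplitude as an integer.

    signal is assumed to return amplitudes in [-1.0, 1.0]; values outside
    that range are clipped. The rate and word length are illustrative.
    """
    full_scale = 2 ** (bits - 1) - 1          # largest positive code
    count = int(round(duration_s * sample_rate_hz))
    codes = []
    for n in range(count):
        t = n / sample_rate_hz                 # sampling instant of pulse n
        amplitude = max(-1.0, min(1.0, signal(t)))
        codes.append(int(round(amplitude * full_scale)))
    return codes

# Example: 20 ms of a 440 Hz tone standing in for x(t).
pcm = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t), 0.020)
```

Each integer in pcm plays the role of one coded pulse 20.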
To conserve bandwidth, the digital values of the PCM pulses 20 can be compressed using a logarithmic companding process prior to transmission. At the receiving end, the receiver merely performs the reverse of the coding process mentioned above to recover an approximate version of the original time-varying signal x(t). Apparatuses employing the aforementioned scheme are commonly called A-law or μ-law codecs.
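As a non-limiting sketch, logarithmic companding of the μ-law type can be expressed with the continuous curve F(x) = sgn(x) ln(1 + μ|x|) / ln(1 + μ) and its inverse. The value μ = 255, the uniform 8-bit mapping, and all function names below are assumptions for illustration; the code does not reproduce the segmented bit layout of any standardized codec.

```python
import math

MU = 255.0  # assumed companding parameter

def mu_law_compress(x, mu=MU):
    """Map an amplitude in [-1, 1] onto the logarithmic scale."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to a code in 0..255 (illustrative)."""
    return int(round((mu_law_compress(x) + 1.0) * 127.5))

def decode_8bit(code):
    """Undo encode_8bit; the result is only an approximation of x."""
    return mu_law_expand(code / 127.5 - 1.0)
```

Because small amplitudes occupy a larger share of the compressed scale, the quantization error after expansion is roughly proportional to the signal level, which is why the receiver recovers only an approximate version of x(t).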
As the number of users increases, there is a further practical need for bandwidth conservation. For instance, in a wireless communication system, a multiplicity of users are often limited to sharing a finite amount of frequency spectrum. Each user is normally allocated a limited bandwidth among other users. Thus, as the number of users increases, so does the need to further compress digital information in order to conserve the bandwidth available on the transmission channel.
For voice communications, speech coders are frequently used to compress voice signals. In the past decade or so, considerable progress has been made in the development of speech coders. A commonly adopted technique employs the method of code excited linear prediction (CELP). Details of CELP methodology can be found in publications entitled “Digital Processing of Speech Signals,” by Rabiner and Schafer, Prentice Hall, ISBN: 0132136031, September 1978; and “Discrete-Time Processing of Speech Signals,” by Deller, Proakis and Hansen, Wiley-IEEE Press, ISBN: 0780353862, September 1999. The basic principles underlying the CELP method are briefly described below.
Referring to FIG. 1, using the CELP method, instead of digitally coding and transmitting each PCM sample 20 individually, the PCM samples 20 are coded and transmitted in groups. For instance, the PCM pulses 20 of the time-varying signal x(t) in FIG. 1 are first partitioned into a plurality of frames 22. Each frame 22 is of a fixed time duration, for instance 20 ms. The PCM samples 20 within each frame 22 are collectively coded via the CELP scheme and thereafter transmitted. Exemplary frames of the sampled pulses are PCM pulse groups 22A-22C shown in FIG. 1.
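The partitioning into fixed-duration frames can be sketched as below. The 8 kHz sampling rate (giving 160 samples per 20 ms frame) and the name split_into_frames are assumptions for this example; any trailing samples that do not fill a whole frame are simply dropped here.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Partition a list of PCM samples into consecutive fixed-length frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    if frame_len <= 0:
        raise ValueError("frame duration too short for this sampling rate")
    full_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(full_frames)]

# Example: 500 samples yield three complete frames, analogous to 22A-22C.
frames = split_into_frames(list(range(500)))
```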
For simplicity, take only the three PCM pulse groups 22A-22C for illustration. During encoding prior to transmission, the digital values of the PCM pulse groups 22A-22C are consecutively fed to a linear predictor (LP) module. The resultant output is a set of coefficient and residual values, which basically represents the spectral content of the pulse groups 22A-22C. The LP filter is then quantized.
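One common way to obtain the coefficient and residual values for a frame is the autocorrelation method solved by the Levinson-Durbin recursion. The sketch below assumes that method; the source does not state which solver the LP module uses, and the function names and the first-order test signal are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over n of frame[n] * frame[n + lag]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err) with a[0] == 1, where the prediction of x[n] is
    -sum(a[k] * x[n-k] for k in 1..order) and err is the residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break                                  # degenerate (e.g. silent) frame
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                             # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err

def lp_residual(frame, a):
    """Prediction error e[n] = x[n] + sum(a[k] * x[n-k]), zero initial state."""
    out = []
    for n in range(len(frame)):
        acc = frame[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * frame[n - k]
        out.append(acc)
    return out

# Example: a decaying exponential is predicted almost perfectly at order 1.
frame = [0.9 ** n for n in range(160)]
coeffs, err = levinson_durbin(autocorrelation(frame, 1), 1)
residual = lp_residual(frame, coeffs)
```

In this example the residual carries far less energy than the frame itself, which is the property that makes it cheaper to represent than the raw samples.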
The LP module generates an approximation of the spectral representation of the PCM pulse groups 22A-22C. As such, during the predicting process, the residual values, or prediction errors, are introduced. The residual values are mapped to a codebook which carries entries of various combinations available for close matching of the coded digital values of the PCM pulse groups 22A-22C. The best fitted values in the codebook are mapped. The mapped values are the values to be transmitted.
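The mapping of residual values to a best-fitting codebook entry can be illustrated with a plain minimum-squared-error search. This is a simplified sketch: practical CELP coders typically select the entry by analysis-by-synthesis with perceptual weighting, which is not shown. The four-entry codebook and the function name are hypothetical.

```python
def nearest_codebook_index(target, codebook):
    """Return (index, squared_error) of the entry closest to target."""
    best_index, best_error = -1, float("inf")
    for index, entry in enumerate(codebook):
        error = sum((t - c) ** 2 for t, c in zip(target, entry))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

# Hypothetical toy codebook of four-sample excitation vectors.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]

index, error = nearest_codebook_index([0.9, 0.1, -0.1, 0.0], CODEBOOK)
```

Only the selected index needs to be sent, since the receiver can look up the same entry in its own copy of the codebook.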
Thus, using the CELP method in telecommunications, the encoder (not shown) merely has to generate the coefficients and the mapped codebook values. The transmitter needs only to transmit the coefficients and the mapped codebook values, instead of the individually coded PCM pulse values as in the A-law and μ-law encoders mentioned above. Consequently, a substantial amount of communication channel bandwidth can be saved.
The receiver also has a codebook similar to that in the transmitter. The decoder in the receiver, relying on the same codebook, merely has to reverse the encoding process described above. By also applying the received filter coefficients, the time-varying signal x(t) can be recovered.
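The reversal performed by the decoder can be sketched as an all-pole synthesis filter driven by the excitation. The example below assumes the sign convention e[n] = x[n] + sum of a[k] x[n-k] with a[0] = 1 and zero initial filter state; the function names and coefficient values are hypothetical. When the excitation is the exact residual the frame is reproduced exactly, whereas with a codebook entry in its place the output is an approximation.

```python
def analysis_filter(frame, a):
    """Encoder side: e[n] = x[n] + sum(a[k] * x[n-k]) for k >= 1."""
    out = []
    for n in range(len(frame)):
        acc = frame[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * frame[n - k]
        out.append(acc)
    return out

def synthesis_filter(excitation, a):
    """Decoder side: y[n] = e[n] - sum(a[k] * y[n-k]) for k >= 1."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out

# Round trip with assumed coefficients.
a = [1.0, -0.9, 0.2]
frame = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
rebuilt = synthesis_filter(analysis_filter(frame, a), a)
```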
Heretofore, many of the known speech coding schemes, such as the CELP scheme mentioned above, are based on the assumption that the signals being coded are short-time stationary. That is, the schemes are based on the premise that frequency contents of the coded frames are stationary and can be approximated by simple (all-pole) filters and some input representation in exciting the filters. Various time domain linear prediction (TDLP) algorithms, in arriving at the codebooks as mentioned above, are based on such a model. Nevertheless, voice patterns among individuals can be very different. Non-speech audio signals, such as sounds emanated from various musical instruments, are also distinguishably different from speech signals. Furthermore, in the CELP process as described above, to expedite real-time signal processing, a short time frame is normally chosen. More specifically, as shown in FIG. 1, to reduce algorithmic delays in the mapping of the values of the PCM pulse groups, such as 22A-22C, to the corresponding entries of vectors in the codebook, a short time window 22 is defined, for example 20 ms as shown in FIG. 1. However, derived spectral or formant information from each frame is mostly common and can be shared among other frames. Consequently, the formant information is more or less repetitively sent through the communication channels, in a manner not in the best interest for bandwidth conservation.
As an improvement over TDLP algorithms, frequency domain linear prediction (FDLP) schemes have been developed to improve preservation of signal quality, applicable not only to human speech but also to a variety of other sounds, and further, to more efficiently utilize communication channel bandwidth. FDLP-based coding schemes operate by predicting the temporal evolution of spectral envelopes. FDLP is basically a frequency-domain analogue of TDLP; however, FDLP coding and decoding schemes are capable of processing much longer temporal frames when compared to TDLP. Similarly to how TDLP fits an all-pole model to the power spectrum of an input signal, FDLP fits an all-pole model to the squared Hilbert envelope of an input signal.
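For orientation, the squared Hilbert envelope referred to above can be computed as the squared magnitude of the analytic signal. The sketch below shows only that target quantity and not the all-pole fitting step; it uses a direct O(N^2) discrete Fourier transform to remain dependency-free, assumes an even-length input, and its function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (O(N^2), for short inputs only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def squared_hilbert_envelope(x):
    """Squared magnitude of the analytic signal of a real, even-length x."""
    n = len(x)
    if n == 0 or n % 2:
        raise ValueError("this sketch assumes a non-empty, even-length input")
    spectrum = dft(x)
    one_sided = [0j] * n
    one_sided[0] = spectrum[0]                 # keep DC
    one_sided[n // 2] = spectrum[n // 2]       # keep Nyquist
    for k in range(1, n // 2):
        one_sided[k] = 2 * spectrum[k]         # double positive frequencies
    return [abs(v) ** 2 for v in idft(one_sided)]

# Example: a pure cosine with a whole number of cycles has a flat envelope.
tone = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
envelope = squared_hilbert_envelope(tone)
```

For an amplitude-modulated input the envelope follows the modulation rather than the carrier, and it is this slowly varying temporal contour that the all-pole model approximates.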