High quality coding of acoustical signals at low bit rates is of pivotal importance to communications systems such as mobile telephony, secure telephone, and voice storage. In recent years, there has been a strong trend in mobile telephony towards improved quality of the reconstructed acoustical signal and towards increased flexibility in the bit rate required for transmission. The trend towards improved quality reflects, on the one hand, the customer expectation that mobile telephony provides a quality equal to that of the regular telephone network. Particularly important in this respect is the performance for background signals and music. The trend towards flexibility in bit rate reflects, on the other hand, the desire of the service providers to operate near the network capacity without the risk of having to drop calls, and possibly to have different service levels with different cost. The ability to strip bits from an existing bit stream while maintaining the ability to reconstruct the speech signal (albeit at a lower accuracy) is an especially useful type of bit rate flexibility.
With existing speech coding technology, it is difficult to meet the simultaneous challenge of improved acoustic signal quality and increased flexibility in bit rate. This difficulty is a direct result of the structure of the linear-prediction based analysis-by-synthesis (LPAS) paradigm which is commonly used in mobile telephony. Currently, LPAS coders perform better in coding speech at rates between 5 and 20 kb/s than other technologies. Accordingly, the LPAS paradigm forms the basis of virtually every digital mobile telephony standard, including GSM, D-AMPS, and PDC. However, while the performance for speech is good, current LPAS-based speech coders do not perform as well for music and background noise signals. Furthermore, the ability to strip bits from an existing bit stream has until now implied the use of relatively low-efficiency algorithms.
The LPAS coding paradigm does not perform as well for non-speech sounds because it is optimized for the description of speech. Thus, the shape of the short-term power spectrum is described as the product of a spectral envelope, which is described by an all-pole model (almost always with 10 poles), and the so-called spectral fine structure, which is a combination of two components, harmonic and noise-like in character, respectively. In practice, it is found that this model is not sufficient for many music and background-noise signals. The model shortcomings manifest themselves in perceptually inadequate descriptions of the spectral valleys (zeros), of peaks which are not part of the harmonic structure in an otherwise periodic signal, and in a so-called "swirling" effect in steady background noise signals, which is probably caused by the time variation in the parameter estimation error.
The two main existing approaches towards developing LPAS algorithms with increased flexibility in the bit rate have significant drawbacks. In the first approach, one simply combines a number of coders operating at different bit rates and selects one coder for a particular coding time segment (examples of this first approach are the TIA IS-95 and the more recent IS-127 standards). These types of coders will be referred to as "multi-rate" coders. The disadvantage of this method is that the signal reconstruction requires the arrival at the receiver of the entire bit stream of the selected coder. Thus, the bit stream cannot be altered after it leaves the transmitter.
In the second approach, embedded coding, the encoder produces a composite bit stream made up out of two or more separate bit streams: a primary bit stream which contains a basic description of the signal, and one or more auxiliary bit streams which contain information to enhance the basic signal description. In the LPAS setting, this second approach is implemented by a decomposition of the excitation signal of the LPAS coder into a primary excitation and one or more auxiliary excitations, which enhance the excitation. However, to maintain synchronicity between the encoder and decoder (fundamental for the LPAS paradigm) at all rates, the long-term predictor (present in virtually all LPAS paradigms) can only operate on the primary excitation. Since the long-term predictor provides the most significant part of the coding gain in the LPAS paradigm, this severely limits the benefit of the auxiliary excitations. Thus, these embedded LPAS coding algorithms provide increased bit rate flexibility at the expense of significantly curtailed coding efficiency.
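The principle of an embedded bit stream can be illustrated with a simple scalar quantizer whose index is split into coarse (primary) bits and refinement (auxiliary) bits. The following Python sketch is purely illustrative and does not correspond to any coder cited above; the bit allocations and signal range are arbitrary assumptions.

```python
def embedded_encode(sample, coarse_bits=3, refine_bits=2, lo=-1.0, hi=1.0):
    # Quantize to (coarse_bits + refine_bits) bits of resolution; the top
    # bits form the primary stream, the low bits the auxiliary stream.
    total = coarse_bits + refine_bits
    levels = 1 << total
    idx = min(levels - 1, max(0, int((sample - lo) / (hi - lo) * levels)))
    primary = idx >> refine_bits
    auxiliary = idx & ((1 << refine_bits) - 1)
    return primary, auxiliary

def embedded_decode(primary, auxiliary=None, coarse_bits=3, refine_bits=2,
                    lo=-1.0, hi=1.0):
    total = coarse_bits + refine_bits
    if auxiliary is None:
        # Auxiliary stream stripped in transit: reconstruct at the midpoint
        # of the coarse quantization cell (lower accuracy, still decodable).
        idx = (primary << refine_bits) + (1 << (refine_bits - 1))
    else:
        idx = (primary << refine_bits) + auxiliary
    return lo + (idx + 0.5) * (hi - lo) / (1 << total)
```

The decoder reconstructs from the primary bits alone when the refinement bits have been stripped, at the cost of a coarser quantization cell; this is the property that the excitation-splitting approach provides for LPAS coders, at the efficiency cost described above.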
For coders with fixed bit rates between 5 and 20 kb/s, the well-known LPAS paradigm dominates. Overviews of this coding paradigm are provided in, for example, P. Kroon and Ed. F. Deprettere, "A class of analysis-by-synthesis predictive coders for high quality speech coding at rates between 4.8 and 16 kbit/s", IEEE J. Selected Areas Comm., 6:353-363, 1988; A. Gersho, "Advances in speech and audio compression", Proceedings IEEE, 82:900-918, 1994; and P. Kroon and W. B. Kleijn, "Linear-prediction based analysis-by-synthesis coding", In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, pages 79-119. Elsevier Science Publishers, Amsterdam, 1995.
In the LPAS paradigm, the speech signal is reconstructed by exciting an adaptive synthesis filter with an excitation signal. The adaptive synthesis filter, which has an all-pole structure, is specified by the so-called linear prediction (LP) coefficients, which are adapted once per subframe (a subframe is typically 2 to 5 ms). The LP coefficients are estimated from the original signal once per frame (10 to 25 ms) and their value for each subframe is computed by interpolation. Information about the LP coefficients is usually transmitted once per frame. The excitation is the sum of two components: the adaptive-codebook (for the present purpose identical to the long-term predictor) contribution, and the fixed-codebook contribution.
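As a minimal illustration of the synthesis step only (this sketch is not taken from any cited standard, and subframe handling, coefficient interpolation, and quantization are all omitted), the reconstruction by an all-pole synthesis filter can be written in Python as:

```python
def synthesize(excitation, lp_coeffs):
    # All-pole synthesis 1/A(z): y[n] = e[n] + sum_k a[k] * y[n-1-k],
    # where lp_coeffs holds the (assumed fixed) LP coefficients a[k].
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lp_coeffs):
            if n - 1 - k >= 0:
                acc += a * y[n - 1 - k]
        y.append(acc)
    return y
```

For example, a single-coefficient filter turns an impulse excitation into a decaying exponential, which shows how the short-term spectral envelope is imposed on the excitation.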
The adaptive-codebook contribution is determined by selecting, for the present subframe, that segment of the past excitation which, after filtering with the synthesis filter, results in a reconstructed signal most similar to the original acoustic signal. The fixed-codebook contribution is the entry from a codebook of excitation vectors which, given the adaptive-codebook contribution, renders the resulting reconstructed signal most similar to the original signal. In addition to the above process, the adaptive and fixed-codebook contributions are scaled by quantized scaling factors.
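The adaptive-codebook search can be sketched as follows. This Python fragment is a simplification, not the procedure of any cited coder: the filtering of each candidate through the synthesis filter is omitted, so the match is scored directly in the excitation domain with the usual gain-optimized criterion (squared correlation over candidate energy).

```python
def search_adaptive_codebook(past_exc, target, min_lag, max_lag):
    # For each candidate lag, take the subframe-length segment of the past
    # excitation starting `lag` samples back, and score it with the
    # gain-optimized criterion <t,s>^2 / <s,s>.  Assumes min_lag >= len(target)
    # so each candidate segment lies entirely in the past.
    L = len(target)
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        start = len(past_exc) - lag
        seg = past_exc[start:start + L]
        corr = sum(t * s for t, s in zip(target, seg))
        energy = sum(s * s for s in seg) or 1e-12
        score = corr * corr / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For a periodic excitation, the selected lag coincides with the signal period, which is why the adaptive codebook captures the harmonic fine structure so effectively.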
The above description of the LPAS paradigm is applicable to almost all state-of-the-art coders. Examples of such coders are the 8 kb/s ITU G.729 (see R. Salami, C. Laflamme, J.-P. Adoul, and D. Massaloux, "A toll quality 8 kb/s speech codec for the personal communications system (PCS)", IEEE Trans. Vehic. Techn., 43(3):808-816, 1994; and R. Salami et al., "Description of the proposed ITU-T 8 kb/s speech coding standard", Proc. IEEE Speech Coding Workshop, pages 3-4, Annapolis, Md., 1995) and the GSM enhanced full-rate (GSMEFR) 12.2 kb/s coder (see European Telecommun. Standard Institute (ETSI), "Enhanced Full Rate (EFR) speech transcoding (GSM 06.60)", ETSI Technical Standard 300 726, 1996). Both of these coders perform well for speech signals. However, for music signals both coders contain clearly audible artifacts, more so for the lower-rate coder. For each of these coders the entire bit stream must be obtained by the receiver to allow reconstruction.
The 16 kb/s ITU G.728 coder differs from the above paradigm outline in that the LP parameters are computed from the past reconstructed signal, and thus are not required to be transmitted. This is commonly referred to as backward LP adaptation. Only a fixed codebook is used. In contrast to other coders (which use a linear prediction order of 10), a linear prediction order of 50 is used. This high prediction order allows better performance for non-speech sounds than the G.729 and GSMEFR coders. However, because of the backward-adaptive structure, the coder is more sensitive to channel errors than the G.729 and GSMEFR coders, making it less attractive for mobile telephony environments. Furthermore, the entire bit stream must be obtained by the G.728 receiver to allow reconstruction.
The TIA IS-127 is a multi-rate coding standard aimed at mobile telephony. While this standard has increased bit-rate flexibility, it does not allow the bit stream to be modified between transmitter and receiver. Thus, the decision about the bit rate must be made in the transmitter. The coding paradigm is slightly different from the above paradigm outline, but these differences (see, e.g., D. Nahumi and W. B. Kleijn, "An improved 8 kb/s RCELP coder", Proc. IEEE Speech Coding Workshop, pages 39-40, Annapolis, Md., 1995; and W. B. Kleijn, P. Kroon, and D. Nahumi, "The RCELP speech coding algorithm", European Trans. on Telecomm., 4(5):573-582, 1994) do not affect the accuracy of non-speech sounds significantly.
Because of the aforementioned constraints on performance with current approaches, there are only very few practical coder designs which allow the bit stream to be modified between transmitter and receiver. Some examples of these approaches are found in: R. Drogo de Iacovo and D. Sereno, "CELP coding at 6.55 kbit/s for digital mobile radio communications", Proc. IEEE Global Telecomm. Conf., page 405.6, 1990; S. Zhang and G. Lockhart, "Embedded scheme for regular pulse excited (RPE) linear predictive coding", Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 37-40, Detroit, 1995; A. Le Guyader, C. Lamblin, and E. Boursicaut, "Embedded algebraic CELP/VSELP coders for wideband speech coding", Speech Comm., 16(4):219-328, 1995; and B. Tang, A. Shen, A. Alwan, and G. Pottie, "A perceptually-based embedded subband speech coder", IEEE Trans. Speech and Audio Process., 5(2):131-140, 1997. In all of these examples, the coding efficiency is low compared to fixed-rate coders because either the adaptive codebook is omitted altogether, or because the adaptive codebook operates only on the primary excitation signal. This relatively low performance of LPAS coders using this approach is illustrated by the usage of a subband coder in the recent work on embedded coding by Tang et al., cited above. While subband coders do not perform as well at a fixed rate, their performance is apparently competitive when embedded coding systems are needed.
At rates above 16 kb/s, acoustic signal coders tend to be aimed at the coding of music. In contrast to the aforementioned LPAS-based coders, these higher rate coders generally use a sampling rate higher than 8 kHz. Most of these coders are based on the well-known subband and transform coding principles. A state-of-the-art example of a hybrid multi-rate (16, 24, and 32 kb/s) coder using both linear prediction and transform coding is presented in J.-H. Chen, "A candidate coder for the ITU-T's new wideband speech coding standard", Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 1359-1362, Atlanta, 1997. Examples of higher rate transform and subband coding schemes are given in: K. Gosse, F. Moreau de Saint-Martin, X. Durot, P. Duhamel, and J. B. Rault, "Subband audio coding with synthesis filters minimizing a perceptual distortion", Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 347-350, Munich, 1997; M. Purat and P. Noll, "Audio coding with dynamic wavelet packet decomposition based on frequency-varying modulated lapped transforms", Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 1021-1024, Atlanta, 1996; J. Princen and J. Johnston, "Audio coding using signal adaptive filterbanks", Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 3071-3074, Detroit, 1995; and N. S. Jayant, J. Johnston, and R. Safranek, "Signal compression based on models of human perception", Proc. IEEE, 81(10):1385-1421, 1993. Particularly at rates beyond 30 kb/s, these coding procedures perform well for music, and they can also be expected to do well for background noise. At lower rates, the coders suffer from either tonal or wideband noise. Unfortunately, the higher bit rates are too high for most mobile telephony applications.
At the rates commonly used for mobile telephony (8 to 16 kb/s), the performance of the transform and subband coding algorithms degrades below what can be obtained with LPAS based coding. Because of the lack of long-term feedback, these higher rate algorithms are more suited to embedded coding with conventional techniques than the LPAS coding paradigm, as is illustrated by the procedures given in B. Tang, A. Shen, A. Alwan, and G. Pottie, "A perceptually-based embedded subband speech coder", IEEE Trans. Speech and Audio Process., 5(2):131-140, 1997.
The foregoing discussion illustrates two problems. The first is the relatively low performance of speech coders operating at rates below 16 kb/s, particularly for non-speech sounds such as music. The second problem is the difficulty of constructing an efficient coder (at rates applicable for mobile telephony) which allows the lowering of the bit rate between transmitter and receiver.
The first problem results from the limitations of the LPAS paradigm. The LPAS paradigm is tailored for speech signals, and, in its current form, does not perform well for other signals. While the ITU G.728 coder performs better for such non-speech signals (because it uses backward LP adaptation), it is more sensitive to channel errors, making it less attractive for mobile telephony applications. Higher rate coders (subband and transform coders) do not suffer from the aforementioned quality problems for non-speech sounds, but their bit rates are too high for mobile telephony.
The second problem results from the approach used until now for creating primary and auxiliary bit streams in LPAS coding. In this conventional approach, the excitation signal is separated into primary and auxiliary excitations. With this approach, the long-term feedback mechanism of the LPAS coder loses efficiency compared to non-embedded coding systems. As a result, embedded coding is rarely used for LPAS coding systems.
The functionality of the present invention provides for the estimation of enhancement information such as an adaptive equalization operator, which renders an acoustical signal (that has been coded and reconstructed with a primary coding algorithm) more similar to the original signal. The equalization operator modifies the signal by means of a linear or non-linear filtering operation, or a blockwise approximation thereof. The invention also provides the encoding of the adaptive equalization operator, while allowing for some coding error, by means of a bit stream which may be separable from the bit stream of the primary coding algorithm. The invention further provides the decoding of the adaptive equalization operator by the system receiver, and the application, at the receiver, of the decoded adaptive equalization operator to the acoustical signal that has been coded and reconstructed with a primary coding algorithm.
The adaptive equalization operator differs from postfilters (see V. Ramamoorthy and N. S. Jayant, "Enhancement of ADPCM speech by adaptive postfiltering", AT&T Bell Labs. Tech. J., pages 1465-1475, 1984; and J.-H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech", IEEE Trans. Speech Audio Process., 3(1):59-71, 1995) in that a criterion is optimized and in that information concerning the operator is transmitted. The adaptive equalization operator differs from the enhancement methods used in conventional embedded coding in that the equalization operator does not add a correction to the signal. Instead, the equalization operator is typically implemented by filtering with an adaptive filter, or by multiplying short-time spectra with a transfer function. Thus, the correction to the signal is of a multiplicative nature rather than an additive nature.
The invention allows the correction of distortion resulting from the primary encoding/decoding process for primary coders which attempt to model the signal waveform. The structure of the adaptive equalization operator is generally chosen to address shortcomings of the primary coder structure (for example, the inadequacies in modeling non-speech sounds by LPAS coders). This addresses the first problem mentioned above.
The invention allows increased flexibility in the bit rate. In one embodiment, only the bit stream associated with the primary coder is required for reconstruction of the signal. The auxiliary bit stream associated with the adaptive equalization operator can be omitted anywhere between transmitter and receiver. The reconstructed signal will be enhanced whenever the auxiliary bit stream reaches the decoder. In another embodiment, the bit stream associated with the adaptive equalization operator is required at the receiver and therefore cannot be omitted.