Acoustic coding technology, which compresses a music signal or speech signal at a low bit rate, is important for making effective use of the capacity of transmission paths such as radio waves in mobile communication, and of recording media. Speech coding methods for coding a speech signal include G726 and G729, which are standardized by the ITU (International Telecommunication Union). These methods can code a narrowband signal (300 Hz to 3.4 kHz) with high quality at a bit rate of 8 kbit/s to 32 kbit/s.
Furthermore, there are standard methods for coding a wideband signal (50 Hz to 7 kHz), such as G722 and G722.1 of the ITU and AMR-WB of the 3GPP (The 3rd Generation Partnership Project). These methods can code a wideband speech signal with high quality at a bit rate of 6.6 kbit/s to 64 kbit/s.
CELP (Code Excited Linear Prediction) is a method for coding a speech signal efficiently at a low bit rate. Based on an engineering model that simulates human speech production, CELP passes an excitation signal expressed by random numbers or a pulse train through a pitch filter corresponding to the strength of periodicity and a synthesis filter corresponding to the vocal tract characteristic, and determines coding parameters so that the squared error between the output signal and the input signal is minimized under perceptual weighting. (For example, see “Code-Excited Linear Prediction (CELP): high quality speech at very low bit rates”, Proc. ICASSP 85, pp. 937-940, 1985.)
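The analysis-by-synthesis idea behind CELP can be illustrated with a minimal sketch. It is not the procedure of any standard codec: the random codebook, the one-tap pitch filter, the first-order synthesis filter, and all function names are assumptions made for illustration, and perceptual weighting is omitted for brevity.

```python
import random

def pitch_filter(excitation, lag, gain):
    """Long-term (pitch) filter: y[n] = x[n] + gain * y[n - lag]."""
    out = []
    for n, x in enumerate(excitation):
        past = out[n - lag] if n >= lag else 0.0
        out.append(x + gain * past)
    return out

def synthesis_filter(signal, lpc):
    """All-pole vocal tract filter: y[n] = x[n] - sum_i lpc[i] * y[n-1-i]."""
    out = []
    for n, x in enumerate(signal):
        acc = x
        for i, a in enumerate(lpc):
            if n - 1 - i >= 0:
                acc -= a * out[n - 1 - i]
        out.append(acc)
    return out

def synthesize(code, lag, pitch_gain, lpc):
    """Pass one excitation vector through both filters (zero initial state)."""
    return synthesis_filter(pitch_filter(code, lag, pitch_gain), lpc)

def search_codebook(target, codebook, lag, pitch_gain, lpc):
    """Return (index, gain, squared_error) of the best excitation vector."""
    best = None
    for index, code in enumerate(codebook):
        shape = synthesize(code, lag, pitch_gain, lpc)
        energy = sum(s * s for s in shape)
        if energy == 0.0:
            continue
        # Optimal gain for this vector in the least-squares sense.
        gain = sum(t * s for t, s in zip(target, shape)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, shape))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Hypothetical setup: 16 random excitation vectors, one frame of 40 samples.
rng = random.Random(0)
frame_len = 40
codebook = [[rng.gauss(0.0, 1.0) for _ in range(frame_len)] for _ in range(16)]
lpc = [-0.9]               # assumed filter: y[n] = x[n] + 0.9 * y[n-1]
lag, pitch_gain = 8, 0.5   # assumed pitch parameters

# Build a target from a known vector so the search result can be checked.
target = [0.7 * s for s in synthesize(codebook[5], lag, pitch_gain, lpc)]
index, gain, error = search_codebook(target, codebook, lag, pitch_gain, lpc)
```

Because the target here was synthesized from codebook entry 5 scaled by 0.7, the search recovers that index with a gain close to 0.7 and a squared error near zero. With a real speech frame the error would not vanish, and the selected index and gain would be the parameters to transmit.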
Many recent standard speech coding methods are based on CELP. For example, G729 can code a narrowband signal at a bit rate of 8 kbit/s, and AMR-WB can code a wideband signal at a bit rate of 6.6 kbit/s to 23.85 kbit/s.
On the other hand, audio coding, in which a music signal is encoded, generally uses transform coding, which transforms the music signal to the frequency domain and encodes the transformed coefficients using a psychoacoustic model; examples are MPEG-1 Layer 3 coding and AAC coding standardized by MPEG (Moving Picture Experts Group). These methods are known to produce almost no deterioration at a bit rate of 64 kbit/s to 96 kbit/s per channel for a signal having a sampling rate of 44.1 kHz.
However, when a signal consisting predominantly of speech with music and environmental sound superimposed in the background is encoded, applying speech coding has the problem that not only the background signal but also the speech signal deteriorates under the influence of the background music and environmental sound, degrading the overall quality. This problem arises because speech coding is based on CELP, a method specialized for a speech model. A further problem is that the signal band to which speech coding is applicable extends only up to 7 kHz, and signals having higher frequencies cannot be covered for structural reasons.
Music coding (audio coding) methods, by contrast, allow high quality coding of music and can therefore also obtain sufficient quality for the aforementioned speech signal with music and environmental sound in the background. Furthermore, audio coding is applicable to target signals with a frequency band of up to approximately 22 kHz, which is equivalent to CD quality.
However, realizing high quality coding requires a high bit rate, and if the bit rate is reduced to as low as approximately 32 kbit/s, the quality of the decoded signal degrades drastically. As a result, the method cannot be used for a communication network having a low transmission bit rate.
To avoid the problems described above, these technologies can be combined in scalable coding: an input signal is first coded in a base layer using CELP, a residual signal is then calculated by subtracting the decoded signal from the input signal, and transform coding is carried out on this residual signal in an enhancement layer.
With this method, the base layer uses CELP and can therefore code a speech signal with high quality, while the enhancement layer can efficiently code the music and environmental sound in the background that the base layer cannot express, as well as signals with frequency components higher than the frequency band covered by the base layer. This configuration also makes it possible to keep the bit rate low. In addition, it allows an acoustic signal to be decoded from only part of the coded code, that is, the coded code of the base layer alone, and such a scalable function is effective in realizing multicasting to a plurality of networks having different transmission bit rates.
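The layered structure described above can be sketched as follows. This is only a structural illustration under loud assumptions: a coarse uniform quantizer stands in for the CELP base layer and a finer uniform quantizer stands in for the transform-coding enhancement layer; the step sizes and function names are hypothetical.

```python
import math

def quantize(samples, step):
    """Map each sample to the index of the nearest quantizer level."""
    return [round(s / step) for s in samples]

def dequantize(indices, step):
    """Reconstruct sample values from quantizer indices."""
    return [i * step for i in indices]

def encode_scalable(signal, base_step, enh_step):
    """Code the signal in a base layer, then code the residual on top."""
    base_code = quantize(signal, base_step)            # stand-in for CELP
    base_decoded = dequantize(base_code, base_step)
    residual = [x - b for x, b in zip(signal, base_decoded)]
    enh_code = quantize(residual, enh_step)            # stand-in for transform coding
    return base_code, enh_code

def decode_scalable(base_code, enh_code, base_step, enh_step):
    """Decode from the base layer alone, or from both layers if available."""
    out = dequantize(base_code, base_step)
    if enh_code is not None:
        refinement = dequantize(enh_code, enh_step)
        out = [b + r for b, r in zip(out, refinement)]
    return out

def max_error(reference, decoded):
    return max(abs(r - d) for r, d in zip(reference, decoded))

# Hypothetical test signal and step sizes.
signal = [math.sin(0.3 * n) for n in range(64)]
base_step, enh_step = 0.5, 0.05
base_code, enh_code = encode_scalable(signal, base_step, enh_step)

# A low-rate receiver uses only the base layer; a high-rate one uses both.
base_only = decode_scalable(base_code, None, base_step, enh_step)
both_layers = decode_scalable(base_code, enh_code, base_step, enh_step)
base_error = max_error(signal, base_only)
full_error = max_error(signal, both_layers)
```

Decoding with `enh_code=None` corresponds to a receiver on a low bit rate network that obtains only the base layer code; adding the enhancement layer reduces the reconstruction error without changing the base layer code.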
However, such scalable coding has a problem that delays in the enhancement layer increase. This problem will be explained using FIG. 1 and FIG. 2. FIG. 1 illustrates an example of frames of a base layer (base frames) and frames of an enhancement layer (enhancement frames) in conventional speech coding. FIG. 2 illustrates an example of frames of a base layer (base frames) and frames of an enhancement layer (enhancement frames) in conventional speech decoding.
In conventional speech coding, the base frames and enhancement frames have an identical time length. In FIG. 1, an input signal input from time T(n−1) to T(n) becomes the nth base frame and is encoded in the base layer. A residual signal from time T(n−1) to T(n) is likewise coded in the enhancement layer.
Here, when an MDCT (modified discrete cosine transform) is used in the enhancement layer, it is necessary to make two successive MDCT analysis frames overlap with each other by half the analysis frame length. This overlapping is performed to prevent discontinuity between the frames in the synthesis process.
In the case of an MDCT, the orthogonal basis is designed to maintain orthogonality not only within an analysis frame but also between successive analysis frames, and therefore overlapping successive analysis frames with each other and adding up the two in the synthesis process prevents distortion from occurring due to discontinuity between frames. In FIG. 1, the nth analysis frame is set to span time T(n−2) to T(n) and coding processing is performed on it.
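The overlap-and-add behavior of the MDCT can be checked numerically with a small sketch. The sine window, the frame size, and the test signal are assumptions chosen for illustration, the transform is computed directly from its definition rather than with a fast algorithm, and no quantization is applied between analysis and synthesis.

```python
import math

def sine_window(hop):
    """Sine window of length 2*hop; satisfies w[n]^2 + w[n+hop]^2 = 1."""
    length = 2 * hop
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame, window):
    """Windowed MDCT: 2*hop input samples -> hop coefficients."""
    hop = len(frame) // 2
    coeffs = []
    for k in range(hop):
        acc = 0.0
        for n in range(2 * hop):
            angle = math.pi / hop * (n + 0.5 + hop / 2) * (k + 0.5)
            acc += window[n] * frame[n] * math.cos(angle)
        coeffs.append(acc)
    return coeffs

def imdct(coeffs, window):
    """Windowed IMDCT: hop coefficients -> 2*hop time-aliased samples."""
    hop = len(coeffs)
    out = []
    for n in range(2 * hop):
        acc = 0.0
        for k in range(hop):
            angle = math.pi / hop * (n + 0.5 + hop / 2) * (k + 0.5)
            acc += coeffs[k] * math.cos(angle)
        out.append(window[n] * acc * 2.0 / hop)
    return out

def overlap_add(signal, hop, num_frames):
    """Analyze and resynthesize frames that overlap by half their length."""
    window = sine_window(hop)
    out = [0.0] * len(signal)
    for f in range(num_frames):
        start = f * hop
        frame = signal[start:start + 2 * hop]
        decoded = imdct(mdct(frame, window), window)
        for n, value in enumerate(decoded):
            out[start + n] += value
    return out

hop = 8
signal = [math.sin(0.37 * n) + 0.5 * math.cos(1.1 * n) for n in range(4 * hop)]

full = overlap_add(signal, hop, 3)     # frames starting at 0, 8 and 16
partial = overlap_add(signal, hop, 2)  # third frame not yet available

# Samples covered by two overlapping frames are reconstructed.
covered_error = max(abs(full[i] - signal[i]) for i in range(hop, 3 * hop))
# The second half of the last available frame is still aliased.
pending_error = max(abs(partial[i] - signal[i]) for i in range(2 * hop, 3 * hop))
```

In this sketch `covered_error` is at the level of floating-point rounding, while `pending_error` is large: the second half of a synthesized frame contains time-domain aliasing that is cancelled only when the first half of the following frame is added to it.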
Decoding processing generates a decoded signal consisting of the nth base frame and the nth enhancement frame. The enhancement layer performs an IMDCT (inverse modified discrete cosine transform) and, as described above, the decoded signal of the nth enhancement frame must be overlapped with the decoded signal of the preceding frame (the (n−1)th enhancement frame in this case) by half the synthesis frame length and the two added up. For this reason, the decoding processing section can generate the signal only up to time T(n−1).
That is, a delay of the same length as the base frame (a time length of T(n)−T(n−1) in this case) occurs, as shown in FIG. 2. If the time length of the base frame is assumed to be 20 ms, the delay newly produced in the enhancement layer is 20 ms. Such an increase in delay constitutes a serious problem in realizing a speech communication service.
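The delay arithmetic above can be written out as a short sketch. It assumes uniformly spaced frame boundaries, T(n) = n × base frame length, which the text implies but does not state; the function names are hypothetical.

```python
def frame_boundary_ms(n, base_frame_ms):
    """Assumed uniform frame boundaries: T(n) = n * base frame length."""
    return n * base_frame_ms

def latest_decodable_ms(n, base_frame_ms):
    """Latest time fully decodable once the nth enhancement frame arrives.

    The nth analysis frame spans T(n-2)..T(n). Its second half,
    T(n-1)..T(n), still has to be overlapped with frame n+1, so the
    output is complete only up to T(n-1).
    """
    return frame_boundary_ms(n - 1, base_frame_ms)

def added_delay_ms(n, base_frame_ms):
    """Extra delay of the enhancement layer: T(n) - T(n-1)."""
    return frame_boundary_ms(n, base_frame_ms) - latest_decodable_ms(n, base_frame_ms)

# With the 20 ms base frame assumed in the text:
delay = added_delay_ms(5, 20)
```

Under this assumption the added delay equals the base frame length regardless of the frame index n, which matches the 20 ms figure given for a 20 ms base frame.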
As shown above, the conventional apparatus has a problem that it is difficult to perform coding on a signal which consists predominantly of speech with music and noise superimposed in the background, with a short delay, at a low bit rate and with high quality.