Mobile communication systems are required to compress speech signals to a low bit rate for transmission, in order to use radio wave resources effectively. At the same time, they are required to improve the quality of telephone speech and to provide telephone services enabling vivid communication. To achieve this, it is desirable not only to improve the quality of speech signals but also to encode, with high quality, signals other than speech signals, such as music signals having a wider bandwidth.
A promising technique for reconciling these two contradictory requirements involves hierarchically integrating a plurality of coding techniques. This technique uses a hierarchical combination of a first layer and a second layer: the first layer encodes an input signal at a low bit rate on the basis of a model suited to speech signals, and the second layer encodes a differential signal between the input signal and a decoded signal of the first layer on the basis of a model suited to signals other than speech signals. Such a hierarchical coding technique is generally referred to as scalable coding (layer coding) because the bit stream obtained by a coding apparatus exhibits scalability, that is, the property that a decoded signal can be obtained even from only part of the bit stream.
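The two-layer structure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: both layers are replaced by hypothetical uniform scalar quantizers (`quantize`, `dequantize`) with arbitrarily chosen step sizes, whereas an actual scalable codec would use a speech-model coder in the first layer and a coder suited to other signals in the second layer. The sketch only shows that the second layer encodes the difference between the input and the first layer decoded signal, and that decoding remains possible from the first layer alone.

```python
def quantize(samples, step):
    """Hypothetical stand-in for a layer encoder: uniform scalar quantizer."""
    return [round(x / step) for x in samples]


def dequantize(indices, step):
    """Hypothetical stand-in for a layer decoder."""
    return [i * step for i in indices]


def encode_scalable(signal, coarse_step=0.5, fine_step=0.05):
    """Encode with two layers; step sizes are illustrative assumptions."""
    layer1 = quantize(signal, coarse_step)
    decoded1 = dequantize(layer1, coarse_step)
    # The second layer encodes the differential signal between the input
    # signal and the first layer decoded signal.
    residual = [x - y for x, y in zip(signal, decoded1)]
    layer2 = quantize(residual, fine_step)
    return layer1, layer2


def decode_scalable(layer1, layer2=None, coarse_step=0.5, fine_step=0.05):
    """Decode from the first layer only, or from both layers if available."""
    decoded = dequantize(layer1, coarse_step)
    if layer2 is not None:
        refinement = dequantize(layer2, fine_step)
        decoded = [a + b for a, b in zip(decoded, refinement)]
    return decoded


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


signal = [0.0, 0.31, 0.62, 0.88, 0.47, -0.2, -0.73, -0.95]
layer1, layer2 = encode_scalable(signal)
core_only = decode_scalable(layer1)       # part of the bit stream
full = decode_scalable(layer1, layer2)    # whole bit stream
print("error, first layer only:", max_error(core_only, signal))
print("error, both layers:     ", max_error(full, signal))
```

Dropping `layer2` still yields a usable, if coarser, decoded signal, which is the scalability property; supplying it reduces the error.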
By its nature, such a scalable coding system can flexibly handle communication between networks having different bit rates, and can thus be regarded as suitable for future network environments in which a variety of networks will be integrated through IP.
NPL 1 discloses an example in which scalable coding is implemented using a technique standardized by Moving Picture Experts Group phase-4 (MPEG-4). In this technique, the first layer uses code excited linear prediction (CELP) coding, which is suited to speech signals, and the second layer applies transform coding, such as Advanced Audio Coding (AAC) or transform domain weighted interleave vector quantization (TwinVQ), to a residual signal obtained by subtracting the first layer decoded signal from the original signal.
With the use of such a scalable configuration, the quality of speech signals and the quality of music signals and other such signals having a wider bandwidth than that of the speech signals can be improved.
In the case where the transform coding is applied to at least one layer in the layer coding as described above, coding distortion that is caused by the transform coding at the start point (or the end point) of the speech signal propagates over an entire frame, and this coding distortion unfavorably decreases the sound quality. The coding distortion caused at this time is referred to as pre-echo (or post-echo).
FIG. 1 shows a state where a decoded signal is generated in the case of encoding and decoding the start point of a speech signal with the use of scalable coding including two layers. Here, the first layer adopts CELP in which an excitation signal is encoded for each sub-frame of 5 ms, and the second layer adopts transform coding performed for each frame of 20 ms.
When the time length of the signal to be encoded is as short as 5 ms, as in the first layer, the coding interval is short; this case is hereinafter described as “the temporal resolution is high”. When the time length of the signal to be encoded is as long as 20 ms, as in the second layer, the coding interval is long; this case is hereinafter described as “the temporal resolution is low”.
In the first layer, a decoded signal can be generated on a 5-ms basis, and hence coding distortion propagates over no more than 5 ms (see FIG. 1(a)). In the second layer, on the other hand, coding distortion propagates over the full 20 ms of the frame. The first half of this frame is actually an inactive speech section, so the second layer decoded signal should be generated only in the latter half of the frame. Nevertheless, if the bit rate cannot be made sufficiently high, a waveform also appears in the first half due to the coding distortion (see FIG. 1(b)). In general, the frame length needs to be set to 20 ms or more in order to obtain high coding efficiency in transform coding. The temporal resolution of transform coding is therefore lower than that of CELP, which is a disadvantage.
When a final decoded signal is calculated by adding the first layer decoded signal to the second layer decoded signal, the coding distortion remains in section A of the decoded signal (see FIG. 1(c)), resulting in a decrease in sound quality. Such a phenomenon occurs at the start point of a speech signal (or a music signal), and this coding distortion is referred to as pre-echo. Note that similar coding distortion occurs also at the end point of a speech signal (or a music signal), and this coding distortion is referred to as post-echo.
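The dependence of distortion spread on the coding interval can be illustrated with a toy experiment. The following is a sketch under stated assumptions, not the coding described in this document: the transform is an orthonormal DCT-II, the low bit rate is imitated by keeping only the lowest-frequency coefficients (the hypothetical parameter `kept`), an 8-sample frame stands in for 20 ms, and 2-sample blocks stand in for 5-ms sub-frames. The bit rates of the two cases are not matched. The frame is silent in its first half and active in its latter half.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * acc)
    return out


def idct(c):
    """Inverse of dct() above."""
    n = len(c)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            acc += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(acc)
    return out


def transform_code(block, kept):
    """Toy low-rate codec (assumption): keep the `kept` lowest coefficients."""
    coeffs = [c if k < kept else 0.0 for k, c in enumerate(dct(block))]
    return idct(coeffs)


def code_in_blocks(frame, block_len, kept):
    """Code the frame block by block with the given analysis length."""
    decoded = []
    for start in range(0, len(frame), block_len):
        decoded.extend(transform_code(frame[start:start + block_len], kept))
    return decoded


def energy(x):
    return sum(v * v for v in x)


# First half: inactive section. Latter half: start of the active signal.
frame = [0.0, 0.0, 0.0, 0.0, 1.0, 0.8, 0.9, 0.7]

long_decoded = code_in_blocks(frame, block_len=8, kept=2)   # low temporal resolution
short_decoded = code_in_blocks(frame, block_len=2, kept=1)  # high temporal resolution

long_pre_echo = energy(long_decoded[:4])    # distortion in the inactive half
short_pre_echo = energy(short_decoded[:4])
print("inactive-half energy, long frame:  ", long_pre_echo)
print("inactive-half energy, short blocks:", short_pre_echo)
```

With the long analysis length, the error made in representing the latter half leaks into the inactive first half, which corresponds to the pre-echo in section A. With the short blocks, the blocks covering the inactive half contain only zeros and are decoded as zeros, so the distortion stays within the blocks that contain the signal.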
A method for avoiding the occurrence of such pre-echoes involves detecting the start point of a speech signal and switching, if the start point is detected, to a process of making the frame length (analysis length) of transform coding shorter. PTL 1 discloses a start point detecting method in which: the start point of a speech signal is detected on the basis of a temporal change in gain information of CELP in a first layer; and information on the detected start point is reported to a second layer.
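As a rough illustration of detecting a start point from the temporal change in gain, the sketch below flags the first sub-frame whose gain rises sharply relative to the preceding sub-frame and then selects a shorter analysis length. The detection rule, the function name `detect_start_point`, the threshold and floor values, and the 5-ms short analysis length are all assumptions made for this example; the actual criterion used in PTL 1 is not described here.

```python
def detect_start_point(subframe_gains, ratio_threshold=4.0, floor=1e-3):
    """Return the index of the first sub-frame whose gain exceeds the previous
    sub-frame's gain by more than `ratio_threshold` times, or None.

    The ratio rule and default values are hypothetical, for illustration only.
    """
    for i in range(1, len(subframe_gains)):
        previous = max(subframe_gains[i - 1], floor)  # avoid division by ~0
        if subframe_gains[i] / previous > ratio_threshold:
            return i
    return None


def choose_analysis_length_ms(onset_index, long_ms=20, short_ms=5):
    """Switch to the shorter analysis length when a start point was detected.

    The 5-ms short length is an assumed value.
    """
    return short_ms if onset_index is not None else long_ms


# Hypothetical first layer gain values, one per sub-frame.
gains_with_onset = [0.01, 0.012, 0.011, 0.35, 0.40, 0.38]
gains_steady = [0.30, 0.31, 0.29, 0.30]

onset = detect_start_point(gains_with_onset)
no_onset = detect_start_point(gains_steady)
print(onset, choose_analysis_length_ms(onset))
print(no_onset, choose_analysis_length_ms(no_onset))
```

A ratio test is used here rather than an absolute difference so that the decision does not depend on the overall signal level; other criteria are equally possible.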
In this way, the temporal resolution is increased by shortening the analysis length at the start point. As a result, the propagation of coding distortion can be kept small, and the occurrence of pre-echoes can be avoided.
The above-mentioned method, however, requires switching between analysis lengths, a frequency transform method suited to each of the two analysis lengths, and a quantization method for the resulting transform coefficients, and hence the complexity of processing is unfavorably increased.
In addition, PTL 1 does not disclose a specific method for avoiding pre-echoes using the information on the detected start point, and hence pre-echoes cannot be avoided on the basis of PTL 1.
Meanwhile, PTL 2 discloses a method for avoiding the occurrence of pre-echoes in which an amplification factor by which each decoded signal is to be multiplied is obtained on the basis of the relation between the energy envelopes of the decoded signals of a first layer and a second layer, and each decoded signal is multiplied by the obtained amplification factor.