One scheme capable of efficiently encoding a speech signal or music signal over the full band (FB) of 0.02 to 20 kHz is a technique standardized by ITU-T (International Telecommunication Union Telecommunication Standardization Sector). This technique transforms an input signal into a frequency-domain signal and encodes a band of up to 20 kHz (transform coding).
Here, transform coding is a coding scheme that transforms an input signal from a time domain into a frequency domain using time/frequency transformation such as discrete cosine transform (DCT) or modified discrete cosine transform (MDCT) to enable a signal to be mapped in precise correspondence with auditory characteristics.
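As an illustration of the time/frequency transformation mentioned above, the following is a minimal sketch of a direct (unoptimized) MDCT that maps 2N time-domain samples to N spectral coefficients. The function name `mdct` is hypothetical, and windowing and frame overlap, which a practical codec would apply, are omitted here for brevity.

```python
import math


def mdct(frame):
    """Naive O(N^2) MDCT: 2N time samples in, N spectral coefficients out.

    Illustrative only: no analysis window and no overlap handling.
    """
    two_n = len(frame)
    if two_n == 0 or two_n % 2:
        raise ValueError("frame length must be a positive even number")
    n_half = two_n // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, x in enumerate(frame):
            # Standard MDCT kernel: cos[(pi/N) (n + 1/2 + N/2) (k + 1/2)]
            acc += x * math.cos(
                math.pi / n_half * (n + 0.5 + n_half / 2.0) * (k + 0.5)
            )
        coeffs.append(acc)
    return coeffs
```

The direct form is chosen for readability; a real implementation would normally use an FFT-based fast algorithm.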
In transform coding, the spectral coefficients are split into a plurality of frequency subbands. When each subband is coded, allocating more quantization bits to a band that is perceptually important to human ears makes it possible to improve overall sound quality.
To this end, studies are being carried out on efficient bit allocation schemes, and for example, a technique disclosed in Non-Patent Literature (hereinafter, referred to as “NPL”) 1 is known. Hereinafter, the bit allocation scheme disclosed in NPL 1 will be described using FIG. 1 and FIG. 2.
FIG. 1 is a block diagram illustrating a configuration of a speech/audio coding apparatus disclosed in NPL 1. An input signal sampled at 48 kHz is inputted to transient detector 11 and transformation section 12 of the speech/audio coding apparatus.
Transient detector 11 detects, from the input signal, either a transient frame corresponding to a leading edge or an end edge of speech or a stationary frame corresponding to a speech section other than that. Transformation section 12 applies low-frequency resolution transformation to the frame of the input signal when the frame detected by transient detector 11 is a transient frame, and high-frequency resolution transformation when it is a stationary frame, and acquires a spectral coefficient (or transform coefficient).
Norm estimation section 13 splits the spectral coefficient obtained in transformation section 12 into bands of different bandwidths. Norm estimation section 13 estimates a norm (or energy) of each split band.
Norm quantization section 14 determines a spectral envelope made up of the norms of all bands based on the norm of each band estimated by norm estimation section 13 and quantizes the determined spectral envelope.
Spectrum normalization section 15 normalizes the spectral coefficient obtained by transformation section 12 according to the norm quantized by norm quantization section 14.
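The per-band norm estimation and spectrum normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the norm is taken to be the root-mean-square value of the coefficients in each band, the function names `band_norms` and `normalize` and the `floor` parameter are hypothetical, and the unquantized norm is used for normalization, whereas the apparatus described here normalizes by the quantized norm.

```python
import math


def band_norms(coeffs, band_widths):
    """Return one norm per band (assumed here to be the RMS value)."""
    norms = []
    start = 0
    for width in band_widths:
        band = coeffs[start:start + width]
        norms.append(math.sqrt(sum(c * c for c in band) / width))
        start += width
    return norms


def normalize(coeffs, band_widths, norms, floor=1e-9):
    """Divide each band by its norm; `floor` guards against division by zero."""
    out = []
    start = 0
    for width, norm in zip(band_widths, norms):
        scale = max(norm, floor)
        out.extend(c / scale for c in coeffs[start:start + width])
        start += width
    return out


# Usage: two bands of different widths, as in the variable-width split above.
spectrum = [3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
widths = [2, 4]
norms = band_norms(spectrum, widths)
flat = normalize(spectrum, widths, norms)
```

After normalization every band has unit RMS, so the lattice-vector coder only needs to represent the fine structure while the envelope is carried by the norms.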
Norm adjustment section 16 adjusts the norm quantized by norm quantization section 14 based on adaptive spectral weighting.
Bit allocation section 17 allocates available bits for each band in a frame using the quantization norm adjusted by norm adjustment section 16.
Lattice-vector coding section 18 performs lattice-vector coding on the spectral coefficient normalized by spectrum normalization section 15 using bits allocated for each band by bit allocation section 17.
Noise level adjustment section 19 estimates the level of the spectral coefficient before coding in lattice-vector coding section 18 and encodes the estimated level. A noise level adjustment index is obtained in this way.
Multiplexer 20 multiplexes the frame configuration of the input signal acquired by transformation section 12 (that is, a transient signal flag indicating whether the frame is a stationary frame or transient frame), the norm quantized by norm quantization section 14, the lattice coding vector obtained by lattice-vector coding section 18, and the noise level adjustment index obtained by noise level adjustment section 19 to form a bit stream, and transmits the bit stream to a speech/audio decoding apparatus.
FIG. 2 is a block diagram illustrating a configuration of the speech/audio decoding apparatus disclosed in NPL 1. The speech/audio decoding apparatus receives the bit stream transmitted from the speech/audio coding apparatus, and demultiplexer 21 demultiplexes the bit stream.
Norm de-quantization section 22 de-quantizes the quantized norm and acquires a spectral envelope made up of the norms of all bands. Norm adjustment section 23 adjusts the norm de-quantized by norm de-quantization section 22 based on adaptive spectral weighting.
Bit allocation section 24 allocates available bits for each band in a frame using the norms adjusted by norm adjustment section 23. That is, bit allocation section 24 recalculates the bit allocation needed to decode the lattice-vector code of the normalized spectral coefficient.
Lattice decoding section 25 decodes a transient signal flag, decodes the lattice coding vector based on the frame configuration indicated by the decoded transient signal flag and on the bits allocated by bit allocation section 24, and acquires a spectral coefficient.
Spectral-fill generator 26 regenerates a low-frequency spectral coefficient to which no bit has been allocated, using a codebook created based on the spectral coefficient decoded by lattice decoding section 25. Spectral-fill generator 26 uses the noise level adjustment index to adjust the level of the regenerated spectral coefficient. Furthermore, spectral-fill generator 26 regenerates a high-frequency uncoded spectral coefficient using a low-frequency coded spectral coefficient.
Adder 27 adds up the decoded spectral coefficient and the regenerated spectral coefficient, and generates a normalized spectral coefficient.
Envelope shaping section 28 applies the spectral envelope de-quantized by norm de-quantization section 22 to the normalized spectral coefficient generated by adder 27 and generates a full-band spectral coefficient.
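The operations of adder 27 and envelope shaping section 28 can be sketched as follows. This is a minimal illustration, assuming that the regenerated coefficients are zero wherever a coefficient was actually decoded and that envelope shaping is a per-band multiplication by the de-quantized norm; the function name `shape_envelope` is hypothetical.

```python
def shape_envelope(decoded, regenerated, band_widths, norms):
    """Add decoded and regenerated coefficients, then scale each band by its norm."""
    # Adder: combine the two normalized contributions sample by sample.
    normalized = [d + r for d, r in zip(decoded, regenerated)]

    # Envelope shaping: restore the per-band level removed at the encoder.
    full_band = []
    start = 0
    for width, norm in zip(band_widths, norms):
        full_band.extend(c * norm for c in normalized[start:start + width])
        start += width
    return full_band


# Usage: two bands of two coefficients each.
shaped = shape_envelope(
    decoded=[1.0, 0.0, 0.0, -1.0],
    regenerated=[0.0, 0.5, 0.5, 0.0],
    band_widths=[2, 2],
    norms=[2.0, 4.0],
)
```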
Inverse transformation section 29 applies inverse transform such as inverse modified discrete cosine transform (IMDCT) to the full-band spectral coefficient generated by envelope shaping section 28 to transform it into a time-domain signal. Here, inverse transform with high-frequency resolution is applied to a stationary frame, and inverse transform with low-frequency resolution is applied to a transient frame.
In G.719, the spectral coefficients are split into spectrum groups. Each spectrum group is split into bands of equal-length sub-vectors as shown in FIG. 3. The sub-vector length differs from one group to another and increases as the frequency increases.
Regarding transform resolution, higher frequency resolution is used for low frequencies, while lower frequency resolution is used for high frequencies. As described in G.719, the grouping allows an efficient use of the available bit-budget during encoding.
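The grouping described above can be sketched as a small table of (number of bands, coefficients per band) pairs. The specific group sizes used below are not given in this text; they are an assumption recalled from the G.719 band layout and should be checked against the Recommendation before being relied upon. The names `GROUPS`, `band_widths`, and `band_edges` are hypothetical.

```python
# (bands in group, coefficients per band), ordered from low to high frequency.
# ASSUMPTION: illustrative values, not taken from the text above.
GROUPS = [(16, 8), (8, 16), (12, 24), (8, 32)]


def band_widths(groups):
    """Expand the group table into one width per band."""
    return [width for count, width in groups for _ in range(count)]


def band_edges(widths):
    """Return cumulative coefficient indices marking the band boundaries."""
    edges = [0]
    for width in widths:
        edges.append(edges[-1] + width)
    return edges


widths = band_widths(GROUPS)
edges = band_edges(widths)
```

Because the sub-vector length grows from group to group, the high-frequency bands each cover more coefficients than the low-frequency bands, which reflects the lower frequency resolution used at high frequencies.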
In G.719, the bit allocation scheme is identical in a coding apparatus and a decoding apparatus. Here, the bit allocation scheme will be described using FIG. 4.
As shown in FIG. 4, in step (hereinafter abbreviated as “ST”) 31, quantized norms are adjusted prior to bit allocation to account for psycho-acoustical weighting and masking effects.
In ST32, subbands having a maximum norm are identified from among all subbands, and in ST33, one bit is allocated to each spectral coefficient of the subbands having the maximum norm. That is, each such subband receives as many bits as it has spectral coefficients.
In ST34, the norms are reduced according to the bits allocated, and in ST35, it is determined whether the remaining number of allocatable bits is 8 or more. When the remaining number of allocatable bits is 8 or more, the flow returns to ST32 and when the remaining number of allocatable bits is less than 8, the bit allocation procedure is terminated.
Thus, in the bit allocation scheme, available bits within a frame are allocated among subbands using the adjusted quantization norms. Normalized spectral coefficients are encoded by lattice-vector coding using the bits allocated to each subband.
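The loop of ST32 to ST35 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of G.719 itself: the norms are assumed to be in a logarithmic domain and to be reduced by a fixed amount `norm_step` each time a band receives bits, a band that no longer fits in the remaining budget is skipped, and ties are resolved in favor of the lowest band index. The function name `allocate_bits` and its parameters are hypothetical.

```python
def allocate_bits(norms, band_widths, total_bits, norm_step=1.0, min_bits=8):
    """Greedy allocation: repeatedly give one bit per coefficient to the
    band with the largest (adjusted) norm, then lower that band's norm."""
    work = list(norms)              # working copy; the caller's norms stay intact
    bits = [0] * len(work)
    remaining = total_bits

    while remaining >= min_bits:    # ST35: continue while 8 or more bits remain
        # ASSUMPTION: only bands that still fit in the budget are considered.
        candidates = [i for i, w in enumerate(band_widths) if w <= remaining]
        if not candidates:
            break
        best = max(candidates, key=lambda i: work[i])   # ST32: maximum norm
        bits[best] += band_widths[best]                 # ST33: 1 bit per coefficient
        remaining -= band_widths[best]
        work[best] -= norm_step                         # ST34: reduce the norm
    return bits, remaining


# Usage: three bands of 8, 8, and 16 coefficients sharing 40 bits.
bits, remaining = allocate_bits([5.0, 3.5, 4.5], [8, 8, 16], total_bits=40)
```

Because the same function can be run on the de-quantized norms at the decoder, the allocation itself need not be transmitted, which is consistent with the scheme being identical in the coding apparatus and the decoding apparatus.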
NPL 1
    ITU-T Recommendation G.719, “Low-complexity full-band audio coding for high-quality conversational applications,” ITU-T, 2009.
However, the above bit allocation scheme does not take into consideration input signal characteristics when grouping spectral bands, and therefore has a problem in that efficient bit allocation is not possible and further improvement of sound quality cannot be expected.