In encoding of speech or sound signals, schemes that perform encoding using linear prediction coefficients obtained by linear prediction analysis of input sound signals are widely employed.
For instance, according to Non-Patent Literatures 1 and 2, input sound signals in each frame are coded by either a frequency domain encoding method or a time domain encoding method. Whether to use the frequency domain or time domain encoding method is determined in accordance with the characteristics of the input sound signals in each frame.
In both the time domain and frequency domain encoding methods, linear prediction coefficients obtained by linear prediction analysis of the input sound signal are converted to a sequence of LSP parameters, which is then coded to obtain LSP codes, and a quantized LSP parameter sequence corresponding to the LSP codes is also generated. In the time domain encoding method, encoding is carried out by using linear prediction coefficients determined from a quantized LSP parameter sequence for the current frame and a quantized LSP parameter sequence for the preceding frame as the filter coefficients for a synthesis filter serving as a time-domain filter, applying the synthesis filter to a signal generated by synthesis of the waveforms contained in an adaptive codebook and the waveforms contained in a fixed codebook so as to determine a synthesized signal, and determining indices for the respective codebooks such that the distortion between the synthesized signal determined and the input sound signal is minimized.
In the frequency domain encoding method, a quantized LSP parameter sequence is converted to linear prediction coefficients to determine a quantized linear prediction coefficient sequence; the quantized linear prediction coefficient sequence is smoothed to determine an adjusted quantized linear prediction coefficient sequence; a signal from which the effect of the spectral envelope has been removed is determined by normalizing each value in a frequency domain signal series, which is determined by converting the input sound signal to the frequency domain, using each value in a power spectral envelope series, which is a series in the frequency domain corresponding to the adjusted quantized linear prediction coefficients; and the determined signal is coded by variable length encoding taking into account spectral envelope information.
As described, linear prediction coefficients determined through linear prediction analysis of the input sound signal are employed in common in the frequency domain and time domain encoding methods.
Linear prediction coefficients are converted into a sequence of frequency domain parameters equivalent to the linear prediction coefficients, such as LSP (Line Spectrum Pair) parameters or ISP (Immittance Spectrum Pairs) parameters. Then, LSP codes (or ISP codes) generated by encoding the LSP parameter sequence (or ISP parameter sequence) are transmitted to a decoding apparatus. The frequencies from 0 to π of LSP parameters used in quantization or interpolation are sometimes specifically referred to as LSP frequencies (LSF), or as ISP frequencies (ISF) in the case of ISP parameters; however, such frequency parameters are referred to as LSP parameters or ISP parameters in the description of the present application.
Referring to FIGS. 1 and 2, processing performed by a conventional encoding apparatus will be described more specifically.
In the following description, an LSP parameter sequence consisting of p LSP parameters will be represented as θ[1], θ[2], . . . , θ[p]. “p” represents the order of prediction which is an integer equal to or greater than 1. The symbol in brackets ([ ]) represents index. For example, θ[i] indicates the ith LSP parameter in an LSP parameter sequence θ[1], θ[2], . . . , θ[p].
A symbol written in the upper right of θ in brackets indicates frame number. For example, an LSP parameter sequence generated for the sound signals in the fth frame is represented as θ[f][1], θ[f][2], . . . , θ[f][p]. However, since most processing is conducted within a frame in a closed manner, indication of the upper right frame number is omitted for parameters that correspond to the current frame (the fth frame). Omission of a frame number is intended to mean parameters generated for the current frame. That is, θ[i]=θ[f][i] holds.
A symbol written in the upper right without brackets represents exponentiation. That is, θ^k[i] means the kth power of θ[i].
Although symbols used in the text such as “{tilde over ( )}”, “{circumflex over ( )}”, and “ ” should be originally indicated immediately above the following letter, they are indicated immediately before the corresponding letter due to limitations in text denotation. In mathematical expressions, such symbols are indicated at the appropriate position, namely immediately above the corresponding letter.
At step S100, a speech sound digital signal (hereinafter referred to as input sound signal) in the time domain per frame, which defines a predetermined time segment, is input to a conventional encoding apparatus 9. The encoding apparatus 9 performs processing in the processing units described below on the input sound signal on a per-frame basis.
A per-frame input sound signal is input to a linear prediction analysis unit 105, a feature amount extracting unit 120, a frequency domain encoding unit 150, and a time domain encoding unit 170.
At step S105, the linear prediction analysis unit 105 performs linear prediction analysis on the per-frame input sound signal to determine a linear prediction coefficient sequence a[1], a[2], . . . , a[p], and outputs it. Here, a[i] is a linear prediction coefficient of the ith order. Each a[i] (i=1, 2, . . . , p) in the linear prediction coefficient sequence is the coefficient obtained when the input sound signal is modeled with the linear prediction model represented by Formula (1):

$$A(z) = 1 + \sum_{i=1}^{p} a[i]\, z^{-i} \tag{1}$$
The linear prediction coefficient sequence a[1], a[2], . . . , a[p] output by the linear prediction analysis unit 105 is input to an LSP generating unit 110.
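As a rough illustration of step S105, the following sketch derives a[1], . . . , a[p] under the sign convention of Formula (1) by the autocorrelation method with the Levinson-Durbin recursion. The text does not state which analysis method the linear prediction analysis unit 105 uses, so this choice, the absence of windowing, and the function names `autocorrelation` and `levinson_durbin` are assumptions made only for illustration.

```python
import numpy as np

def autocorrelation(x, p):
    """Autocorrelation r[0..p] of one frame x (no windowing; an assumption)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Return (a[1..p], residual energy) for A(z) = 1 + sum_i a[i] z^-i.

    Assumes r[0] > 0 (a non-silent frame).
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        # Correlation between the current predictor and lag m.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k                  # updated prediction residual energy
    return a[1:], err
```

For a frame `x`, a call such as `levinson_durbin(autocorrelation(x, p), p)` would then yield the coefficient sequence and a residual energy value that could play the role of σ^2 in the envelope formulas.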
At step S110, the LSP generating unit 110 determines and outputs a series of LSP parameters, θ[1], θ[2], . . . , θ[p], corresponding to the linear prediction coefficient sequence a[1], a[2], . . . , a[p] output from the linear prediction analysis unit 105. In the following description, the series of LSP parameters, θ[1], θ[2], . . . , θ[p], will be referred to as an LSP parameter sequence. The LSP parameter sequence θ[1], θ[2], . . . , θ[p] is a series of parameters that are defined as the roots of the sum polynomial defined by Formula (2) and the difference polynomial defined by Formula (3).

$$F_1(z) = A(z) + z^{-(p+1)} A(z^{-1}) \tag{2}$$

$$F_2(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{3}$$

The LSP parameter sequence θ[1], θ[2], . . . , θ[p] is a series in which values are arranged in ascending order. That is, it satisfies

0 < θ[1] < θ[2] < . . . < θ[p] < π.
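The definition by Formulas (2) and (3) can be sketched numerically as follows. This is a minimal illustration that finds the roots with `numpy.roots` and keeps their angles in (0, π); it assumes that A(z) is minimum phase, so that the roots lie on the unit circle. The function name `lpc_to_lsp` and the tolerance `eps` are hypothetical, and the text does not say how the LSP generating unit 110 actually finds the roots.

```python
import numpy as np

def lpc_to_lsp(a, eps=1e-6):
    """LSP parameters (ascending, in radians) from a = [a[1], ..., a[p]]."""
    a = np.asarray(a, dtype=float)
    # Coefficients of A(z) in powers z^0, z^-1, ..., z^-(p+1).
    A = np.concatenate(([1.0], a, [0.0]))
    A_rev = A[::-1]                 # coefficients of z^-(p+1) A(z^-1)
    f1 = A + A_rev                  # sum polynomial, Formula (2)
    f2 = A - A_rev                  # difference polynomial, Formula (3)
    roots = np.concatenate((np.roots(f1), np.roots(f2)))
    angles = np.angle(roots)
    # Keep one angle per conjugate pair and drop the trivial roots at z = +1, -1.
    keep = (angles > eps) & (angles < np.pi - eps)
    return np.sort(angles[keep])
```

For example, `lpc_to_lsp([-1.2, 0.5])` gives two angles whose cosines are 0.85 and 0.35, the smaller angle coming from the sum polynomial and the larger from the difference polynomial.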
The LSP parameter sequence θ[1], θ[2], . . . , θ[p] output by the LSP generating unit 110 is input to an LSP encoding unit 115.
At step S115, the LSP encoding unit 115 encodes the LSP parameter sequence θ[1], θ[2], . . . , θ[p] output by the LSP generating unit 110, determines LSP code C1 and a quantized LSP parameter series {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p] corresponding to the LSP code C1, and outputs them. In the following description, the quantized LSP parameter series {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p] will be referred to as a quantized LSP parameter sequence.
The quantized LSP parameter sequence {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p] output by the LSP encoding unit 115 is input to a quantized linear prediction coefficient generating unit 900, a delay input unit 165, and a time domain encoding unit 170. The LSP code C1 output by the LSP encoding unit 115 is input to an output unit 175.
At step S120, the feature amount extracting unit 120 extracts the magnitude of the temporal variation in the input sound signal as the feature amount. When the extracted feature amount is smaller than a predetermined threshold (i.e., when the temporal variation in the input sound signal is small), the feature amount extracting unit 120 implements control so that the quantized linear prediction coefficient generating unit 900 will perform the subsequent processing. At the same time, the feature amount extracting unit 120 inputs information indicating the frequency domain encoding method to the output unit 175 as identification code Cg. Meanwhile, when the extracted feature amount is equal to or greater than the predetermined threshold (i.e., when the temporal variation in the input sound signal is large), the feature amount extracting unit 120 implements control so that the time domain encoding unit 170 will perform the subsequent processing. At the same time, the feature amount extracting unit 120 inputs information indicating the time domain encoding method to the output unit 175 as identification code Cg.
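The selection logic of step S120 can be sketched as below. The text does not define how the magnitude of the temporal variation is measured, so `temporal_variation` (the absolute log-energy difference between the two halves of a frame) is a purely hypothetical feature, and the string labels stand in for the identification code Cg.

```python
import numpy as np

def temporal_variation(frame):
    """Hypothetical feature: |log energy ratio| between the two frame halves."""
    frame = np.asarray(frame, dtype=float)
    half = len(frame) // 2
    e1 = np.sum(frame[:half] ** 2) + 1e-12   # small floor avoids log(0)
    e2 = np.sum(frame[half:] ** 2) + 1e-12
    return abs(np.log(e2 / e1))

def select_encoding_method(feature, threshold):
    """Frequency domain when the feature is below the threshold, else time domain."""
    return "frequency_domain" if feature < threshold else "time_domain"
```

Note that a feature exactly equal to the threshold selects the time domain encoding method, matching the "equal to or greater than" condition in the text.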
Processes in the quantized linear prediction coefficient generating unit 900, a quantized linear prediction coefficient adjusting unit 905, an approximate smoothed power spectral envelope series calculating unit 910, and the frequency domain encoding unit 150 are executed when the feature amount extracted by the feature amount extracting unit 120 is smaller than the predetermined threshold (i.e., when the temporal variation in the input sound signal is small) (step S121).
At step S900, the quantized linear prediction coefficient generating unit 900 determines a series of linear prediction coefficients, {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p], from the quantized LSP parameter sequence {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p] output by the LSP encoding unit 115, and outputs it. In the following description, the linear prediction coefficient series {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p] will be referred to as a quantized linear prediction coefficient sequence.
The quantized linear prediction coefficient sequence {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p] output by the quantized linear prediction coefficient generating unit 900 is input to the quantized linear prediction coefficient adjusting unit 905.
At step S905, the quantized linear prediction coefficient adjusting unit 905 determines and outputs a series {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p of the values {circumflex over ( )}a[i]×(γR)^i, each of which is the product of the ith-order coefficient {circumflex over ( )}a[i] (i=1, . . . , p) in the quantized linear prediction coefficient sequence {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p] output by the quantized linear prediction coefficient generating unit 900 and the ith power of adjustment factor γR. Here, the adjustment factor γR is a predetermined positive constant equal to or smaller than 1. In the following description, the series {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p will be referred to as an adjusted quantized linear prediction coefficient sequence.
The adjusted quantized linear prediction coefficient sequence {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p output by the quantized linear prediction coefficient adjusting unit 905 is input to the approximate smoothed power spectral envelope series calculating unit 910.
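The adjustment of step S905 amounts to scaling the ith coefficient by the ith power of the adjustment factor. A minimal sketch follows; the function name `adjust_lpc` and the argument names are hypothetical.

```python
import numpy as np

def adjust_lpc(a_hat, gamma_r):
    """Multiply the ith coefficient (i = 1..p) by gamma_r ** i."""
    a_hat = np.asarray(a_hat, dtype=float)
    i = np.arange(1, len(a_hat) + 1)
    return a_hat * gamma_r ** i
```

With `gamma_r` equal to 1 the sequence is returned unchanged, and smaller values attenuate higher-order coefficients more strongly.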
At step S910, using each coefficient {circumflex over ( )}a[i]×(γR)^i in the adjusted quantized linear prediction coefficient sequence {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p output by the quantized linear prediction coefficient adjusting unit 905, the approximate smoothed power spectral envelope series calculating unit 910 generates an approximate smoothed power spectral envelope series {tilde over ( )}WγR[1], {tilde over ( )}WγR[2], . . . , {tilde over ( )}WγR[N] by Formula (4) and outputs it. Here, exp(⋅) is an exponential function whose base is Napier's constant, j is the imaginary unit, and σ^2 is prediction residual energy.

$$\tilde{W}_{\gamma R}[n] = \frac{\sigma^2}{2\pi \left| 1 + \sum_{i=1}^{p} \hat{a}[i] \cdot (\gamma R)^{i} \cdot \exp(-ijn) \right|^{2}} \tag{4}$$

As defined by Formula (4), the approximate smoothed power spectral envelope series {tilde over ( )}WγR[1], {tilde over ( )}WγR[2], . . . , {tilde over ( )}WγR[N] is a frequency-domain series corresponding to the adjusted quantized linear prediction coefficient sequence {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p.
The approximate smoothed power spectral envelope series {tilde over ( )}WγR[1], {tilde over ( )}WγR[2], . . . , {tilde over ( )}WγR[N] output by the approximate smoothed power spectral envelope series calculating unit 910 is input to the frequency domain encoding unit 150.
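Formula (4) can be evaluated directly, as in the following sketch. The formula writes the phase term as exp(−ijn) without stating how the index n maps to an angular frequency, so the function takes the frequency points `omega` as an explicit argument; passing `np.arange(1, N + 1)` reproduces the formula literally, while a grid such as `np.pi * np.arange(1, N + 1) / N` would be an assumption. The function and argument names are hypothetical.

```python
import numpy as np

def approx_smoothed_envelope(a_hat, gamma_r, sigma2, omega):
    """Evaluate Formula (4) at each angular frequency in omega."""
    a_hat = np.asarray(a_hat, dtype=float)
    omega = np.asarray(omega, dtype=float)
    i = np.arange(1, len(a_hat) + 1)
    adjusted = a_hat * gamma_r ** i                       # a_hat[i] * (gamma R)^i
    # 1 + sum_i adjusted[i] * exp(-j * i * omega), one value per frequency.
    response = 1.0 + np.exp(-1j * np.outer(omega, i)) @ adjusted
    return sigma2 / (2.0 * np.pi * np.abs(response) ** 2)
```

Evaluating the same coefficients with a smaller `gamma_r` gives a series with a smaller ratio between its largest and smallest values, which is the smoothing referred to in the text.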
In the following, the reason why a series of values defined by Formula (4) is called an approximate smoothed power spectral envelope series will be explained.
With a pth-order autoregressive process which is an all-pole model, input sound signal x[t] at time t is represented by Formula (5) with its own values in the past back to time p, i.e., x[t−1], . . . , x[t−p], a prediction residual e[t], and linear prediction coefficients a[1], a[2], . . . , a[p]. Then, each coefficient W[n] (n=1, . . . , N) in a power spectral envelope series W[1], W[2], . . . , W[N] of the input sound signal is represented by Formula (6):
$$x[t] + a[1]\,x[t-1] + \cdots + a[p]\,x[t-p] = e[t] \tag{5}$$

$$W[n] = \frac{\sigma^2}{2\pi} \cdot \frac{1}{\left| 1 + \sum_{i=1}^{p} a[i] \cdot \exp(-ijn) \right|^{2}} \tag{6}$$

Here, a series WγR[1], WγR[2], . . . , WγR[N] defined by

$$W_{\gamma R}[n] = \frac{\sigma^2}{2\pi \left| 1 + \sum_{i=1}^{p} a[i]\,(\gamma R)^{i} \cdot \exp(-ijn) \right|^{2}} \tag{7}$$

in which a[i] in Formula (6) is replaced with a[i]×(γR)^i, is equivalent to the power spectral envelope series W[1], W[2], . . . , W[N] of the input sound signal defined by Formula (6) but with the waves of the amplitude smoothed.
In other words, processing for adjusting a linear prediction coefficient by multiplying linear prediction coefficient a[i] by the ith power of the adjustment factor γR is equivalent to processing that flattens the waves of the amplitude of the power spectral envelope in the frequency domain (processing for smoothing the power spectral envelope). Accordingly, the series WγR[1], WγR[2], . . . , WγR[N] defined by Formula (7) is called a smoothed power spectral envelope series.
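One way to see why this multiplication smooths the envelope is to rewrite the adjusted polynomial, as in the short derivation below. The symbol A_{γR}(z) for the adjusted polynomial and z_k for a zero of A(z) are notation introduced here only for illustration.

```latex
A_{\gamma R}(z)
  = 1 + \sum_{i=1}^{p} a[i]\,(\gamma R)^{i}\, z^{-i}
  = 1 + \sum_{i=1}^{p} a[i] \left( \frac{z}{\gamma R} \right)^{-i}
  = A\!\left( \frac{z}{\gamma R} \right)
% Hence, if A(z_k) = 0, then A_{\gamma R}(\gamma R \, z_k) = A(z_k) = 0.
```

Each zero z_k of A(z) therefore moves to γR·z_k. When γR is smaller than 1, the zeros move toward the origin and away from the unit circle, so the peaks of the envelope, which is proportional to the reciprocal of the squared magnitude of the polynomial, become lower and broader; when γR is 1, the envelope is unchanged.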
The series {tilde over ( )}WγR[1], {tilde over ( )}WγR[2], . . . , {tilde over ( )}WγR[N] defined by Formula (4) is equivalent to a series of approximations of the individual values in the smoothed power spectral envelope series WγR[1], WγR[2], . . . , WγR[N] defined by Formula (7). Accordingly, the series {tilde over ( )}WγR[1], {tilde over ( )}WγR[2], . . . , {tilde over ( )}WγR[N] defined by Formula (4) is called an approximate smoothed power spectral envelope series.
At step S150, the frequency domain encoding unit 150 normalizes each value X[n] (n=1, . . . , N) in a frequency domain signal sequence X[1], X[2], . . . , X[N], generated by converting the input sound signal into the frequency domain, with the square root of each value {tilde over ( )}WγR[n] in the approximate smoothed power spectral envelope series, thereby determining a normalized frequency domain signal sequence XN[1], XN[2], . . . , XN[N]. That is to say, XN[n]=X[n]/sqrt({tilde over ( )}WγR[n]) holds. Here, sqrt(y) represents the square root of y. The frequency domain encoding unit 150 then encodes the normalized frequency domain signal sequence XN[1], XN[2], . . . , XN[N] by variable length encoding to generate frequency domain signal codes.
The frequency domain signal codes output by the frequency domain encoding unit 150 are input to the output unit 175.
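The normalization in step S150 is an element-wise division by the square root of the envelope, which the following sketch illustrates. The function name `normalize_spectrum` is hypothetical, and the subsequent variable length encoding is not shown because the text does not specify it.

```python
import numpy as np

def normalize_spectrum(X, W_tilde):
    """XN[n] = X[n] / sqrt(W_tilde[n]) for every frequency index n."""
    X = np.asarray(X, dtype=float)
    W_tilde = np.asarray(W_tilde, dtype=float)
    return X / np.sqrt(W_tilde)
```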
Processes in the delay input unit 165 and the time domain encoding unit 170 are executed when the feature amount extracted by the feature amount extracting unit 120 is equal to or greater than the predetermined threshold (i.e., when the temporal variation in the input sound signal is large) (step S121).
At step S165, the delay input unit 165 holds the input quantized LSP parameter sequence {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p], and outputs it to the time domain encoding unit 170 with a delay equivalent to the duration of one frame. For example, if the current frame is the fth frame, the quantized LSP parameter sequence for the (f−1)th frame, {circumflex over ( )}θ[f−1][1], {circumflex over ( )}θ[f−1][2], . . . , {circumflex over ( )}θ[f−1][p], is output to the time domain encoding unit 170.
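The behavior of the delay input unit 165 can be pictured as a one-element buffer, as in the sketch below. The class name `OneFrameDelay` and the `initial` value returned for the very first frame are assumptions; the text does not say what is output before any preceding frame exists.

```python
class OneFrameDelay:
    """Holds one quantized LSP sequence and releases it one frame later."""

    def __init__(self, initial=None):
        self._held = initial            # assumed start-up value (not in the text)

    def push(self, lsp_hat):
        """Store the current frame's sequence; return the preceding frame's."""
        previous = self._held
        self._held = list(lsp_hat)
        return previous
```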
At step S170, the time domain encoding unit 170 carries out encoding by determining a synthesized signal by applying the synthesis filter to a signal generated by synthesis of the waveforms contained in the adaptive codebook and the waveforms contained in the fixed codebook, and determining the indices for the respective codebooks so that the distortion between the synthesized signal determined and the input sound signal is minimized. When determining the indices for the codebooks so that the distortion between the synthesized signal and the input sound signal is minimized, the codebook indices are determined so as to minimize the value given by applying an auditory weighting filter to a signal representing the difference of the synthesized signal from the input sound signal. The auditory weighting filter is a filter for determining distortion when selecting the adaptive codebook and/or the fixed codebook.
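The analysis-by-synthesis idea behind step S170 can be illustrated with a heavily simplified sketch: each candidate excitation is passed through the all-pole synthesis filter and the candidate with the smallest squared error is selected. This sketch uses a single toy codebook and omits the adaptive codebook, the gains, and the auditory weighting filter described in the text; it also starts the filter from a zero state. All names are hypothetical.

```python
import numpy as np

def synthesize(excitation, a_hat):
    """All-pole synthesis: y[t] = e[t] - sum_i a_hat[i] * y[t - i], zero initial state."""
    excitation = np.asarray(excitation, dtype=float)
    p = len(a_hat)
    y = np.zeros(len(excitation))
    for t in range(len(excitation)):
        acc = excitation[t]
        for i in range(1, p + 1):
            if t - i >= 0:
                acc -= a_hat[i - 1] * y[t - i]
        y[t] = acc
    return y

def search_codebook(target, codebook, a_hat):
    """Return (index, error) of the codebook entry minimizing the squared error."""
    target = np.asarray(target, dtype=float)
    best_index, best_error = -1, float("inf")
    for index, code in enumerate(codebook):
        error = float(np.sum((target - synthesize(code, a_hat)) ** 2))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error
```

In the scheme described in the text, the error would instead be measured after applying the auditory weighting filter to the difference signal.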
The filter coefficients of the synthesis filter and the auditory weighting filter are generated by use of the quantized LSP parameter sequence for the fth frame, {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p], and the quantized LSP parameter sequence for the (f−1)th frame, {circumflex over ( )}θ[f−1][1], {circumflex over ( )}θ[f−1][2], . . . , {circumflex over ( )}θ[f−1][p].
Specifically, a frame is first divided into two subframes, and the filter coefficients for the synthesis filter and the auditory weighting filter are determined as follows.
In the latter-half subframe, each coefficient {circumflex over ( )}a[i] in a quantized linear prediction coefficient sequence {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p], which is a coefficient sequence obtained by converting the quantized LSP parameter sequence for the fth frame, {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p], into linear prediction coefficients, is employed for the filter coefficient of the synthesis filter. For the filter coefficients of the auditory weighting filter, a series of values, {circumflex over ( )}a[1]×(γR), {circumflex over ( )}a[2]×(γR)^2, . . . , {circumflex over ( )}a[p]×(γR)^p, is employed which is determined by multiplying each coefficient {circumflex over ( )}a[i] in the quantized linear prediction coefficient sequence {circumflex over ( )}a[1], {circumflex over ( )}a[2], . . . , {circumflex over ( )}a[p] by the ith power of adjustment factor γR.
In the first-half subframe, each coefficient {tilde over ( )}a[i] in an interpolated quantized linear prediction coefficient sequence {tilde over ( )}a[1], {tilde over ( )}a[2], . . . , {tilde over ( )}a[p], which is a coefficient sequence obtained by converting an interpolated quantized LSP parameter sequence {tilde over ( )}θ[1], {tilde over ( )}θ[2], . . . , {tilde over ( )}θ[p] into linear prediction coefficients, is employed for the filter coefficient of the synthesis filter. The interpolated quantized LSP parameter sequence {tilde over ( )}θ[1], {tilde over ( )}θ[2], . . . , {tilde over ( )}θ[p] is a series of intermediate values between each value {circumflex over ( )}θ[i] in the quantized LSP parameter sequence for the fth frame, {circumflex over ( )}θ[1], {circumflex over ( )}θ[2], . . . , {circumflex over ( )}θ[p], and each value {circumflex over ( )}θ[f−1][i] in the quantized LSP parameter sequence for the (f−1)th frame, {circumflex over ( )}θ[f−1][1], {circumflex over ( )}θ[f−1][2], . . . , {circumflex over ( )}θ[f−1][p], namely a series of values obtained by interpolating between the values {circumflex over ( )}θ[i] and {circumflex over ( )}θ[f−1][i]. For the filter coefficients of the auditory weighting filter, a series of values, {tilde over ( )}a[1]×(γR), {tilde over ( )}a[2]×(γR)^2, . . . , {tilde over ( )}a[p]×(γR)^p, is employed which is determined by multiplying each coefficient {tilde over ( )}a[i] in the interpolated quantized linear prediction coefficient sequence {tilde over ( )}a[1], {tilde over ( )}a[2], . . . , {tilde over ( )}a[p] by the ith power of the adjustment factor γR.
This has the effect of smoothing the transition between a decoded sound signal and the decoded sound signal for the preceding frame generated in the decoding apparatus. Note that the adjustment factor γR used in the time domain encoding unit 170 is the same as the adjustment factor γR used in the approximate smoothed power spectral envelope series calculating unit 910.
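The per-subframe derivation of filter coefficients can be sketched as follows. The conversion `lsp_to_lpc` rebuilds the sum and difference polynomials of Formulas (2) and (3) from the LSP parameters and averages them to recover A(z). The interpolation weight `mix=0.5` (a simple midpoint between the two frames) is an assumption, since the text only says that intermediate values are used; all function names are hypothetical.

```python
import numpy as np

def lsp_to_lpc(lsp):
    """Recover [a[1], ..., a[p]] from ascending LSP parameters (radians)."""
    lsp = np.asarray(lsp, dtype=float)
    p = len(lsp)
    f1 = np.array([1.0])
    f2 = np.array([1.0])
    for k, theta in enumerate(lsp):
        quad = np.array([1.0, -2.0 * np.cos(theta), 1.0])
        # 1st, 3rd, ... parameters belong to the sum polynomial, the rest to the difference.
        if k % 2 == 0:
            f1 = np.convolve(f1, quad)
        else:
            f2 = np.convolve(f2, quad)
    if p % 2 == 0:
        f1 = np.convolve(f1, [1.0, 1.0])        # trivial root at z = -1
        f2 = np.convolve(f2, [1.0, -1.0])       # trivial root at z = +1
    else:
        f2 = np.convolve(f2, [1.0, 0.0, -1.0])  # trivial roots at z = +1 and -1
    A = 0.5 * (f1 + f2)
    return A[1:p + 1]

def weight(a, gamma_r):
    """Multiply the ith coefficient by gamma_r ** i."""
    return a * gamma_r ** np.arange(1, len(a) + 1)

def subframe_coefficients(lsp_cur, lsp_prev, gamma_r, mix=0.5):
    """Synthesis and weighting filter coefficients for both subframes."""
    lsp_cur = np.asarray(lsp_cur, dtype=float)
    lsp_prev = np.asarray(lsp_prev, dtype=float)
    lsp_interp = mix * lsp_prev + (1.0 - mix) * lsp_cur   # assumed midpoint rule
    a_first = lsp_to_lpc(lsp_interp)
    a_latter = lsp_to_lpc(lsp_cur)
    return {
        "first_half": (a_first, weight(a_first, gamma_r)),
        "latter_half": (a_latter, weight(a_latter, gamma_r)),
    }
```

Interpolating in the LSP domain rather than directly on the linear prediction coefficients keeps the ascending order of the parameters, because any weighted average of two ascending sequences is itself ascending.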
At step S175, the encoding apparatus 9 transmits, by way of the output unit 175, the LSP code C1 output by the LSP encoding unit 115, the identification code Cg output by the feature amount extracting unit 120, and either the frequency domain signal codes output by the frequency domain encoding unit 150 or the time domain signal codes output by the time domain encoding unit 170, to the decoding apparatus.