Up to now, in various speech encoding methods and devices, an input speech is divided into spectrum envelope information and an excitation, which are each encoded on a frame basis to produce a speech code.
As the most representative speech encoding method and device, there is the code-excited linear prediction (CELP) system disclosed in Document 1 (ITU-T Recommendation G.729, “CODING OF SPEECH AT 8 kbit/s USING CONJUGATE-STRUCTURE ALGEBRAIC-CODE-EXCITED LINEAR-PREDICTION (CS-ACELP)”, March 1996), or the like.
FIG. 8 is a block diagram showing an overall structure of a conventional CELP system speech encoding device disclosed in Document 1.
Referring to the figure, reference numeral 1 denotes an input speech, reference numeral 2 is a linear prediction analyzing means, reference numeral 3 is a linear prediction coefficient encoding means, reference numeral 4 is an adaptive excitation encoding means, reference numeral 5 is a fixed excitation encoding portion, reference numeral 6 is a gain encoding means, reference numeral 7 is a multiplexing means, and reference numeral 8 is a speech code.
The conventional speech encoding device conducts processing on a frame-by-frame basis, with one frame being 10 ms. In encoding the excitation, processing is conducted for every sub-frame obtained by dividing one frame into two equal parts. For ease of description, the frame and the sub-frame are not particularly distinguished below and are both referred to simply as a “frame”.
Hereinafter, the operation of the conventional speech encoding device will be described. First, the input speech 1 is inputted to the linear prediction analyzing means 2, the adaptive excitation encoding means 4 and the gain encoding means 6, respectively. The linear prediction analyzing means 2 analyzes the input speech 1 and extracts a linear prediction coefficient, which is spectrum envelope information of the speech. The linear prediction coefficient encoding means 3 encodes the linear prediction coefficient, outputs the resulting code to the multiplexing means 7, and also outputs the quantized linear prediction coefficient for use in encoding the excitation.
The adaptive excitation encoding means 4 stores a past excitation (signal) of a given length as an adaptive excitation codebook, and generates a time series vector (adaptive excitation) that periodically repeats the past excitation in correspondence with each adaptive excitation code, which is indicated by a binary value of several bits generated internally. The time series vector is then passed through a synthesis filter that uses the quantized linear prediction coefficient outputted from the linear prediction coefficient encoding means 3, to obtain a temporal synthetic speech. The distortion between the input speech 1 and the signal obtained by multiplying the temporal synthetic speech by an appropriate gain is investigated, the adaptive excitation code that minimizes the distortion is selected and outputted to the multiplexing means 7, and simultaneously the time series vector corresponding to the selected adaptive excitation code is outputted as the adaptive excitation to the fixed excitation encoding portion 5 and the gain encoding means 6. Also, a signal obtained by subtracting from the input speech 1 the gain-scaled synthetic speech due to the adaptive excitation is outputted to the fixed excitation encoding portion 5 as a signal to be encoded.
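The adaptive excitation search described above can be sketched as follows. This is a simplified illustration only: the one-tap synthesis filter, the frame length and the lag range are hypothetical stand-ins for the actual G.729 processing, and no perceptual weighting is applied; per candidate lag, the optimal gain is computed in closed form as in the description above.

```python
def repeat_past_excitation(past, lag, frame_len):
    """Build an adaptive excitation by periodically repeating the last
    `lag` samples of the past excitation (simplified)."""
    segment = past[-lag:]
    return [segment[i % lag] for i in range(frame_len)]

def synthesize(excitation, a=0.5):
    """Toy one-tap synthesis filter: y[n] = x[n] + a*y[n-1]."""
    y, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        y.append(prev)
    return y

def search_adaptive_excitation(past, target, lag_range, a=0.5):
    """Pick the lag (adaptive excitation code) whose gain-scaled
    synthetic speech minimizes the squared error against the target."""
    best = None
    for lag in lag_range:
        v = repeat_past_excitation(past, lag, len(target))
        y = synthesize(v, a)
        yy = sum(t * t for t in y)
        if yy == 0.0:
            continue  # silent candidate, no gain can help
        g = sum(r * t for r, t in zip(target, y)) / yy  # optimal gain
        e = sum((r - g * t) ** 2 for r, t in zip(target, y))
        if best is None or e < best[0]:
            best = (e, lag, v)
    return best[1], best[2]
```

If the target was itself produced from the past excitation with period 5, the search recovers lag 5 with zero residual distortion.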
The fixed excitation encoding portion 5 first sequentially reads the time series vectors (fixed excitations) from the fixed excitation codebook stored internally, in correspondence with the respective fixed excitation codes indicated by binary values generated internally. Each time series vector is then passed through the synthesis filter using the quantized linear prediction coefficient outputted from the linear prediction coefficient encoding means 3, to obtain a temporal synthetic speech. The distortion between the signal to be encoded (the signal obtained by subtracting the synthetic speech due to the adaptive excitation from the input speech 1) and the signal obtained by multiplying the temporal synthetic speech by an appropriate gain is investigated, the fixed excitation code that minimizes the distortion is selected and outputted to the multiplexing means 7, and the time series vector corresponding to the selected fixed excitation code is outputted to the gain encoding means 6 as the fixed excitation.
The gain encoding means 6 first sequentially reads the gain vectors from the gain codebook stored therein, in accordance with each gain code indicated by a binary value generated internally. The components of each gain vector are multiplied by the adaptive excitation outputted from the adaptive excitation encoding means 4 and by the fixed excitation outputted from the fixed excitation encoding portion 5, respectively, and the products are added together to produce an excitation. The produced excitation is passed through a synthesis filter using the quantized linear prediction coefficient outputted from the linear prediction coefficient encoding means 3, to obtain a temporal synthetic speech. The distortion between the temporal synthetic speech and the input speech 1 is investigated, and the gain code that minimizes the distortion is selected and outputted to the multiplexing means 7. Also, the excitation corresponding to the selected gain code is outputted to the adaptive excitation encoding means 4.
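The gain codebook search of this step can be sketched in the same simplified setting. The two-component gain vectors, the codebook contents and the one-tap synthesis filter below are hypothetical; the sketch only illustrates scaling and summing the two excitations per gain code and keeping the code with minimum distortion.

```python
def synthesize(excitation, a=0.5):
    """Toy one-tap synthesis filter: y[n] = x[n] + a*y[n-1]."""
    y, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        y.append(prev)
    return y

def search_gain(adaptive, fixed, target, gain_codebook, a=0.5):
    """For each gain code, scale and sum the two excitations, synthesize,
    and keep the gain code whose synthetic speech minimizes the squared
    error against the input speech (no perceptual weighting)."""
    best = None
    for code, (ga, gf) in enumerate(gain_codebook):
        exc = [ga * p + gf * q for p, q in zip(adaptive, fixed)]
        y = synthesize(exc, a)
        e = sum((r - t) ** 2 for r, t in zip(target, y))
        if best is None or e < best[0]:
            best = (e, code, exc)
    return best[1], best[2]
```

The selected excitation (the second return value) is what would be fed back to update the adaptive excitation codebook.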
Finally, the adaptive excitation encoding means 4 updates the internal adaptive excitation codebook by using the excitation corresponding to the gain code which is produced by the gain encoding means 6.
The multiplexing means 7 multiplexes the code of the linear prediction coefficient outputted from the linear prediction coefficient encoding means 3, the adaptive excitation code outputted from the adaptive excitation encoding means 4, the fixed excitation code outputted from the fixed excitation encoding portion 5 and the gain code outputted from the gain encoding means 6 to output the obtained speech code 8.
FIG. 9 is a block diagram showing the detailed structure of the fixed excitation encoding portion 5 of the conventional CELP system speech encoding device disclosed in Document 1 or the like.
Referring to FIG. 9, reference numeral 9 denotes an adaptive excitation generating means, reference numerals 10 and 14 are synthesis filters, reference numeral 11 is a subtracting means, reference numeral 12 is a signal to be encoded, reference numeral 13 is a fixed excitation generating means, reference numeral 15 is a distortion calculating portion, reference numeral 20 is a searching means, reference numeral 21 is a fixed excitation code, and reference numeral 22 is a fixed excitation. The distortion calculating portion 15 is made up of a perceptual weighting filter 16, a perceptual weighting filter 17, a subtracting means 18 and a power calculating means 19. The adaptive excitation generating means 9, the synthesis filter 10 and the subtracting means 11 are included in the adaptive excitation encoding means 4, but are shown together here for ease of understanding.
First, the adaptive excitation generating means 9 within the adaptive excitation encoding means 4 outputs a time series vector corresponding to the above-mentioned adaptive excitation code to the synthesis filter 10 as the adaptive excitation.
The synthesis filter 10 within the adaptive excitation encoding means 4 sets the quantized linear prediction coefficient outputted from the linear prediction coefficient encoding means shown in FIG. 8 as a filter coefficient, and conducts synthesis filtering on the adaptive excitation outputted from the adaptive excitation generating means 9 to output the obtained synthetic speech to the subtracting means 11.
The subtracting means 11 within the adaptive excitation encoding means 4 determines a difference signal between the synthetic speech outputted from the synthesis filter 10 and the input speech 1 and outputs the obtained difference signal as the signal 12 to be encoded in the fixed excitation encoding portion 5.
On the other hand, the searching means 20 sequentially generates the respective fixed excitation codes indicated by the binary values, and outputs the fixed excitation codes to the fixed excitation generating means 13 in order.
The fixed excitation generating means 13 reads the time series vector from the fixed excitation codebook stored internally in accordance with the fixed excitation code outputted from the searching means 20, and outputs the time series vector to the synthesis filter 14 as the fixed excitation. The fixed excitation codebook may be a fixed excitation codebook that stores a noise vector prepared in advance, an algebraic excitation codebook that algebraically describes the time series vector by combination of a pulse position with a polarity, or the like. Also, there are fixed excitation codebooks which are of the addition type of two or more codebooks or which include a pitch cycling using the repetitive cycle of the adaptive excitation therein.
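As a concrete illustration of the algebraic excitation codebook mentioned above, the sketch below decodes a fixed excitation code into a single pulse position and polarity. This single-pulse scheme is a hypothetical simplification; a real algebraic codebook such as that of G.729 places several pulses on interleaved position tracks.

```python
def algebraic_fixed_excitation(code, frame_len):
    """Decode a fixed excitation code into a one-pulse time series vector.
    Lowest bit: polarity; remaining bits: pulse position (simplified)."""
    sign = 1.0 if code & 1 else -1.0
    pos = code >> 1
    if pos >= frame_len:
        raise ValueError("pulse position outside the frame")
    v = [0.0] * frame_len
    v[pos] = sign
    return v
```

Because the vector is described algebraically by (position, polarity) rather than stored, no codebook memory is needed and the code space can be enumerated directly during the search.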
The synthesis filter 14 sets the quantized linear prediction coefficient that is outputted from the linear prediction coefficient encoding means 3 as the filter coefficient, and conducts synthesis filtering on the fixed excitation outputted from the fixed excitation generating means 13 to output the obtained synthetic speech to the distortion calculating portion 15.
The perceptual weighting filter 16 within the distortion calculating portion 15 calculates a perceptual weighting filter coefficient on the basis of the quantized linear prediction coefficient that is outputted from the linear prediction coefficient encoding means 3, sets it as the filter coefficient, and filters the signal 12 to be encoded, which is outputted from the subtracting means 11 within the adaptive excitation encoding means 4, to output the obtained signal to the subtracting means 18.
The perceptual weighting filter 17 within the distortion calculating portion 15 sets the same filter coefficient as the perceptual weighting filter 16, and filters the synthetic speech outputted from the synthesis filter 14 to output the obtained signal to the subtracting means 18.
The subtracting means 18 within the distortion calculating portion 15 determines a difference signal between the signal outputted from the perceptual weighting filter 16 and a signal resulting from multiplying the signal outputted from the perceptual weighting filter 17 by an appropriate gain, and outputs the difference signal to the power calculating means 19.
The power calculating means 19 within the distortion calculating portion 15 obtains the total power of the difference signal outputted from the subtracting means 18, and outputs the total power to the searching means 20 as an evaluation value for search.
The searching means 20 searches for the fixed excitation code that minimizes the evaluation value for search outputted from the power calculating means 19 within the distortion calculating portion 15, and outputs that code as the fixed excitation code 21. Also, the fixed excitation generating means 13 outputs, as the fixed excitation 22, the fixed excitation generated when the fixed excitation code 21 is inputted.
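Putting the pieces of FIG. 9 together, the fixed excitation search loop can be sketched as follows. The one-tap synthesis filter, the first-difference "perceptual weighting" filter and the one-pulse codebook below are all hypothetical stand-ins for the actual G.729 filters and codebook; the structure of the loop (weight both signals identically, apply the optimal gain, minimize the power of the difference) matches the description above.

```python
def synthesize(x, a=0.5):
    """Toy one-tap synthesis filter: y[n] = x[n] + a*y[n-1]."""
    y, prev = [], 0.0
    for s in x:
        prev = s + a * prev
        y.append(prev)
    return y

def weight(x, g=0.9):
    """Toy perceptual weighting filter: w[n] = x[n] - g*x[n-1]."""
    return [s - g * p for s, p in zip(x, [0.0] + x[:-1])]

def search_fixed_excitation(target, codebook):
    """For each fixed excitation code, synthesize, weight both signals,
    apply the optimal gain, and keep the code with the minimum power of
    the weighted difference (the evaluation value for search)."""
    best = None
    wt = weight(target)
    for code, exc in enumerate(codebook):
        wy = weight(synthesize(exc))
        yy = sum(v * v for v in wy)
        if yy == 0.0:
            continue
        g = sum(r * v for r, v in zip(wt, wy)) / yy  # optimal gain
        e = sum((r - g * v) ** 2 for r, v in zip(wt, wy))
        if best is None or e < best[0]:
            best = (e, code)
    return best[1]
```

If the target is an exact scaled copy of one candidate's synthetic speech, that candidate wins with zero weighted distortion.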
The gain multiplied in the subtracting means 18 is uniquely determined by setting the partial derivative of the evaluation value for search with respect to the gain to zero. Various modifications of the internal structure of the actual distortion calculating portion 15 have been reported in order to reduce the amount of calculation.
Also, JP 7-271397 A discloses several methods of reducing the amount of calculation of the distortion calculating portion. Hereinafter, the method of the distortion calculating portion disclosed in JP 7-271397 A will be described.
Assuming that the synthetic speech obtained by passing the fixed excitation through the synthesis filter 14 is Yi and the input speech is R (corresponding to the signal 12 to be encoded in FIG. 9), the evaluation value for search, defined as a waveform-related distortion between the two signals, is represented by Expression (1).

E = |R − αYi|²  (1)
This coincides with the case in which the perceptual weighting filter is not introduced in the calculation of the evaluation value for search described with reference to FIG. 9. Here, α is the gain multiplied in the subtracting means 18. The value of α that sets the partial derivative of Expression (1) with respect to α to zero is found, namely α = (R, Yi)/|Yi|², and substituting this into Expression (1) yields Expression (2).
E = |R|² − (R, Yi)²/|Yi|²  (2)
Since a first term of Expression (2) is a constant that does not depend on the fixed excitation, minimizing the evaluation value for search E is equal to maximizing a second term of Expression (2). Therefore, there are many cases in which the second term of Expression (2) is used as the evaluation value for search as it is.
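The equivalence stated here, minimizing E of Expression (2) versus maximizing its second term, can be checked numerically with a small sketch; the target R and the candidate synthetic speeches below are arbitrary illustrative data, not from the document.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def full_distortion(r, y):
    """Expression (1) with the optimal gain substituted, i.e.
    Expression (2): |R|^2 - (R,Yi)^2 / |Yi|^2."""
    return dot(r, r) - dot(r, y) ** 2 / dot(y, y)

def second_term(r, y):
    """The term maximized instead of minimizing E: (R,Yi)^2 / |Yi|^2."""
    return dot(r, y) ** 2 / dot(y, y)

r = [1.0, 2.0, -1.0, 0.5]
candidates = [[1.0, 1.0, 0.0, 0.0],
              [0.5, 1.0, -0.5, 0.25],   # exactly 0.5 * r
              [0.0, 0.0, 1.0, 1.0]]

best_by_min = min(range(len(candidates)),
                  key=lambda i: full_distortion(r, candidates[i]))
best_by_max = max(range(len(candidates)),
                  key=lambda i: second_term(r, candidates[i]))
assert best_by_min == best_by_max  # the same fixed excitation is selected
```

Since the first term |R|² is the same for every candidate, the argmin over E and the argmax over the second term always coincide, which is why the second term alone is commonly used as the evaluation value for search.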
Because a large amount of calculation is required for computing the second term of Expression (2), in JP 7-271397 A a preliminary selection is first conducted using a simplified evaluation value for search, the second term of Expression (2) is then calculated only for the preliminarily selected fixed excitations, and a main selection is conducted, thereby reducing the amount of calculation.
Expressions (3) to (5) or the like are employed as the simplified evaluation value for search used in the preliminary selection.

E′ = (R, Yi)²  (3)
E′ = W(yi)(R, Yi)²  (4)
E′ = W(C, i)(R, Yi)²  (5)
It has been reported that, with yi being a fixed excitation, C being the fixed excitation group stored in the codebook, and the weight coefficient W defined by these factors being used in the evaluation value for search in the preliminary selection, the precision of the preliminary selection when using Expression (4) or Expression (5) is higher than when using Expression (3).
Comparing Expression (3), Expression (4) and Expression (5), which are the simplified evaluation values for search used at the time of the preliminary selection, with the second term of Expression (2), which is the evaluation value for search at the time of the main selection, the only differences are the multiplication by the weight coefficient based on the fixed excitation group C or the fixed excitation yi, and the division by the power of the synthetic speech Yi of the fixed excitation. Expression (3), Expression (4) and Expression (5) thus approximate the second term of Expression (2), and all of them evaluate the waveform-related distortion between the two signals indicated in Expression (1).
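The two-stage search of JP 7-271397 A can be sketched as follows. The codebook contents and the preselection size k are hypothetical. Stage 1 ranks all candidates by the cheap criterion E′ = (R, Yi)² of Expression (3); stage 2 evaluates the full criterion, the second term of Expression (2), only for the survivors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preselect_then_search(r, synth_candidates, k):
    """Stage 1 (preliminary selection): rank by E' = (R,Yi)^2, keep top k.
    Stage 2 (main selection): maximize (R,Yi)^2/|Yi|^2 over the survivors."""
    survivors = sorted(range(len(synth_candidates)),
                       key=lambda i: dot(r, synth_candidates[i]) ** 2,
                       reverse=True)[:k]
    return max(survivors,
               key=lambda i: dot(r, synth_candidates[i]) ** 2
                             / dot(synth_candidates[i], synth_candidates[i]))
```

Note that the cheap criterion ignores |Yi|², so a high-energy candidate can rank first in stage 1 yet lose in stage 2; this is exactly the approximation error that the weighted variants of Expressions (4) and (5) try to reduce.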
However, the above-mentioned conventional speech encoding method and device suffer from the problems stated below.
In the case where the information content allocated to the fixed excitation code is small, that is, when the number of fixed excitations is small, even if the fixed excitation code that minimizes the waveform distortion described with reference to Expressions (1) to (5) is selected, the decoded speech obtained by decoding the speech code including that fixed excitation code may be deteriorated in tone quality.
FIG. 10 is an explanatory diagram for explaining one case in which the tone quality is deteriorated. In FIG. 10, reference symbol (a) is a signal to be encoded, reference symbol (c) is a fixed excitation, and reference symbol (b) is a synthetic speech obtained by allowing the fixed excitation shown in (c) to pass through the synthesis filter. All of those signals are indicative of signals within a frame to be encoded. In this example, an algebraic excitation that algebraically expresses the pulse position and the polarity is used as the fixed excitation.
In the case of FIG. 10, the similarity between (a) and (b) is high in the second half of the frame, and (a) is expressed relatively well there. On the other hand, the amplitude of (b) becomes 0 in the first half of the frame, so (a) cannot be expressed there at all. In cases where the adaptive excitation cannot take a large gain, such as at a rising portion of the speech, a portion at which the encoding characteristic of part of the frame is extremely deteriorated often sounds like a local abnormal noise in the decoded speech.
That is, in the conventional method of selecting the fixed excitation code that minimizes the waveform-related distortion over the whole frame, the fixed excitation code is still selected even when, as shown in FIG. 10, a portion at which the encoding characteristic is extremely deteriorated exists in part of the frame, resulting in the problem that the quality of the decoded speech is deteriorated.
This problem is not eliminated even by using the simplified evaluation value for search as disclosed in JP 7-271397 A.
The present invention has been made to solve the above-mentioned problem, and therefore an object of the present invention is to provide a high-quality speech encoding method and device which hardly generate local abnormal noise in the decoded speech. Another object of the present invention is to provide such a high-quality speech encoding method and device while suppressing any increase in the amount of calculation to a minimum.