1. Field of the Invention
The present invention relates to a speech encoding apparatus and a speech encoding method for compressing a digital speech signal to reduce its information quantity. The present invention also relates to a speech decoding apparatus and a speech decoding method for decoding speech code generated by the above speech encoding apparatus so as to generate a digital speech signal.
2. Description of Related Art
Many prior art speech encoding methods and speech decoding methods divide an input speech into spectral envelope information and excitation information, and encode each type of information in units of frames each having a predetermined length to generate speech code. The generated speech code is decoded into the spectral envelope information and the excitation information, which are then combined by use of a synthesis filter to obtain a decoded speech. The most representative speech encoding/decoding apparatuses to which the above speech encoding/decoding methods are applied are those using the Code-Excited Linear Prediction (CELP) system.
FIG. 13 is a schematic diagram showing the configuration of a conventional CELP-type speech encoding apparatus. In the figure, reference numeral 1 denotes a linear prediction analysis unit for analyzing an input speech and extracting linear prediction coefficients, which denote spectral envelope information of the input speech, while reference numeral 2 denotes a linear prediction coefficient encoding unit for encoding the linear prediction coefficients extracted by the linear prediction analysis unit 1 and outputting the resultant code to a multiplexing unit 6 as well as outputting quantized values of the linear prediction coefficients to an adaptive excitation encoding unit 3, a fixed excitation encoding unit 4, and a gain encoding unit 5.
Reference numeral 3 denotes the adaptive excitation encoding unit for generating a tentative synthesized speech by use of the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2 as well as selecting adaptive excitation code with which the distance between the tentative synthesized speech and the input speech is minimized and outputting the thus selected adaptive excitation code to the multiplexing unit 6. The adaptive excitation encoding unit 3 also outputs to the gain encoding unit 5 an adaptive excitation signal (a time-series vector obtained as a result of repeating a past excitation signal having a given length) corresponding to the adaptive excitation code. Reference numeral 4 denotes the fixed excitation encoding unit for generating a tentative synthesized speech by use of the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2 as well as selecting fixed excitation code with which the distance between the tentative synthesized speech and a signal to be encoded (a signal obtained as a result of subtracting from the input speech the synthesized speech produced based on the adaptive excitation signal) is minimized and outputting the selected fixed excitation code to the multiplexing unit 6. The fixed excitation encoding unit 4 also outputs to the gain encoding unit 5 a fixed excitation signal which is a time-series vector corresponding to the fixed excitation code.
Reference numeral 5 denotes the gain encoding unit for multiplying both the adaptive excitation signal output from the adaptive excitation encoding unit 3 and the fixed excitation signal output from the fixed excitation encoding unit 4 by each element of a gain vector, and adding each respective pair of the multiplication results, so as to generate an excitation signal. The gain encoding unit 5 also generates a tentative synthesized speech from the above excitation signal by use of the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2, selects gain code with which the distance between the tentative synthesized speech and the input speech is minimized, and outputs the selected gain code to the multiplexing unit 6. Reference numeral 6 denotes the multiplexing unit for multiplexing the code of the linear prediction coefficients encoded by the linear prediction coefficient encoding unit 2, the adaptive excitation code output from the adaptive excitation encoding unit 3, the fixed excitation code output from the fixed excitation encoding unit 4, and the gain code output from the gain encoding unit 5 so as to produce speech code.
FIG. 14 is a schematic diagram showing the internal configuration of the fixed excitation encoding unit 4. In the figure, reference numeral 11 denotes a fixed excitation code book; 12 a synthesis filter; 13 a distortion calculating unit; and 14 a distortion evaluating unit.
FIG. 15 is a schematic diagram showing the configuration of a conventional CELP-type speech decoding apparatus. In the figure, reference numeral 21 denotes a separating unit for separating the speech code output from the speech encoding apparatus into the code of the linear prediction coefficients, the adaptive excitation code, the fixed excitation code, and the gain code, which are then supplied to a linear prediction coefficient decoding unit 22, an adaptive excitation decoding unit 23, a fixed excitation decoding unit 24, and a gain decoding unit 25, respectively. Reference numeral 22 denotes the linear prediction coefficient decoding unit for decoding the code of the linear prediction coefficients output from the separating unit 21 and outputting the decoded quantized values of the linear prediction coefficients to a synthesis filter 29.
Reference numeral 23 denotes the adaptive excitation decoding unit for outputting an adaptive excitation signal (a time-series vector obtained as a result of repeating a past excitation signal) corresponding to the adaptive excitation code output from the separating unit 21, while reference numeral 24 denotes the fixed excitation decoding unit for outputting a fixed excitation signal (a time-series vector) corresponding to the fixed excitation code output from the separating unit 21. Reference numeral 25 denotes the gain decoding unit for outputting a gain vector corresponding to the gain code output from the separating unit 21.
Reference numeral 26 denotes a multiplier for multiplying the adaptive excitation signal output from the adaptive excitation decoding unit 23 by an element of the gain vector output from the gain decoding unit 25, while reference numeral 27 denotes another multiplier for multiplying the fixed excitation signal output from the fixed excitation decoding unit 24 by another element of the gain vector output from the gain decoding unit 25. Reference numeral 28 denotes an adder for adding the multiplication result of the multiplier 26 and the multiplication result of the multiplier 27 together to generate an excitation signal. Reference numeral 29 denotes the synthesis filter for performing synthesis filtering processing on the excitation signal generated by the adder 28 so as to produce an output speech.
FIG. 16 is a schematic diagram showing the internal configuration of the fixed excitation decoding unit 24. In the figure, reference numeral 31 denotes a fixed excitation code book.
The operations of the speech encoding apparatus and the speech decoding apparatus will be described below.
The conventional speech encoding/decoding apparatuses perform processing in units of frames each having a time duration of approximately 5 to 50 ms.
Upon receiving a speech, the linear prediction analysis unit 1 in the speech encoding apparatus analyzes the input speech and extracts the linear prediction coefficients, which are spectral envelope information on the speech.
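The analysis method used by the linear prediction analysis unit 1 is not specified above. As a minimal sketch, assuming the widely used autocorrelation method with the Levinson-Durbin recursion, the extraction of linear prediction coefficients may be illustrated as follows. The function names and the sign convention A(z) = 1 + a1*z^-1 + ... + ap*z^-p are assumptions made for illustration only.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame of samples."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve for a[1..order] of A(z) = 1 + sum_k a[k] z^-k.

    Returns (coefficients, residual_energy). Assumed convention: the
    predicted sample is -sum_k a[k] * s[n - k].
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining terms at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Autocorrelation of an ideal first-order process with pole at 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

In this example the recursion recovers a single significant coefficient of -0.5, which corresponds to the pole at 0.5 under the assumed sign convention.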
After the linear prediction analysis unit 1 has extracted the linear prediction coefficients, the linear prediction coefficient encoding unit 2 encodes the linear prediction coefficients and outputs the code to the multiplexing unit 6. The linear prediction coefficient encoding unit 2 also outputs quantized values of the linear prediction coefficients to the adaptive excitation encoding unit 3, the fixed excitation encoding unit 4, and the gain encoding unit 5.
The adaptive excitation encoding unit 3 has a built-in adaptive excitation code book storing past excitation signals having a predetermined length, and generates a time-series vector which is obtained as a result of periodically repeating a past excitation signal, based on each internally-generated adaptive excitation code (indicated by a binary number having a few bits).
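The generation of a time-series vector by periodic repetition of a past excitation signal may be sketched as follows. It is assumed here, for illustration, that each adaptive excitation code corresponds to a repetition period (lag) expressed in samples; the function name and the sample values are hypothetical.

```python
def adaptive_vector(past_excitation, lag, frame_len):
    """Build one frame by periodically repeating the most recent `lag`
    samples of the stored past excitation signal."""
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the stored history length")
    segment = past_excitation[-lag:]
    return [segment[n % lag] for n in range(frame_len)]


# Hypothetical stored history; the last three samples are repeated.
history = [0.0, 0.0, 1.0, -0.5, 0.25]
vector = adaptive_vector(history, lag=3, frame_len=8)
# vector == [1.0, -0.5, 0.25, 1.0, -0.5, 0.25, 1.0, -0.5]
```

When the lag is shorter than the frame length, the segment is repeated as many times as needed to fill the frame, which is what gives the adaptive excitation signal its pitch periodicity.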
The adaptive excitation encoding unit 3 then multiplies each time-series vector by an appropriate gain value, and generates a tentative synthesized speech by passing the gain-multiplied time-series vector through the synthesis filter which uses the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2.
Furthermore, the adaptive excitation encoding unit 3 evaluates, for example, the distance between the tentative synthesized speech and the input speech to obtain the encoding distortion, and selects and outputs to the multiplexing unit 6 adaptive excitation code with which the distance is minimized as well as outputting to the gain encoding unit 5 a time-series vector corresponding to the selected adaptive excitation code as an adaptive excitation signal.
The adaptive excitation encoding unit 3 also outputs to the fixed excitation encoding unit 4 a signal obtained as a result of subtracting from the input speech a synthesized speech produced based on the adaptive excitation signal, as a signal to be encoded.
Next, the operation of the fixed excitation encoding unit 4 will be described.
The fixed excitation code book 11 included in the fixed excitation encoding unit 4 stores fixed code vectors which are noise-like time-series vectors, and sequentially outputs a time-series vector according to each fixed excitation code (indicated by a binary number having a few bits) output from the distortion evaluating unit 14. Each time-series vector is then multiplied by an appropriate gain value and input to the synthesis filter 12.
The synthesis filter 12 uses the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2 to generate a tentative synthesized speech for each gain-multiplied time-series vector.
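The synthesis filtering may be illustrated by the following minimal sketch of an all-pole filter. The sign convention (filter 1/A(z) with A(z) = 1 + a1*z^-1 + ... + ap*z^-p), the function name, and the coefficient values are assumptions made for illustration and are not taken from the description above.

```python
def synthesis_filter(excitation, lpc, state=None):
    """All-pole synthesis: s[n] = e[n] - sum_k lpc[k-1] * s[n-k].

    `state` holds the previous outputs, most recent first; it defaults
    to all zeros. Returns the synthesized samples.
    """
    order = len(lpc)
    memory = list(state) if state is not None else [0.0] * order
    output = []
    for e in excitation:
        s = e - sum(a * m for a, m in zip(lpc, memory))
        output.append(s)
        memory = ([s] + memory)[:order]  # shift the newest output in
    return output


# A unit pulse through a first-order filter with its pole at 0.5.
response = synthesis_filter([1.0, 0.0, 0.0, 0.0], lpc=[-0.5])
# response == [1.0, 0.5, 0.25, 0.125]
```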
The distortion calculating unit 13 calculates, for example, the distance between the tentative synthesized speech and the signal to be encoded output from the adaptive excitation encoding unit 3 to obtain the encoding distortion.
The distortion evaluating unit 14 selects and outputs to the multiplexing unit 6 fixed excitation code with which the distance between the tentative synthesized speech and the signal to be encoded calculated by the distortion calculating unit 13 is minimized as well as directing the fixed excitation code book 11 to output to the gain encoding unit 5 a time-series vector corresponding to the selected fixed excitation code as a fixed excitation signal.
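The search performed by the fixed excitation code book 11, the synthesis filter 12, the distortion calculating unit 13, and the distortion evaluating unit 14 may be sketched as follows. Two assumptions are made for illustration: the "appropriate gain value" is taken to be the least-squares optimal gain for each candidate, and the distance is taken to be the squared error. The code book contents and all names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis with zero initial state (assumed sign convention)."""
    memory = [0.0] * len(lpc)
    out = []
    for e in excitation:
        s = e - sum(a * m for a, m in zip(lpc, memory))
        out.append(s)
        memory = ([s] + memory)[:len(lpc)]
    return out


def search_fixed_codebook(codebook, target, lpc):
    """Return (code, distortion, gain) of the fixed code vector whose
    gain-scaled synthesized speech is closest to the signal to be encoded."""
    best = None
    for code, vector in enumerate(codebook):
        trial = synthesize(vector, lpc)
        energy = sum(y * y for y in trial)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match any target
        gain = sum(t * y for t, y in zip(target, trial)) / energy
        distortion = sum((t - gain * y) ** 2 for t, y in zip(target, trial))
        if best is None or distortion < best[1]:
            best = (code, distortion, gain)
    return best


codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
lpc = [-0.5]
# Signal to be encoded: the second code vector scaled by 2 and synthesized.
target = synthesize([0.0, 2.0, 0.0, 0.0], lpc)
code, distortion, gain = search_fixed_codebook(codebook, target, lpc)
```

In this example the search selects the second entry (code 1) with a gain of 2 and zero distortion, since the signal to be encoded was constructed from that entry.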
The gain encoding unit 5 has a built-in gain code book storing gain vectors, and sequentially reads a gain vector from the gain code book according to each internally-generated gain code (indicated by a binary number having a few bits).
The gain encoding unit 5 multiplies both the adaptive excitation signal output from the adaptive excitation encoding unit 3 and the fixed excitation signal output from the fixed excitation encoding unit 4 by each element of the gain vector, and adds each respective pair of the multiplication results together to generate an excitation signal.
The gain encoding unit 5 then generates a tentative synthesized speech by passing the excitation signal through a synthesis filter which uses the quantized values of the linear prediction coefficients output from the linear prediction coefficient encoding unit 2.
Furthermore, the gain encoding unit 5 evaluates the distance between the tentative synthesized speech and the input speech to obtain the encoding distortion, selects and outputs to the multiplexing unit 6 gain code with which the distance is minimized, and outputs to the adaptive excitation encoding unit 3 an excitation signal corresponding to the gain code. The adaptive excitation encoding unit 3 then uses the excitation signal, which was selected by the gain encoding unit 5 and corresponds to the gain code, to update its built-in adaptive excitation code book.
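The gain code selection may be sketched as follows, assuming that each gain vector has two elements (one for the adaptive excitation signal and one for the fixed excitation signal) and that the distance is the squared error. The gain code book contents and all names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis with zero initial state (assumed sign convention)."""
    memory = [0.0] * len(lpc)
    out = []
    for e in excitation:
        s = e - sum(a * m for a, m in zip(lpc, memory))
        out.append(s)
        memory = ([s] + memory)[:len(lpc)]
    return out


def search_gain_codebook(gain_codebook, adaptive_signal, fixed_signal, speech, lpc):
    """Return (gain_code, distortion, excitation) for the gain vector whose
    synthesized speech is closest to the input speech."""
    best = None
    for code, (g_adaptive, g_fixed) in enumerate(gain_codebook):
        # Weight the two excitation signals and add them sample by sample.
        excitation = [
            g_adaptive * a + g_fixed * f
            for a, f in zip(adaptive_signal, fixed_signal)
        ]
        trial = synthesize(excitation, lpc)
        distortion = sum((s - t) ** 2 for s, t in zip(speech, trial))
        if best is None or distortion < best[1]:
            best = (code, distortion, excitation)
    return best


gain_codebook = [(0.0, 1.0), (0.5, 2.0), (1.0, 0.5)]
adaptive_signal = [1.0, 0.0, 0.0, 0.0]
fixed_signal = [0.0, 1.0, 0.0, 0.0]
speech = synthesize([0.5, 2.0, 0.0, 0.0], [-0.5])
gain_code, distortion, excitation = search_gain_codebook(
    gain_codebook, adaptive_signal, fixed_signal, speech, [-0.5]
)
```

The excitation signal returned for the selected gain code is the signal that would be passed back to the adaptive excitation encoding unit 3 for updating its adaptive excitation code book.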
The multiplexing unit 6 multiplexes the code of the linear prediction coefficients encoded by the linear prediction coefficient encoding unit 2, the adaptive excitation code output from the adaptive excitation encoding unit 3, the fixed excitation code output from the fixed excitation encoding unit 4, and the gain code output from the gain encoding unit 5 to produce speech code as the multiplexed result.
Upon receiving the speech code output from the speech encoding apparatus, the separating unit 21 included in the speech decoding apparatus separates it into the code of the linear prediction coefficients, the adaptive excitation code, the fixed excitation code, and the gain code which are then output to the linear prediction coefficient decoding unit 22, the adaptive excitation decoding unit 23, the fixed excitation decoding unit 24, and the gain decoding unit 25, respectively.
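The multiplexing performed by the multiplexing unit 6 and the separation performed by the separating unit 21 may be illustrated by packing the four codes into a single bit field and unpacking them again. The bit widths and field names below are hypothetical; the description above does not specify a bit allocation.

```python
# Hypothetical bit allocation per frame, most significant field first.
FIELD_BITS = (("lpc", 10), ("adaptive", 7), ("fixed", 9), ("gain", 6))


def multiplex(codes):
    """Pack the four codes of one frame into a single integer speech code."""
    word = 0
    for name, bits in FIELD_BITS:
        value = codes[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} code does not fit in {bits} bits")
        word = (word << bits) | value
    return word


def separate(word):
    """Split a speech code back into its four component codes."""
    codes = {}
    for name, bits in reversed(FIELD_BITS):
        codes[name] = word & ((1 << bits) - 1)
        word >>= bits
    return codes


frame_codes = {"lpc": 513, "adaptive": 40, "fixed": 300, "gain": 17}
speech_code = multiplex(frame_codes)
recovered = separate(speech_code)
```

Because both sides share the same field layout, the separating unit recovers exactly the codes that were multiplexed.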
Upon receiving the code of the linear prediction coefficients from the separating unit 21, the linear prediction coefficient decoding unit 22 decodes the code and outputs the quantized values of the linear prediction coefficients to the synthesis filter 29 as the decoding result.
The adaptive excitation decoding unit 23 has the built-in adaptive excitation code book storing past excitation signals having a predetermined length, and outputs an adaptive excitation signal (a time-series vector obtained as a result of repeating a past excitation signal) corresponding to the adaptive excitation code output from the separating unit 21.
On the other hand, the fixed excitation code book 31 included in the fixed excitation decoding unit 24 stores fixed code vectors which are noise-like time-series vectors, and outputs, as a fixed excitation signal, the time-series vector corresponding to the fixed excitation code output from the separating unit 21.
The gain decoding unit 25 has a built-in gain code book storing gain vectors, and outputs a gain vector corresponding to the gain code output from the separating unit 21.
The multipliers 26 and 27 multiply the adaptive excitation signal output from the adaptive excitation decoding unit 23 and the fixed excitation signal output from the fixed excitation decoding unit 24, respectively, by the corresponding elements of the gain vector. The multiplication results from the multipliers 26 and 27 are added together by the adder 28.
The synthesis filter 29 performs synthesis filtering processing on the excitation signal obtained as the addition result by the adder 28 to produce an output speech. It should be noted that the synthesis filter 29 uses the quantized values of the linear prediction coefficients decoded by the linear prediction coefficient decoding unit 22 as its filter coefficients.
Lastly, the adaptive excitation decoding unit 23 updates its built-in adaptive excitation code book by use of the above excitation signal.
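The decoding of one frame by the multipliers 26 and 27, the adder 28, and the synthesis filter 29, followed by the update of the adaptive excitation code book, may be sketched as follows. The two-element gain vector, the filter sign convention, and all names and values are assumptions made for illustration.

```python
def decode_frame(adaptive_signal, fixed_signal, gain_vector, lpc, filter_memory):
    """Return (excitation, output_speech, new_filter_memory) for one frame."""
    g_adaptive, g_fixed = gain_vector
    # Multipliers 26 and 27 followed by adder 28.
    excitation = [
        g_adaptive * a + g_fixed * f
        for a, f in zip(adaptive_signal, fixed_signal)
    ]
    # Synthesis filter 29: s[n] = e[n] - sum_k lpc[k-1] * s[n-k].
    memory = list(filter_memory)
    speech = []
    for e in excitation:
        s = e - sum(c * m for c, m in zip(lpc, memory))
        speech.append(s)
        memory = ([s] + memory)[:len(lpc)]
    return excitation, speech, memory


def update_adaptive_codebook(history, excitation):
    """Append the new excitation and keep the stored length unchanged
    (the history is assumed to be non-empty)."""
    return (list(history) + list(excitation))[-len(history):]


excitation, speech, memory = decode_frame(
    adaptive_signal=[1.0, 0.0, 0.0, 0.0],
    fixed_signal=[0.0, 1.0, 0.0, 0.0],
    gain_vector=(0.5, 2.0),
    lpc=[-0.5],
    filter_memory=[0.0],
)
history = update_adaptive_codebook([9.0] * 6, excitation)
```

The filter memory is returned so that it can be carried over to the next frame, and the updated history contains the newest excitation samples at its end.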
Next, description will be made of conventional techniques for improving the above CELP-type speech encoding and speech decoding apparatuses.
The following two references propose methods for emphasizing the pitch property of an excitation signal for the purpose of obtaining high-quality speech even using a low bit rate.
Reference 1: Wang et al., “Improved excitation for phonetically-segmented VXC speech coding below 4 kb/s”, Proc. GLOBECOM '90, pp. 946–950
Reference 2: JP-A No. 8-44397 (1996)
Furthermore, the following reference describes a speech encoding system which employs a similar method.
Reference 3: 3GPP Technical Specification 3G TS 26.090
The ITU Recommendation G.729 also describes a speech encoding system using another similar method.
FIG. 17 is a schematic diagram showing the internal configuration of a fixed excitation encoding unit 4 which emphasizes the pitch property of an excitation signal. Since the components in the figure which are the same as or correspond to those in FIG. 14 are denoted by like numerals, their explanation will be omitted. It should be noted that the configuration of the encoding system is the same as that shown in FIG. 13 except for the configuration of the fixed excitation encoding unit 4.
In FIG. 17, reference numeral 15 denotes a periodicity providing unit for giving a pitch property to a fixed code vector.
FIG. 18 is a schematic diagram showing the internal configuration of a fixed excitation decoding unit 24 which emphasizes the pitch property of an excitation signal. Since the component in the figure which is the same as or corresponds to that in FIG. 16 is denoted by a like numeral, its explanation will be omitted. It should be noted that the configuration of the decoding system is the same as that shown in FIG. 15 except for the configuration of the fixed excitation decoding unit 24.
In FIG. 18, reference numeral 32 denotes a periodicity providing unit for giving a pitch property to a fixed code vector.
The operations of the speech encoding and speech decoding apparatuses will be described below.
It should be noted that since the apparatuses are the same as the above described CELP-type speech encoding and speech decoding apparatuses except that the fixed excitation encoding unit 4 and the fixed excitation decoding unit 24 include the periodicity providing unit 15 and the periodicity providing unit 32, respectively, only their difference will be described.
The periodicity providing unit 15 emphasizes the pitch periodicity of a time-series vector output from the fixed excitation code book 11 before outputting the time-series vector.
The periodicity providing unit 32 emphasizes the pitch periodicity of a time-series vector output from the fixed excitation code book 31 before outputting the time-series vector.
The periodicity providing units 15 and 32 use, for example, a comb filter to emphasize the pitch periodicity of a time-series vector.
The gain (periodicity emphasis coefficient) of the comb filter is set to a constant value in Reference 1, while the method employed in Reference 2 uses a long-term prediction gain of the speech signal in each frame to be encoded as a periodicity emphasis coefficient. The method proposed in Reference 3 uses a gain corresponding to an adaptive excitation signal encoded in each past frame.
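The periodicity emphasis by a comb filter may be illustrated by the following minimal sketch. A recursive comb filter y[n] = x[n] + beta * y[n - T] with zero initial state within the frame is assumed here, where beta stands for the periodicity emphasis coefficient and T for the pitch period in samples; the exact filter form used in each reference is not reproduced, and the names are hypothetical.

```python
def comb_filter(vector, pitch_period, beta):
    """Emphasize pitch periodicity: y[n] = x[n] + beta * y[n - pitch_period]."""
    if pitch_period <= 0:
        raise ValueError("pitch_period must be a positive number of samples")
    out = [float(x) for x in vector]
    for n in range(pitch_period, len(out)):
        out[n] += beta * out[n - pitch_period]
    return out


# A single pulse is repeated every 3 samples with decaying amplitude.
pulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
emphasized = comb_filter(pulse, pitch_period=3, beta=0.5)
# emphasized == [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25, 0.0]
```

A coefficient close to 1 yields strongly periodic repetitions, whereas a coefficient close to 0 leaves the fixed code vector almost unchanged.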
The conventional speech encoding and speech decoding apparatuses are configured as described above so that their periodicity emphasis coefficient for emphasizing the pitch periodicity is set to the same value over all fixed code vectors. Therefore, when this periodicity emphasis coefficient is set to an inappropriate value, all the fixed code vectors are adversely affected, which makes it impossible to obtain sufficient quality improvement through periodicity emphasis, or which may even cause quality deterioration.
For example, consider a case in which, even though a signal to be encoded indicates strong periodicity having a period of T, the periodicity emphasis coefficient is set such that the impulse response of the comb filter for giving periodicity to fixed code vectors indicates weak periodicity. In such a case, the weak periodicity emphasis is applied to all fixed code vectors, producing large encoding distortion and thereby causing quality deterioration for the strongly periodic signal to be encoded.
On the other hand, the periodicity emphasis coefficient may be set so as to give strong periodicity to fixed code vectors when the signal to be encoded indicates weak periodicity. Also in this case, large encoding distortion is generated and thereby quality deterioration occurs.
In speech encoding, increasing the frame length is effective in increasing the information compression ratio. In such a case, however, since the frame is long, a frame to be analyzed is likely to include unfavorable factors, such as a change in the pitch, which adversely affect proper calculation of the periodicity emphasis coefficient with the configuration proposed in Reference 2. Furthermore, the correlation between the gain of a past frame and an appropriate periodicity emphasis coefficient for a current frame is reduced with the configuration proposed in Reference 3. These events often cause the periodicity emphasis coefficient to be inappropriately set, worsening the problems described above.
Further, employing a plurality of fixed excitation code books which each store fixed code vectors of a different nature is also effective in increasing the information compression ratio in speech encoding. In this case, the appropriate periodicity emphasis coefficient is different from one fixed excitation code book to another, worsening the quality deterioration caused by the use of only a single periodicity emphasis coefficient.
For example, consider use of both a fixed excitation code book storing noise-like fixed code vectors and another fixed excitation code book storing non-noise-like (pulse-like) fixed code vectors, each of which contains only a small number of pulses per frame. In the case of noise-like fixed code vectors, if they are constantly given strong periodicity, the speech quality of the output speech is improved with respect to noise characteristics. In the case of non-noise-like fixed code vectors, on the other hand, if they are constantly given strong periodicity, the output speech assumes pulse-like speech quality when intrinsically nonperiodic, noise-like input speech is applied, leading to subjective quality degradation.
Further, consider use of a fixed excitation code book storing fixed code vectors whose power distribution is biased, for example, vectors in which only the first half of the frame includes signals and the second half does not include any signals (that is, includes only a zero signal). In such a case, unless these fixed code vectors are given strong periodicity, the encoding characteristics of the second half of the frame are considerably deteriorated, degrading the subjective quality in the portion whose distributed power is small.
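The effect of the periodicity emphasis coefficient on such a biased fixed code vector may be illustrated by comparing the energy that reaches the second half of the frame under weak and strong emphasis. The recursive comb filter form, the vector values, the pitch period of 4 samples, and the coefficients 0.1 and 0.9 are all assumptions chosen for illustration.

```python
def emphasize(vector, pitch_period, beta):
    """Recursive comb filter: y[n] = x[n] + beta * y[n - pitch_period]."""
    out = list(vector)
    for n in range(pitch_period, len(out)):
        out[n] += beta * out[n - pitch_period]
    return out


def half_energies(vector):
    """Return (first_half_energy, second_half_energy) of a frame."""
    mid = len(vector) // 2
    first = sum(v * v for v in vector[:mid])
    second = sum(v * v for v in vector[mid:])
    return first, second


# Hypothetical biased vector: only the first half of the frame is non-zero.
biased = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
weak = half_energies(emphasize(biased, pitch_period=4, beta=0.1))
strong = half_energies(emphasize(biased, pitch_period=4, beta=0.9))
```

With the weak coefficient the second half receives only about one percent of the energy of the first half, while the strong coefficient fills the second half with a nearly full-amplitude copy of the first half.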