This invention relates to a voice code conversion method and apparatus for converting voice code obtained by encoding performed by a first voice encoding scheme to voice code of a second voice encoding scheme. More particularly, the invention relates to a voice code conversion method and apparatus for converting voice code, which has been obtained by encoding voice by a first voice encoding scheme used over the Internet or by a cellular telephone system, etc., to voice code of a second encoding scheme that is different from the first voice encoding scheme.
There has been an explosive increase in subscribers to cellular telephones in recent years and it is predicted that the number of such users will continue to grow in the future. Voice communication using the Internet (Voice over IP, or VoIP) is coming into increasingly greater use in intracorporate IP networks (intranets) and for the provision of long-distance telephone service. In voice communication systems such as cellular telephone systems and VoIP, use is made of voice encoding technology for compressing voice in order to utilize the communication channel effectively.
In the case of cellular telephones, the voice encoding technology used differs depending upon the country or system. With regard to cdma 2000 expected to be employed as the next-generation cellular telephone system, EVRC (Enhanced Variable-Rate Codec) has been adopted as a voice encoding scheme. With VoIP, on the other hand, a scheme compliant with ITU-T Recommendation G.729A is being used widely as the voice encoding method. An overview of G.729A and EVRC will be described first.
(1) Description of G.729A
Encoder Structure and Operation
FIG. 15 is a diagram illustrating the structure of an encoder compliant with ITU-T Recommendation G.729A. As shown in FIG. 15, input signals (speech signals) X of a predetermined number (=N) of samples per frame are input to an LPC (Linear Prediction Coefficient) analyzer 1 frame by frame. If the sampling frequency is 8 kHz and the length of a single frame is 10 ms, then one frame will be composed of 80 samples. The LPC analyzer 1 regards the human vocal tract as an all-pole filter represented by the following equation and obtains the filter coefficients αi (i=1, . . . , P), where P represents the order of the filter:
$$H(z) = \frac{1}{1 + \sum_{i=1}^{P} \alpha_i z^{-i}} \qquad (1)$$
Generally, in the case of voice in the telephone band, a value of 10 to 12 is used as P. The LPC analyzer 1 performs LPC analysis using 80 samples of the input signal, 40 pre-read samples and 120 past signal samples, for a total of 240 samples, and obtains the LPC coefficients.
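The LPC analysis step can be illustrated with a minimal sketch. It assumes the classic autocorrelation method solved by the Levinson-Durbin recursion, which the text above does not name, and it omits the windowing and bandwidth expansion that a real G.729A encoder applies. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 10
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 80 samples per frame
ANALYSIS_SAMPLES = 120 + FRAME_SAMPLES + 40          # past + present + pre-read


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (alpha_1..alpha_P, residual energy) for H(z) = 1 / (1 + sum alpha_i z^-i).

    Assumes r[0] > 0 (a non-silent analysis window).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err
```

With this sign convention, a signal generated by x[n] = 0.9·x[n−1] yields α1 close to −0.9, since the synthesis filter computes x[n] = e[n] − Σ αi·x[n−i].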
A parameter converter 2 converts the LPC coefficients to LSP (Line Spectrum Pair) parameters. An LSP parameter is a frequency-domain parameter that can be converted to and from LPC coefficients. Because LSP parameters have better quantization characteristics than LPC coefficients, quantization is performed in the LSP domain. An LSP quantizer 3 quantizes an LSP parameter obtained by the conversion and obtains an LSP code and an LSP dequantized value. An LSP interpolator 4 obtains an LSP interpolated value from the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame. More specifically, one frame is divided into two subframes, namely first and second subframes, of 5 ms each, and the LPC analyzer 1 determines the LPC coefficients of the second subframe but not of the first subframe. Using the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame, the LSP interpolator 4 predicts the LSP dequantized value of the first subframe by interpolation.
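The interpolation for the first subframe can be sketched as below. The text does not state the interpolation weights, so the equal weighting (0.5) used here is an assumption, and the function names are hypothetical.

```python
def interpolate_lsp(prev_lsp, cur_lsp, weight=0.5):
    """Blend the previous-frame and present-frame LSP dequantized values.

    weight is the share given to the present frame; 0.5 is an assumed value.
    """
    return [(1.0 - weight) * p + weight * c for p, c in zip(prev_lsp, cur_lsp)]


def subframe_lsps(prev_lsp, cur_lsp):
    """LSP sets for the first (interpolated) and second (dequantized) subframes."""
    return [interpolate_lsp(prev_lsp, cur_lsp), list(cur_lsp)]
```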
A parameter deconverter 5 converts the LSP dequantized value and the LSP interpolated value to LPC coefficients and sets these coefficients in an LPC synthesis filter 6. In this case, the LPC coefficients converted from the LSP interpolated values in the first subframe of the frame and the LPC coefficients converted from the LSP dequantized values in the second subframe are used as the filter coefficients of the LPC synthesis filter 6. In the description that follows, the “l” in items having an index attached to the “l”, e.g., lspi, li(n), . . . , is the letter “l” in the alphabet.
After LSP parameters lspi (i=1, . . . , P) are quantized by scalar quantization or vector quantization in the LSP quantizer 3, the quantization indices (LSP codes) are sent to the decoder side. FIG. 16 is a diagram useful in describing the quantization method. Here sets of large numbers of quantization LSP parameters have been stored in a quantization table 3a in correspondence with index numbers 1 to n. A distance calculation unit 3b calculates distance in accordance with the following equation:
$$d = \sum_{i=1}^{P} \{ lsp_q(i) - lsp_i \}^2$$
When q is varied from 1 to n, a minimum-distance index detector 3c finds the q for which the distance d is minimized and sends the index q to the decoder side as an LSP code.
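The table search performed by the distance calculation unit 3b and the minimum-distance index detector 3c can be sketched as follows. The table contents and function names are illustrative assumptions; only the squared-distance criterion and the 1-based index range come from the description.

```python
def lsp_distance(candidate, lsp):
    """Squared Euclidean distance d between a table entry and the LSP parameters."""
    return sum((c - v) ** 2 for c, v in zip(candidate, lsp))


def search_lsp_code(table, lsp):
    """Return (q, d) where q is the 1-based index that minimizes d."""
    best_q, best_d = None, float("inf")
    for q, candidate in enumerate(table, start=1):
        d = lsp_distance(candidate, lsp)
        if d < best_d:
            best_q, best_d = q, d
    return best_q, best_d
```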
Next, sound-source and gain search processing is executed. Sound source and gain are processed on a per-subframe basis. First, a sound-source signal is divided into a pitch-period component and a noise component, an adaptive codebook 7 storing a sequence of past sound-source signals is used to quantize the pitch-period component and an algebraic codebook or noise codebook is used to quantize the noise component. Described below will be voice encoding using the adaptive codebook 7 and an algebraic codebook 8 as sound-source codebooks.
The adaptive codebook 7 is adapted to output N samples of sound-source signals (referred to as “periodicity signals”), which are delayed successively by one sample, in association with indices 1 to L. FIG. 17 is a diagram showing the structure of the adaptive codebook 7 in the case of a subframe of 40 samples (N=40). The adaptive codebook is constituted by a buffer BF for storing the pitch-period component of the latest (L+39) samples. A periodicity signal comprising 1 to 40 samples is specified by index 1, a periodicity signal comprising 2 to 41 samples is specified by index 2, . . . , and a periodicity signal comprising L to L+39 samples is specified by index L. In the initial state, the content of the adaptive codebook 7 is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe so that the sound-source signal obtained in the present frame will be stored in the adaptive codebook 7.
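The buffer behaviour described above can be sketched as a small class. It assumes that sample position 1 is the oldest sample and that new sound-source samples are appended at the end; the text does not fix the orientation, and the class name is hypothetical.

```python
class AdaptiveCodebook:
    """Buffer of (L + N - 1) past sound-source samples addressed by index 1..L."""

    def __init__(self, num_indices, subframe_len=40):
        self.n = subframe_len
        # Initial state: all amplitudes are zero.
        self.buf = [0.0] * (num_indices + subframe_len - 1)

    def vector(self, index):
        """Index k selects samples k..k+N-1 (1-based), i.e. N consecutive samples."""
        if not 1 <= index <= len(self.buf) - self.n + 1:
            raise IndexError("index out of range")
        return self.buf[index - 1:index - 1 + self.n]

    def update(self, excitation):
        """Discard the oldest subframe and store the newest sound-source signal."""
        if len(excitation) != self.n:
            raise ValueError("excitation must be one subframe long")
        self.buf = self.buf[self.n:] + list(excitation)
```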
An adaptive-codebook search identifies the periodicity component of the sound-source signal using the adaptive codebook 7 storing past sound-source signals. That is, a subframe length (=40 samples) of past sound-source signals in the adaptive codebook 7 is extracted while changing, one sample at a time, the point at which read-out from the adaptive codebook 7 starts, and the extracted sound-source signals are input to the LPC synthesis filter 6 to create a pitch synthesis signal βAPL, where PL represents a past periodicity signal (adaptive code vector), which corresponds to delay L, extracted from the adaptive codebook 7, A the impulse response of the LPC synthesis filter 6, and β the gain of the adaptive codebook.
An arithmetic unit 9 finds an error power EL between the input voice X and βAPL in accordance with the following equation:
$$E_L = |X - \beta A P_L|^2 \qquad (2)$$
If we let APL represent a weighted synthesized output from the adaptive codebook, Rpp the autocorrelation of APL and Rxp the cross-correlation between APL and the input signal X, then an adaptive code vector PL at a pitch lag Lopt for which the error power of Equation (2) is minimum will be expressed by the following equation:
$$P_L = \arg\max \left( R_{xp}^2 / R_{pp} \right) \qquad (3)$$
That is, the optimum starting point for read-out from the codebook is that at which the value obtained by normalizing the square of the cross-correlation Rxp between the pitch synthesis signal APL and the input signal X by the autocorrelation Rpp of the pitch synthesis signal is largest. Accordingly, an error-power evaluation unit 10 finds the pitch lag Lopt that satisfies Equation (3). Optimum pitch gain βopt is given by the following equation:
$$\beta_{opt} = R_{xp} / R_{pp} \qquad (4)$$
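Equations (3) and (4) can be exercised with a short sketch. It assumes that the synthesis filter is applied as a zero-state convolution with a truncated impulse response h, and that the candidate vectors PL are supplied in a dictionary keyed by lag; both choices and all names are illustrative, not taken from the text.

```python
def synthesize(h, v):
    """(A v)[n] = sum_k h[k] * v[n - k], truncated to the length of v."""
    return [
        sum(h[k] * v[i - k] for k in range(min(i, len(h) - 1) + 1))
        for i in range(len(v))
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_adaptive_codebook(x, candidates, h):
    """Return (L_opt, beta_opt) maximizing Rxp^2 / Rpp over the candidate lags."""
    best = None
    for lag, p in candidates.items():
        ap = synthesize(h, p)
        rxp = dot(x, ap)          # cross-correlation between X and A*P_L
        rpp = dot(ap, ap)         # autocorrelation of A*P_L
        if rpp <= 0.0:
            continue              # skip all-zero candidates
        score = rxp * rxp / rpp
        if best is None or score > best[0]:
            best = (score, lag, rxp / rpp)
    if best is None:
        raise ValueError("no usable candidate vector")
    return best[1], best[2]
```

Minimizing the error power of Equation (2) over β for a fixed lag gives β = Rxp/Rpp and a residual error of |X|² − Rxp²/Rpp, which is why the search maximizes Rxp²/Rpp.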
Next, the noise component contained in the sound-source signal is quantized using the algebraic codebook 8. The latter is constituted by a plurality of pulses of amplitude 1 or −1. By way of example, FIG. 18 illustrates pulse positions for a case where subframe length is 40 samples. The algebraic codebook 8 divides the N (=40) sampling points constituting one subframe into a plurality of pulse-system groups 1 to 4 and, for all combinations obtained by extracting one sampling point from each of the pulse-system groups, successively outputs, as noise components, pulsed signals having a +1 or a −1 pulse at each sampling point. In this example, basically four pulses are deployed per subframe. FIG. 19 is a diagram useful in describing sampling points assigned to each of the pulse-system groups 1 to 4.
(1) Eight sampling points 0, 5, 10, 15, 20, 25, 30, 35 are assigned to the pulse-system group 1;
(2) eight sampling points 1, 6, 11, 16, 21, 26, 31, 36 are assigned to the pulse-system group 2;
(3) eight sampling points 2, 7, 12, 17, 22, 27, 32, 37 are assigned to the pulse-system group 3; and
(4) 16 sampling points 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, 39 are assigned to the pulse-system group 4.
Three bits are required to express the sampling points in each of pulse-system groups 1 to 3 and one bit is required to express the sign of a pulse, for a total of four bits per group. Further, four bits are required to express the sampling points in pulse-system group 4 and one bit is required to express the sign of a pulse, for a total of five bits. Accordingly, 17 bits (4×3+5) are necessary to specify a pulsed signal output from the algebraic codebook 8 having the pulse placement of FIG. 18, and 2^17 types of pulsed signals exist.
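The bit count can be checked directly from the sampling points listed for the four pulse-system groups. The names below are illustrative.

```python
# Sampling points assigned to pulse-system groups 1 to 4.
TRACKS = {
    1: list(range(0, 40, 5)),
    2: list(range(1, 40, 5)),
    3: list(range(2, 40, 5)),
    4: sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),
}


def bits_for(num_choices):
    """Smallest number of bits that can index num_choices alternatives."""
    return (num_choices - 1).bit_length()


position_bits = {group: bits_for(len(points)) for group, points in TRACKS.items()}
sign_bits = len(TRACKS)                      # one sign bit per pulse
total_bits = sum(position_bits.values()) + sign_bits
num_pulsed_signals = 2 ** total_bits
```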
The pulse positions of each of the pulse systems are limited, as illustrated in FIG. 18. In the algebraic codebook search, a combination of pulses for which the error power relative to the input voice is minimized in the reconstructed-signal domain is decided from among the combinations of pulse positions of each of the pulse systems. More specifically, with βopt as the optimum pitch gain found by the adaptive-codebook search, the output PL of the adaptive codebook is multiplied by βopt and the product is input to an adder 11. At the same time, the pulsed signals are input successively to the adder 11 from the algebraic codebook 8 and a pulsed signal is specified that will minimize the difference between the input signal X and a reproduced signal obtained by inputting the adder output to the LPC synthesis filter 6. To this end, first a target vector X′ for an algebraic codebook search is generated in accordance with the following equation from the optimum adaptive codebook output PL and optimum pitch gain βopt obtained from the input signal X by the adaptive-codebook search:
$$X' = X - \beta_{opt} A P_L \qquad (5)$$
In this example, pulse position and amplitude (sign) are expressed by 17 bits and therefore 2^17 combinations exist. Accordingly, letting Ck represent a kth algebraic-code output vector, a code vector Ck that will minimize an evaluation-function error power D in the following equation is found by a search of the algebraic codebook:
$$D = |X' - G_c A C_k|^2 \qquad (6)$$
where Gc represents the gain of the algebraic codebook. In the algebraic codebook search, the error-power evaluation unit 10 searches for the combination of pulse position and polarity that will afford the largest normalized cross-correlation value (Rcx*Rcx/Rcc) obtained by normalizing the square of a cross-correlation value Rcx between an algebraic synthesis signal ACk and input signal X′ by an autocorrelation value Rcc of the algebraic synthesis signal. The result output from the algebraic codebook search is the position and sign (positive or negative) of each pulse. These results shall be referred to collectively as algebraic code.
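A brute-force version of this search over all 2^17 combinations can be sketched as follows. It assumes the synthesis filter A is represented by a truncated impulse response h and that the correlations are precomputed as d = AᵀX′ and Φ = AᵀA. Practical encoders use faster search strategies, and all names here are hypothetical.

```python
from itertools import product

TRACKS = [
    list(range(0, 40, 5)),
    list(range(1, 40, 5)),
    list(range(2, 40, 5)),
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),
]


def synthesis_columns(h, n=40):
    """Column p of A is the impulse response h delayed by p samples."""
    return [[h[i - p] if 0 <= i - p < len(h) else 0.0 for i in range(n)]
            for p in range(n)]


def search_algebraic_codebook(target, h):
    """Return (positions, signs, Gc) maximizing Rcx^2 / Rcc for a 40-sample target."""
    n = len(target)
    cols = synthesis_columns(h, n)
    d = [sum(c * t for c, t in zip(cols[p], target)) for p in range(n)]
    phi = [[sum(a * b for a, b in zip(cols[p], cols[q])) for q in range(n)]
           for p in range(n)]
    best_score, best = -1.0, None
    for positions in product(*TRACKS):
        for signs in product((1, -1), repeat=len(TRACKS)):
            rcx = sum(s * d[p] for s, p in zip(signs, positions))
            rcc = sum(si * sj * phi[pi][pj]
                      for si, pi in zip(signs, positions)
                      for sj, pj in zip(signs, positions))
            if rcc <= 0.0:
                continue
            score = rcx * rcx / rcc
            if score > best_score:
                # Gc = Rcx / Rcc minimizes D for this code vector.
                best_score, best = score, (positions, signs, rcx / rcc)
    return best
```

Because the score is unchanged when every sign is flipped, the product of the returned gain and signs is the meaningful quantity; a negative Gc with inverted signs describes the same pulsed signal.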
Gain quantization will be described next. With the G.729A system, algebraic codebook gain is not quantized directly. Rather, the adaptive codebook gain Ga (=βopt) and a correction coefficient γ of the algebraic codebook gain Gc are vector quantized. The algebraic codebook gain Gc and the correction coefficient γ are related as follows:
$$G_c = g' \times \gamma$$
where g′ represents the gain of the present frame predicted from the logarithmic gains of the four past subframes.
A gain quantizer 12 has a gain quantization table (gain codebook), not shown, for which there are prepared 128 (=2^7) combinations of adaptive codebook gain Ga and correction coefficients γ for algebraic codebook gain. The method of the gain codebook search includes (1) extracting one set of table values from the gain quantization table with regard to an output vector from the adaptive codebook and an output vector from the algebraic codebook and setting these values in gain varying units 13, 14, respectively; (2) multiplying these vectors by gains Ga, Gc using the gain varying units 13, 14, respectively, and inputting the products to the LPC synthesis filter 6; and (3) selecting, by way of the error-power evaluation unit 10, the combination for which the error power relative to the input signal X is minimized.
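The three steps of the gain codebook search can be sketched as below. The sketch assumes the adaptive and algebraic code vectors have already been passed through the LPC synthesis filter (which is linear, so scaling before or after filtering is equivalent), that the predicted gain g′ is supplied by the caller, and that the returned index is 0-based. The table values and names are illustrative.

```python
def search_gain_codebook(x, ap, ac, g_pred, table):
    """Return (index, error power) of the best (Ga, gamma) pair in the table.

    x      : input signal for the subframe
    ap, ac : synthesized adaptive and algebraic code vectors (A*P_L, A*C_k)
    g_pred : predicted gain g' from past subframes
    table  : list of (Ga, gamma) pairs
    """
    best_index, best_err = None, float("inf")
    for index, (ga, gamma) in enumerate(table):
        gc = g_pred * gamma                       # Gc = g' x gamma
        err = sum((xi - ga * pi - gc * ci) ** 2
                  for xi, pi, ci in zip(x, ap, ac))
        if err < best_err:
            best_index, best_err = index, err
    return best_index, best_err
```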
A channel encoder 15 creates channel data by multiplexing (1) an LSP code, which is the quantization index of the LSP, (2) a pitch-lag code Lopt, (3) an algebraic code, which is an algebraic codebook index, and (4) a gain code, which is a quantization index of gain. The channel encoder 15 sends this channel data to a decoder.
Thus, as described above, the G.729A encoding system produces a model of the speech generation process, quantizes the characteristic parameters of this model and transmits the parameters, thereby making it possible to compress speech efficiently.
Decoder Structure and Operation
FIG. 20 is a block diagram illustrating a G.729A-compliant decoder. Channel data sent from the encoder side is input to a channel decoder 21, which proceeds to output an LSP code, pitch-lag code, algebraic code and gain code. The decoder decodes voice data based upon these codes. The operation of the decoder will now be described, though parts of the description will be redundant because functions of the decoder are included in the encoder.
Upon receiving the LSP code as an input, an LSP dequantizer 22 applies dequantization and outputs an LSP dequantized value. An LSP interpolator 23 interpolates an LSP dequantized value of the first subframe of the present frame from the LSP dequantized value in the second subframe of the present frame and the LSP dequantized value in the second subframe of the previous frame. Next, a parameter deconverter 24 converts the LSP interpolated value and the LSP dequantized value to LPC synthesis filter coefficients. A G.729A-compliant synthesis filter 25 uses the LPC coefficient converted from the LSP interpolated value in the initial first subframe and uses the LPC coefficient converted from the LSP dequantized value in the ensuing second subframe.
An adaptive codebook 26 outputs a pitch signal of subframe length (=40 samples) from a read-out starting point specified by a pitch-lag code, and an algebraic codebook 27 outputs a pulse position and pulse polarity from a read-out position that corresponds to an algebraic code. A gain dequantizer 28 calculates an adaptive codebook gain dequantized value and an algebraic codebook gain dequantized value from the gain code applied thereto and sets these values in gain varying units 29, 30, respectively. An adder 31 creates a sound-source signal by adding a signal, which is obtained by multiplying the output of the adaptive codebook by the adaptive codebook gain dequantized value, and a signal obtained by multiplying the output of the algebraic codebook by the algebraic codebook gain dequantized value. The sound-source signal is input to the LPC synthesis filter 25. As a result, reconstructed speech can be obtained from the LPC synthesis filter 25.
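The decoder's sound-source construction and synthesis filtering can be sketched as follows. The sketch assumes the all-pole filter of Equation (1) is run as the recursion s[n] = e[n] − Σ αi·s[n−i] with the filter memory passed in explicitly; the function names are hypothetical.

```python
def build_excitation(adaptive_vec, algebraic_vec, ga, gc):
    """Sound-source signal: Ga * adaptive output + Gc * algebraic output."""
    return [ga * p + gc * c for p, c in zip(adaptive_vec, algebraic_vec)]


def lpc_synthesis(excitation, alpha, memory):
    """Run H(z) = 1 / (1 + sum alpha_i z^-i) over one subframe.

    memory holds at least P past output samples, most recent last.
    Returns (output samples, updated memory of length P). Assumes P >= 1.
    """
    if len(memory) < len(alpha):
        raise ValueError("memory must hold at least P samples")
    history = list(memory)
    out = []
    for e in excitation:
        s = e - sum(a * history[-i] for i, a in enumerate(alpha, start=1))
        out.append(s)
        history.append(s)
    return out, history[-len(alpha):]
```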
In the initial state, the content of the adaptive codebook 26 on the decoder side is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe so that the sound-source signal obtained in the present frame will be stored in the adaptive codebook 26. In other words, the adaptive codebook 7 of the encoder and the adaptive codebook 26 of the decoder are always maintained in the identical, latest state.
(2) Description of EVRC
EVRC is characterized in that the number of bits transmitted per frame is varied in dependence upon the nature of the input signal. More specifically, bit rate is raised in steady segments such as vowel segments and the number of transmitted bits is lowered in silent or transient segments, thereby reducing the average bit rate over time. EVRC bit rates are shown in Table 1.
TABLE 1
MODE        BIT RATE (bits/frame)   BIT RATE (kbits/s)   VOICE SEGMENT OF INTEREST
FULL RATE   171                     8.55                 STEADY SEGMENT
HALF RATE   80                      4.0                  VARIABLE SEGMENT
⅛ RATE      16                      0.8                  SILENT SEGMENT
With EVRC, the rate of the input signal of the present frame is determined. The rate determination involves dividing the frequency region of an input speech signal into high and low regions and calculating power in each region, comparing the power values of each of these regions with two predetermined threshold values, selecting the full rate if the low-region power and the high-region power exceed the threshold values, selecting the half rate if only the low-region power or high-region power exceeds the threshold value, and selecting the ⅛ rate if the low- and high-region power values are both lower than the threshold values.
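The decision rule can be written as a small function. It assumes one threshold per frequency region, which is one reading of the "two predetermined threshold values" mentioned above; the names and the rate labels are illustrative. The bits-per-frame figures come from Table 1 and the 20-ms frame length is that of EVRC.

```python
BITS_PER_FRAME = {"FULL": 171, "HALF": 80, "EIGHTH": 16}
FRAME_MS = 20


def select_rate(low_power, high_power, low_threshold, high_threshold):
    """Pick the EVRC rate from the power in the low and high regions."""
    low_hit = low_power > low_threshold
    high_hit = high_power > high_threshold
    if low_hit and high_hit:
        return "FULL"
    if low_hit or high_hit:
        return "HALF"
    return "EIGHTH"


def kbits_per_second(rate):
    """Bits per frame divided by frame length in ms gives kbits/s."""
    return BITS_PER_FRAME[rate] / FRAME_MS
```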
FIG. 21 illustrates the structure of an EVRC encoder. With EVRC, an input signal that has been segmented into 20-ms frames (160 samples) is input to an encoder. Further, one frame of the input signal is segmented into three subframes, as indicated in Table 2 below. It should be noted that the structure of the encoder is substantially the same in the case of both full rate and half rate, and that only the numbers of quantization bits of the quantizers differ between the two. The description rendered below, therefore, will relate to the full-rate case.
TABLE 2
SUBFRAME NO.                          1       2       3
SUBFRAME LENGTH (NUMBER OF SAMPLES)   53      53      54
SUBFRAME LENGTH (MILLISECONDS)        6.625   6.625   6.750
As shown in FIG. 22, an LPC (Linear Prediction Coefficient) analyzer 41 obtains LPC coefficients by LPC analysis using 160 samples of the input signal of the present frame and 80 samples of the pre-read segment, for a total of 240 samples. An LSP quantizer 42 converts the LPC coefficients to LSP parameters and then performs quantization to obtain LSP code. An LSP dequantizer 43 obtains an LSP dequantized value from the LSP code. Using the LSP dequantized value found in the present frame (the LSP dequantized value of the third subframe) and the LSP dequantized value found in the previous frame, an LSP interpolator 44 predicts the LSP dequantized value of the 0th, 1st and 2nd subframes of the present frame by linear interpolation.
Next, a pitch analyzer 45 obtains the pitch lag and pitch gain of the present frame. According to EVRC, pitch analysis is performed twice per frame. The position of the analytical window of pitch analysis is as shown in FIG. 22. The procedure of pitch analysis is as follows:
(1) The input signal of the present frame and the pre-read signal are input to an LPC inverse filter composed of the above-mentioned LPC coefficients, whereby an LPC residual signal is obtained. If H(z) represents the LPC synthesis filter, then the LPC inverse filter is 1/H(z).
(2) The autocorrelation function of the LPC residual signal is found, and the pitch lag and pitch gain for which the autocorrelation function will be maximized are obtained.
(3) The above-described processing is executed at two analytical window positions. Let Lag1 and Gain1 represent the pitch lag and pitch gain found by the first analysis, respectively, and let Lag2 and Gain2 represent the pitch lag and pitch gain found by the second analysis, respectively.
(4) When the difference between Gain1 and Gain2 is equal to or greater than a predetermined threshold value, Gain1 and Lag1 are adopted as the pitch gain and pitch lag, respectively, of the present frame. When the difference between Gain1 and Gain2 is less than the predetermined threshold value, Gain2 and Lag2 are adopted as the pitch gain and pitch lag, respectively, of the present frame.
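Steps (2) and (4) of the procedure can be sketched as follows. The sketch assumes that the pitch gain is the autocorrelation at the chosen lag normalized by the residual energy and that the "difference" in step (4) is the signed value Gain1 − Gain2; neither detail is stated in the text, and the names are hypothetical.

```python
def analyze_pitch(residual, min_lag, max_lag):
    """Return (lag, gain) maximizing the normalized autocorrelation of the residual."""
    energy = sum(v * v for v in residual)
    best_lag, best_gain = min_lag, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(residual[i] * residual[i - lag]
                   for i in range(lag, len(residual)))
        gain = corr / energy if energy > 0.0 else 0.0
        if gain > best_gain:
            best_lag, best_gain = lag, gain
    return best_lag, best_gain


def choose_pitch(lag1, gain1, lag2, gain2, threshold):
    """Step (4): adopt the first analysis only if Gain1 exceeds Gain2 by the threshold."""
    if gain1 - gain2 >= threshold:
        return lag1, gain1
    return lag2, gain2
```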
The pitch lag and pitch gain are found by the above-described procedure. A pitch-gain quantizer 46 quantizes the pitch gain using a quantization table and outputs pitch-gain code. A pitch-gain dequantizer 47 dequantizes the pitch-gain code and inputs the result to a gain varying unit 48. Whereas pitch lag and pitch gain are obtained on a per-subframe basis with G.729A, EVRC differs in that pitch lag and pitch gain are obtained on a per-frame basis.
Further, EVRC differs in that an input-voice correction unit 49 corrects the input signal in dependence upon the pitch-lag code. That is, rather than finding the pitch lag and pitch gain for which error relative to the input signal is smallest, as is done in accordance with G.729A, the input-voice correction unit 49 in EVRC corrects the input signal in such a manner that it will approach closest to the output of the adaptive codebook decided by the pitch lag and pitch gain found by pitch analysis. More specifically, the input-voice correction unit 49 converts the input signal to a residual signal by an LPC inverse filter and time-shifts the position of the pitch peak in the region of the residual signal in such a manner that the position will be the same as the pitch-peak position in the output of an adaptive codebook 50.
Next, a noise-like sound-source signal and gain are decided on a per-subframe basis. First, an adaptive-codebook synthesized signal obtained by passing the output of the adaptive codebook 50 through the gain varying unit 48 and an LPC synthesis filter 51 is subtracted from the corrected input signal, which is output from the input-voice correction unit 49, by an arithmetic unit 52, thereby generating a target signal X′ of an algebraic codebook search. An EVRC algebraic codebook 53 is composed of a plurality of pulses, in a manner similar to that of G.729A, and 35 bits per subframe are allocated in the full-rate case. Table 3 below illustrates the full-rate pulse positions.
TABLE 3
EVRC ALGEBRAIC CODEBOOK (FULL RATE)
PULSE SYSTEM   PULSE POSITION                             POLARITY
T0             0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50   +/−
T1             1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51   +/−
T2             2, 7, 12, 17, 22, 27, 32, 37, 42, 47, 52   +/−
T3             3, 8, 13, 18, 23, 28, 33, 38, 43, 48, 53   +/−
T4             4, 9, 14, 19, 24, 29, 34, 39, 44, 49, 54   +/−
The method of searching the algebraic codebook is similar to that of G.729A, though the number of pulses selected from each pulse system differs. Two pulses are assigned to three of the five pulse systems, and one pulse is assigned to two of the five pulse systems. Combinations of systems that assign one pulse are limited to four, namely T3-T4, T4-T0, T0-T1 and T1-T2. Accordingly, combinations of pulse systems and pulse numbers are as shown in Table 4 below.
TABLE 4
PULSE-SYSTEM COMBINATIONS
       ONE-PULSE SYSTEMS   TWO-PULSE SYSTEMS
(1)    T3, T4              T0, T1, T2
(2)    T4, T0              T1, T2, T3
(3)    T0, T1              T2, T3, T4
(4)    T1, T2              T3, T4, T0
Thus, since there are systems that assign one pulse and systems that assign two pulses, the number of bits allocated to each pulse system differs depending upon the number of pulses. Table 5 below indicates the bit distribution of the algebraic codebook in the full-rate case.
TABLE 5
BIT DISTRIBUTION OF EVRC ALGEBRAIC CODEBOOK
NUMBER OF PULSES   INFORMATION                                   BIT DISTRIBUTION
ONE PULSE          COMBINATIONS                                  2 BITS (FOUR)
                   PULSE POSITIONS                               7 BITS (11 × 11 = 121 < 128)
                   POLARITY                                      2 BITS
TWO PULSES         PULSE POSITIONS                               21 BITS (7 × 3)
                   POLARITY (SAME AS THAT OF ONE-PULSE SYSTEM)   3 BITS (3 × 1)
TOTAL                                                            35 BITS
Since combinations of one-pulse systems are four in number, two bits are necessary. If 11 pulse positions in two pulse systems in which the number of pulses is one are arrayed in the X and Y directions, an 11×11 grid can be formed and a pulse position in the two pulse systems can be specified by one grid point. Accordingly, seven bits are necessary to specify a pulse position in two pulse systems in which the number of pulses is one, and two bits are necessary to express the polarity of a pulse in two pulse systems in which the number of pulses is one. Further, 7×3 bits are necessary to specify a pulse position in three pulse systems in which the number of pulses is two, and 1×3 bits are necessary to express the polarity of a pulse in three pulse systems in which the number of pulses is two. It should be noted that the polarity of pulses in the one-pulse systems is the same. Thus, in EVRC, an algebraic codebook can be expressed by a total of 35 bits.
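The 35-bit total and the 11×11 grid addressing can be checked with a few lines. The packing of the two position indices into one grid index (row-major order) is an assumption made for illustration, the 7 bits per two-pulse system are taken from Table 5 as given, and the names are hypothetical.

```python
POSITIONS_PER_SYSTEM = 11          # each of T0..T4 has 11 pulse positions
ONE_PULSE_COMBINATIONS = [("T3", "T4"), ("T4", "T0"), ("T0", "T1"), ("T1", "T2")]


def bits_for(num_choices):
    """Smallest number of bits that can index num_choices alternatives."""
    return (num_choices - 1).bit_length()


combination_bits = bits_for(len(ONE_PULSE_COMBINATIONS))            # 2 bits
one_pulse_position_bits = bits_for(POSITIONS_PER_SYSTEM ** 2)       # 121 grid points
one_pulse_polarity_bits = 2
two_pulse_position_bits = 7 * 3                                     # per Table 5
two_pulse_polarity_bits = 1 * 3
total_bits = (combination_bits + one_pulse_position_bits + one_pulse_polarity_bits
              + two_pulse_position_bits + two_pulse_polarity_bits)


def pack_grid(x_index, y_index):
    """Map position indices (0..10 each) of two one-pulse systems to one grid point."""
    return x_index * POSITIONS_PER_SYSTEM + y_index


def unpack_grid(grid_index):
    return divmod(grid_index, POSITIONS_PER_SYSTEM)
```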
In the algebraic codebook search, the algebraic codebook 53 generates an algebraic synthesis signal by successively inputting pulsed signals to a gain multiplier 54 and LPC synthesis filter 55, and an arithmetic unit 56 calculates the difference between the algebraic synthesis signal and target signal X′ and obtains the code vector Ck that will minimize the evaluation-function error power D in the following equation:
$$D = |X' - G_c A C_k|^2$$
where Gc represents the gain of the algebraic codebook. In the algebraic codebook search, an error-power evaluation unit 59 searches for the combination of pulse position and polarity that will afford the largest normalized cross-correlation value (Rcx*Rcx/Rcc) obtained by normalizing the square of a cross-correlation value Rcx between the algebraic synthesis signal ACk and target signal X′ by an autocorrelation value Rcc of the algebraic synthesis signal.
Algebraic codebook gain is not quantized directly. Rather, the correction coefficient γ of the algebraic codebook gain is scalar quantized by five bits per subframe. The correction coefficient γ is a value (γ=Gc/g′) obtained by normalizing algebraic codebook gain Gc by g′, where g′ represents gain predicted from past subframes.
A channel multiplexer 60 creates channel data by multiplexing (1) an LSP code, which is the quantization index of the LSP, (2) a pitch-lag code, (3) an algebraic code, which is an algebraic codebook index, (4) a pitch-gain code, which is the quantization index of the pitch gain, and (5) an algebraic codebook gain code, which is the quantization index of algebraic codebook gain. The multiplexer 60 sends the channel data to a decoder.
It should be noted that the decoder is so adapted as to decode the LSP code, pitch-lag code, algebraic code, pitch-gain code and algebraic codebook gain code sent from the encoder. The EVRC decoder can be created in a manner similar to that in which a G.729 decoder is created to deal with a G.729 encoder. The EVRC decoder, therefore, need not be described here.
(3) Conversion of Voice Code According to the Prior Art
It is believed that the growing popularity of the Internet and cellular telephones will lead to ever increasing voice traffic by Internet users and users of cellular telephone networks. However, communication between a cellular telephone network and the Internet cannot take place if a voice encoding scheme used by the cellular telephone network and a voice encoding scheme used by the Internet differ.
FIG. 30 is a diagram showing the principle of a typical voice code conversion method according to the prior art. This method shall be referred to as “prior art 1” below. This example takes into consideration only a case where voice input to a terminal 71 by a user A is sent to a terminal 72 of a user B. It is assumed here that the terminal 71 possessed by user A has only an encoder 71a of an encoding scheme 1 and that the terminal 72 of user B has only a decoder 72a of an encoding scheme 2.
Voice that has been produced by user A on the transmitting side is input to the encoder 71a of encoding scheme 1 incorporated in terminal 71. The encoder 71a encodes the input speech signal to a voice code of the encoding scheme 1 and outputs this code to a transmission path 71b. When the voice code enters via the transmission path 71b, a decoder 73a of the voice code converter 73 decodes reproduced voice from the voice code of encoding scheme 1. An encoder 73b of the voice code converter 73 then converts the reconstructed speech signal to voice code of the encoding scheme 2 and sends this voice code to a transmission path 72b. The voice code of the encoding scheme 2 is input to the terminal 72 through the transmission path 72b. Upon receiving the voice code as an input, the decoder 72a decodes reconstructed speech from the voice code of the encoding scheme 2. As a result, the user B on the receiving side is capable of hearing the reconstructed speech. Processing for decoding voice that has first been encoded and then re-encoding the decoded voice is referred to as “tandem connection”.
With the implementation of prior art 1, as described above, the practice is to rely upon the tandem connection in which a voice code that has been encoded by voice encoding scheme 1 is decoded into voice temporarily, after which the decoded voice is re-encoded by voice encoding scheme 2. Problems arise as a consequence, namely a pronounced decline in the quality of reconstructed speech and an increase in delay. In other words, voice (reconstructed speech) that has been encoded and compressed in terms of information content is voice having less information than that of the original voice (original sound). Hence the sound quality of the reconstructed speech is much poorer than that of the original sound. In particular, with recent low-bit-rate voice encoding schemes typified by G.729A and EVRC, encoding is performed while discarding a great deal of information contained in the input voice in order to realize a high compression rate. When use is made of a tandem connection in which encoding and decoding are repeated, the quality of reconstructed speech undergoes a marked decline.
A technique proposed as a method of solving this problem of the tandem connection decomposes voice code into parameter codes such as LSP code and pitch-lag code without returning the voice code to a speech signal, and converts each parameter code separately to a code of a separate voice encoding scheme (see the specification of Japanese Patent Application No. 2001-75427). FIG. 24 is a diagram illustrating the principle of this proposal, which shall be referred to as “prior art 2” below.
Encoder 71a of encoding scheme 1 incorporated in terminal 71 encodes a speech signal produced by user A to a voice code of encoding scheme 1 and sends this voice code to transmission path 71b. A voice code conversion unit 74 converts the voice code of encoding scheme 1 that has entered from the transmission path 71b to a voice code of encoding scheme 2 and sends this voice code to transmission path 72b. Decoder 72a in terminal 72 decodes reconstructed speech from the voice code of encoding scheme 2 that enters via the transmission path 72b, and user B is capable of hearing the reconstructed speech.
The encoding scheme 1 encodes a speech signal by (1) a first LSP code obtained by quantizing LSP parameters, which are found from linear prediction coefficients (LPC) obtained by frame-by-frame linear prediction analysis; (2) a first pitch-lag code, which specifies the output signal of an adaptive codebook that is for outputting a periodic sound-source signal; (3) a first algebraic code (noise code), which specifies the output signal of an algebraic codebook (or noise codebook) that is for outputting a noise-like sound-source signal; and (4) a first gain code obtained by quantizing pitch gain, which represents the amplitude of the output signal of the adaptive codebook, and algebraic codebook gain, which represents the amplitude of the output signal of the algebraic codebook. The encoding scheme 2 encodes a speech signal by (1) a second LSP code, (2) a second pitch-lag code, (3) a second algebraic code (noise code) and (4) a second gain code, which are obtained by quantization in accordance with a quantization method different from that of voice encoding scheme 1.
The voice code conversion unit 74 has a code demultiplexer 74a, an LSP code converter 74b, a pitch-lag code converter 74c, an algebraic code converter 74d, a gain code converter 74e and a code multiplexer 74f. The code demultiplexer 74a demultiplexes the voice code of voice encoding scheme 1, which code enters from the encoder 71a of terminal 71 via the transmission path 71b, into codes of a plurality of components necessary to reconstruct a speech signal, namely ① LSP code, ② pitch-lag code, ③ algebraic code and ④ gain code. These codes are input to the code converters 74b, 74c, 74d and 74e, respectively. The latter convert the entered LSP code, pitch-lag code, algebraic code and gain code of voice encoding scheme 1 to LSP code, pitch-lag code, algebraic code and gain code of voice encoding scheme 2, and the code multiplexer 74f multiplexes these codes of voice encoding scheme 2 and sends the multiplexed signal to the transmission path 72b.
FIG. 25 is a block diagram illustrating the voice code conversion unit 74 in which the construction of the code converters 74b to 74e is clarified. Components in FIG. 25 identical with those shown in FIG. 24 are designated by like reference characters. The code demultiplexer 74a demultiplexes an LSP code 1, a pitch-lag code 1, an algebraic code 1 and a gain code 1 from the voice code of encoding scheme 1 that enters from the transmission path via an input terminal #1, and inputs these codes to the code converters 74b, 74c, 74d and 74e, respectively.
The LSP code converter 74b has an LSP dequantizer 74b1 for dequantizing the LSP code 1 of encoding scheme 1 and outputting an LSP dequantized value, and an LSP quantizer 74b2 for quantizing the LSP dequantized value using an LSP quantization table of encoding scheme 2 and outputting an LSP code 2. The pitch-lag code converter 74c has a pitch-lag dequantizer 74c1 for dequantizing the pitch-lag code 1 of encoding scheme 1 and outputting a pitch-lag dequantized value, and a pitch-lag quantizer 74c2 for quantizing the pitch-lag dequantized value by encoding scheme 2 and outputting a pitch-lag code 2. The algebraic code converter 74d has an algebraic dequantizer 74d1 for dequantizing the algebraic code 1 of encoding scheme 1 and outputting an algebraic dequantized value, and an algebraic quantizer 74d2 for quantizing the algebraic dequantized value using an algebraic code quantization table of encoding scheme 2 and outputting an algebraic code 2. The gain code converter 74e has a gain dequantizer 74e1 for dequantizing the gain code 1 of encoding scheme 1 and outputting a gain dequantized value, and a gain quantizer 74e2 for quantizing the gain dequantized value using a gain quantization table of encoding scheme 2 and outputting a gain code 2.
The code multiplexer 74f multiplexes the LSP code 2, pitch-lag code 2, algebraic code 2 and gain code 2, which are output from the quantizers 74b2, 74c2, 74d2 and 74e2, respectively, thereby creating a voice code based upon encoding scheme 2, and sends this code to the transmission path from an output terminal #2.
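The dequantize-then-requantize path through the code converters 74b to 74e can be illustrated by the following minimal Python sketch. It is not the implementation of prior art 2: every table value, the dictionary keys and the function names (dequantize, quantize, convert_code, convert_frame) are hypothetical, each parameter is treated as a single scalar with a nearest-entry search, and the bit-level demultiplexing and multiplexing performed by 74a and 74f are omitted.

```python
from typing import Dict, Sequence


def dequantize(code: int, table: Sequence[float]) -> float:
    """Return the value that a scheme-1 code index stands for."""
    return table[code]


def quantize(value: float, table: Sequence[float]) -> int:
    """Return the index of the scheme-2 table entry closest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


def convert_code(code1: int, table1: Sequence[float], table2: Sequence[float]) -> int:
    """Dequantize with the scheme-1 table, then requantize with the scheme-2 table."""
    return quantize(dequantize(code1, table1), table2)


def convert_frame(codes1: Dict[str, int],
                  tables1: Dict[str, Sequence[float]],
                  tables2: Dict[str, Sequence[float]]) -> Dict[str, int]:
    """Convert each parameter code separately, as converters 74b to 74e do."""
    return {name: convert_code(code, tables1[name], tables2[name])
            for name, code in codes1.items()}


# Hypothetical toy tables; real codecs use far larger, scheme-specific tables.
TABLES_1 = {
    "lsp": [0.10, 0.20, 0.30, 0.40],
    "pitch_lag": [20.0, 40.0, 60.0, 80.0],
    "algebraic": [-1.0, -0.5, 0.5, 1.0],
    "gain": [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
}
TABLES_2 = {
    "lsp": [0.0, 0.12, 0.28, 0.44],
    "pitch_lag": [25.0, 50.0, 75.0],
    "algebraic": [-1.0, 0.0, 1.0],
    "gain": [0.0, 0.4, 0.8, 1.2],
}

# Codes as they might leave the demultiplexer (illustrative values only).
codes1 = {"lsp": 2, "pitch_lag": 1, "algebraic": 3, "gain": 5}
codes2 = convert_frame(codes1, TABLES_1, TABLES_2)
print(codes2)
```

Because each parameter is handled independently, no speech signal is reconstructed at any point in this sketch, which is the property that distinguishes this approach from the tandem connection.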
The tandem connection scheme (prior art 1) of FIG. 23 temporarily decodes, to speech, voice code that has been encoded by encoding scheme 1, receives this reproduced speech as its input, and executes encoding and decoding again. As a result, voice parameters are extracted from reproduced speech in which the amount of information is much less than that of the original sound owing to re-execution of encoding (namely compression of voice information). Consequently, the voice code thus obtained is not necessarily the best. By contrast, in accordance with the voice code conversion apparatus of prior art 2 shown in FIG. 24, voice code of encoding scheme 1 is converted to voice code of encoding scheme 2 via the process of dequantization and quantization. This makes it possible to perform voice code conversion in which there is much less degradation in comparison with the tandem connection of prior art 1. Further, since it is unnecessary to decode to voice even once for the sake of voice code conversion, another advantage is that delay, which is a problem with the tandem connection, is reduced.
In a VoIP network, G.729A is used as the voice encoding scheme. In a cdma 2000 network, on the other hand, which is expected to serve as a next-generation cellular telephone system, EVRC is adopted. Table 6 below indicates results obtained by comparing the main specifications of G.729A and EVRC.
TABLE 6
COMPARISON OF G.729A AND EVRC MAIN SPECIFICATIONS
                       G.729A    EVRC
SAMPLING FREQUENCY     8 kHz     8 kHz
FRAME LENGTH           10 ms     20 ms
SUBFRAME LENGTH        5 ms      6.625/6.625/6.75 ms
NUMBER OF SUBFRAMES    2         3
Frame length and subframe length according to G.729A are 10 ms and 5 ms, respectively, while EVRC frame length is 20 ms and is segmented into three subframes. This means that EVRC subframe length is 6.625 ms (only the final subframe has a length of 6.75 ms), and that both frame length and subframe length differ from those of G.729A. Table 7 below indicates the results obtained by comparing bit allocation of G.729A with that of EVRC.
TABLE 7
G.729A AND EVRC BIT ALLOCATION
                            G.729A            EVRC (FULL RATE)
PARAMETER                   SUBFRAME/FRAME    SUBFRAME/FRAME
LSP CODE                    —/18              —/29
PITCH-LAG CODE              8, 5/13           —/12
PITCH-GAIN CODE             —                 3, 3, 3/9
ALGEBRAIC CODE              17, 17/34         35, 35, 35/105
ALGEBRAIC CODE GAIN CODE    —                 5, 5, 5/15
GAIN CODE                   7, 7/14           —
NOT ASSIGNED                —                 —/1
TOTAL                       80 BITS/10 ms     171 BITS/20 ms
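The totals in Table 7 can be turned into bit rates with the short Python sketch below, given only as an arithmetic illustration. The function name bit_rate_bps and the dictionary keys are hypothetical; the EVRC per-frame figures are copied from the EVRC (full rate) column of Table 7, and the G.729A figures use only the stated total.

```python
def bit_rate_bps(bits_per_frame: int, frame_ms: int) -> float:
    """Bits per second for a codec sending bits_per_frame every frame_ms."""
    return bits_per_frame * 1000 / frame_ms


# Per-frame EVRC (full rate) allocation, copied from Table 7.
EVRC_FRAME_BITS = {
    "lsp": 29,
    "pitch_lag": 12,
    "pitch_gain": 9,
    "algebraic": 105,
    "algebraic_gain": 15,
    "not_assigned": 1,
}

evrc_total = sum(EVRC_FRAME_BITS.values())   # should match the 171-bit total
g729a_rate = bit_rate_bps(80, 10)            # 80 bits every 10 ms
evrc_rate = bit_rate_bps(evrc_total, 20)     # 171 bits every 20 ms

# Two 10 ms G.729A frames cover the same 20 ms as one EVRC frame.
g729a_bits_per_20ms = 2 * 80

print(evrc_total, g729a_rate, evrc_rate, g729a_bits_per_20ms)
```

On these figures G.729A runs at 8 kbit/s and full-rate EVRC at 8.55 kbit/s, and one 20 ms EVRC frame of 171 bits corresponds in time to two G.729A frames carrying 160 bits in total.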
In a case where voice communication is performed between a VoIP network and a network compliant with cdma 2000, a voice code conversion technique for converting one voice code to another voice code is required. The above-described examples of prior art 1 and prior art 2 are known as techniques used in such a case.
With prior art 1, speech is reconstructed temporarily from voice code according to voice encoding scheme 1, and the reconstructed speech is applied as an input and encoded again according to voice encoding scheme 2. This makes it possible to convert code without being affected by the difference between the two encoding schemes. However, when the re-encoding is performed according to this method, certain problems arise, namely pre-reading (i.e., delay) of signals owing to LPC analysis and pitch analysis, and a major decline in sound quality.
With voice code conversion according to prior art 2, a conversion to voice code is made on the assumption that subframe length in encoding scheme 1 and subframe length in encoding scheme 2 are equal, and therefore a problem arises in code conversion in a case where the subframe lengths of the two encoding schemes differ. That is, since the algebraic codebook is such that pulse position candidates are decided in accordance with subframe length, pulse positions are completely different between schemes (G.729A and EVRC) having different subframe lengths, and it is difficult to make pulse positions correspond on a one-to-one basis.
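The mismatch in subframe lengths can be made concrete with the following Python sketch, which converts the subframe durations of Table 6 to sample counts at the 8 kHz sampling frequency and lists the subframe boundaries over one 20 ms interval. The names to_samples and boundaries are hypothetical, and the sketch shows only where the subframes start and end; it does not model the pulse position candidates of either algebraic codebook.

```python
from itertools import accumulate
from typing import List, Sequence

SAMPLE_RATE_HZ = 8000  # sampling frequency of both schemes (Table 6)


def to_samples(duration_ms: float) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return round(duration_ms * SAMPLE_RATE_HZ / 1000)


def boundaries(subframe_ms: Sequence[float]) -> List[int]:
    """Sample indices at which consecutive subframes start and end."""
    return [0] + list(accumulate(to_samples(ms) for ms in subframe_ms))


# Two 10 ms G.729A frames (four 5 ms subframes) span one 20 ms EVRC frame.
g729a_bounds = boundaries([5, 5, 5, 5])
evrc_bounds = boundaries([6.625, 6.625, 6.75])

# Boundaries that the two schemes have in common within the 20 ms interval.
shared = sorted(set(g729a_bounds) & set(evrc_bounds))

print(g729a_bounds)  # [0, 40, 80, 120, 160]
print(evrc_bounds)   # [0, 53, 106, 160]
print(shared)        # [0, 160]
```

The G.729A subframes are 40 samples long while the EVRC subframes are 53, 53 and 54 samples long, so within the 20 ms interval the two schemes share no subframe boundary other than its start and end.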