This invention relates to a voice code conversion apparatus and, more particularly, to a voice code conversion apparatus to which a voice code obtained by a first voice encoding method is input for converting this voice code to a voice code of a second voice encoding method and outputting the latter voice code.
There has been an explosive increase in subscribers to cellular telephones in recent years and it is predicted that the number of such users will continue to grow in the future. Voice communication using the Internet (Voice over IP, or VoIP) is coming into increasingly greater use in intracorporate IP networks (intranets) and for the provision of long-distance telephone service. In voice communication systems such as cellular telephone systems and VoIP, use is made of voice encoding technology for compressing voice in order to utilize the communication line effectively. In the case of cellular telephones, the voice encoding technology used differs depending upon the country or system. With regard to W-CDMA expected to be employed as the next-generation cellular telephone system, AMR (Adaptive Multi-Rate) has been adopted as the common global voice-encoding method. With VoIP, on the other hand, a method compliant with ITU-T Recommendation G.729A is being used as the voice encoding method. The AMR and G.729A methods both employ a basic algorithm referred to as CELP (Code Excited Linear Prediction). The CELP operating principles will now be described taking the G.729A method as an example.
CELP Operating Principles
CELP is characterized by the efficient transmission of linear prediction coefficients (LPC coefficients) representing the voice characteristics of the human vocal tract, and a sound-source signal comprising the pitch component and noise component of voice. More specifically, in accordance with CELP, the human vocal tract is approximated by an LPC synthesis filter H(z) expressed by the following equation:
$$H(z) = \frac{1}{1 + \sum_{i=1}^{p} \alpha_i z^{-i}} \qquad (1)$$
and it is assumed that the sound-source signal input to the LPC synthesis filter H(z) can be separated into a pitch-period component representing the periodicity of voice and a noise component representing randomness. CELP, rather than transmitting the input voice signal to the decoder side directly, extracts the filter coefficients of the LPC synthesis filter and the pitch-period and noise components of the excitation signal, quantizes these to obtain quantization indices and transmits the quantization indices, thereby implementing a high degree of information compression.
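By way of illustration only, the all-pole filtering of Equation (1) can be sketched in Python as follows. The function name lpc_synthesis and the one-pole example coefficient are assumptions made for this sketch and are not taken from Recommendation G.729A.

```python
def lpc_synthesis(excitation, a):
    """Filter the excitation through H(z) = 1 / (1 + sum_i a[i-1] * z^-i).

    `a` holds the coefficients of orders 1..p. Filter memory starts at zero.
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                acc -= coeff * out[n - i]  # feedback from past output samples
        out.append(acc)
    return out


# Toy example (p = 1, coefficient -0.5): the impulse response is 0.5 ** n.
impulse = [1.0, 0.0, 0.0, 0.0]
response = lpc_synthesis(impulse, [-0.5])
```

In an actual codec the order p would be 10 and the filter memory would carry over from one subframe to the next; both points are omitted here for brevity.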
Encoder Structure and Operation
FIG. 23 is a diagram illustrating an encoder compliant with ITU-T Recommendation G.729A. As shown in FIG. 23, input signals (voice signals) X of a predetermined number (=N) of samples per frame are input to an LPC analyzer 1 frame by frame. If the sampling rate is 8 kHz and the period of a single frame is 10 ms, then one frame is composed of 80 samples. The LPC analyzer 1, regarding the human vocal tract as an all-pole filter represented by Equation (1), obtains the filter coefficients αi (i=1, . . . , p), where p represents the order of the filter. Generally, in the case of voice in the telephone band, a value of 10 to 12 is used as p. The LPC analyzer 1 performs LPC analysis using the input signal (80 samples), 40 pre-read samples and 120 past samples, for a total of 240 samples, and obtains the LPC coefficients.
A parameter converter 2 converts the LPC coefficients to LSP (Line Spectrum Pair) parameters. An LSP parameter is a frequency-domain parameter that can be converted to and from LPC coefficients. Since LSP parameters have better quantization characteristics than LPC coefficients, quantization is performed in the LSP domain. An LSP quantizer 3 quantizes the LSP parameters obtained by the conversion and obtains an LSP code and an LSP dequantized value. An LSP interpolator 4 obtains an LSP interpolated value from the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame. More specifically, one frame is divided into two subframes, namely first and second subframes, of 5 ms each, and the LPC analyzer 1 determines the LPC coefficients of the second subframe but not of the first subframe. Using the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame, the LSP interpolator 4 predicts the LSP dequantized value of the first subframe by interpolation.
A parameter reverse converter 5 converts the LSP dequantized value and the LSP interpolated value to LPC coefficients and sets these coefficients in an LPC synthesis filter 6. In this case, the LPC coefficients converted from the LSP interpolated values are used as the filter coefficients of the LPC synthesis filter 6 in the first subframe of the frame, and the LPC coefficients converted from the LSP dequantized values are used in the second subframe. In the description that follows, a character "l" bearing a subscript is the letter l, not the numeral 1.
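The interpolation performed by the LSP interpolator 4 can be sketched as follows. The description above states only that interpolation is used; the equal weighting of 0.5 and the function name interpolate_lsp are assumptions made for this sketch.

```python
def interpolate_lsp(prev_lsp, curr_lsp, weight=0.5):
    """Linearly interpolate between the previous-frame and present-frame
    LSP dequantized values (weight = 0.5 is an assumed, illustrative value)."""
    return [(1.0 - weight) * p + weight * c for p, c in zip(prev_lsp, curr_lsp)]


# Toy three-element vectors stand in for the ten LSP parameters.
prev_q = [0.25, 0.5, 1.0]   # LSP dequantized value of the previous frame
curr_q = [0.75, 1.0, 2.0]   # LSP dequantized value of the present frame

# First subframe uses the interpolated value, second subframe the present-frame value.
subframe_lsp = [interpolate_lsp(prev_q, curr_q), curr_q]
```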
After the LSP parameters LSPi (i=1, . . . , p) are quantized, for example by scalar quantization or vector quantization, in the LSP quantizer 3, the quantization indices (LSP codes) are sent to a decoder. FIG. 24 is a diagram useful in describing the quantization method. Here a large number of sets of quantized LSP parameters are stored in a quantization table 3a in correspondence with index numbers 1 to n. A distance calculation unit 3b calculates distance in accordance with the following equation:
$$d = W \cdot \sum_{i=1}^{p} \{ lsp_q(i) - lsp(i) \}^2$$
where W represents a weighting coefficient.
When q is varied from 1 to n, a minimum-distance index detector 3c finds the q for which the distance d is minimum and sends the index q to the decoder side as an LSP code.
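The table search described above can be sketched as follows. The table contents are toy values, and the function names lsp_distance and search_lsp_code are hypothetical; W is treated as a single scalar, as in the distance equation.

```python
def lsp_distance(candidate, target, weight=1.0):
    """d = W * sum_i (lsp_q(i) - lsp(i))^2 for one table entry."""
    return weight * sum((c - t) ** 2 for c, t in zip(candidate, target))


def search_lsp_code(table, lsp, weight=1.0):
    """Return (q, d): the 1-based table index with minimum distance, and that distance."""
    best_q, best_d = None, float("inf")
    for q, candidate in enumerate(table, start=1):
        d = lsp_distance(candidate, lsp, weight)
        if d < best_d:
            best_q, best_d = q, d
    return best_q, best_d


# Toy quantization table with n = 3 entries of two parameters each.
table = [[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]]
q, d = search_lsp_code(table, [0.5, 1.25])
```

The index q would be sent as the LSP code, and the table entry at q would serve as the LSP dequantized value.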
Next, sound-source and gain search processing is executed. Sound source and gain are processed on a per-subframe basis. In accordance with CELP, a sound-source signal is divided into a pitch-period component and a noise component, an adaptive codebook 7 storing a sequence of past sound-source signals is used to quantize the pitch-period component and an algebraic codebook 8 or noise codebook is used to quantize the noise component. Described below will be typical CELP-type voice encoding using the adaptive codebook 7 and algebraic codebook 8 as sound-source codebooks.
The adaptive codebook 7 is adapted to output N samples of sound-source signals (referred to as “periodicity signals”), which are delayed successively by one sample, in association with indices 1 to L. FIG. 25 is a diagram showing the structure of the adaptive codebook 7 in the case of a subframe of 40 samples (N=40). The adaptive codebook is constituted by a buffer BF for storing the pitch-period component of the latest (L+39) samples. A periodicity signal comprising 1 to 40 samples is specified by index 1, a periodicity signal comprising 2 to 41 samples is specified by index 2, . . . , and a periodicity signal comprising L to L+39 samples is specified by index L. In the initial state, the content of the adaptive codebook 7 is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe so that the sound-source signal obtained in the present frame will be stored in the adaptive codebook 7.
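The buffer handling described above can be sketched as follows. The class name AdaptiveCodebook, the choice L = 100 and the convention that the newest samples sit at the end of the buffer are assumptions made for this sketch.

```python
SUBFRAME = 40  # samples per subframe (N = 40)


class AdaptiveCodebook:
    def __init__(self, num_indices):
        self.num_indices = num_indices                       # L
        self.buffer = [0.0] * (num_indices + SUBFRAME - 1)   # L + 39 samples, all zero initially

    def vector(self, index):
        """Periodicity signal for index 1..L: samples index .. index+39 (1-based)."""
        if not 1 <= index <= self.num_indices:
            raise ValueError("index out of range")
        return self.buffer[index - 1:index - 1 + SUBFRAME]

    def update(self, excitation):
        """Discard the oldest subframe length of samples and store the newest sound-source signal."""
        if len(excitation) != SUBFRAME:
            raise ValueError("expected one subframe of samples")
        self.buffer = self.buffer[SUBFRAME:] + list(excitation)


codebook = AdaptiveCodebook(num_indices=100)  # L = 100 is an arbitrary illustrative value
new_excitation = [float(n) for n in range(SUBFRAME)]
codebook.update(new_excitation)
```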
An adaptive-codebook search identifies the periodicity component of the sound-source signal using the adaptive codebook 7 storing past sound-source signals. That is, a subframe length (=40 samples) of past sound-source signals is extracted from the adaptive codebook 7 while the point at which read-out starts is changed one sample at a time, and the extracted signals are input to the LPC synthesis filter 6 to create a pitch synthesis signal βAP_L, where P_L represents the past periodicity signal (adaptive code vector) corresponding to delay L that is extracted from the adaptive codebook 7, A the impulse response of the LPC synthesis filter 6, and β the gain of the adaptive codebook.
An arithmetic unit 9 finds an error power E_L between the input voice X and βAP_L in accordance with the following equation:
$$E_L = |X - \beta A P_L|^2 \qquad (2)$$
If we let AP_L represent the weighted synthesized signal output from the adaptive codebook, R_pp the autocorrelation of AP_L and R_xp the cross-correlation between AP_L and the input signal X, then the adaptive code vector P_L at the pitch lag L_opt for which the error power of Equation (2) is minimum is expressed by the following equation:
$$P_L = \arg\max\left(\frac{R_{xp}^2}{R_{pp}}\right) = \arg\max\left(\frac{(X^{\top} A P_L)^2}{(A P_L)^{\top}(A P_L)}\right) \qquad (3)$$
That is, the optimum starting point for read-out from the adaptive codebook is that at which the value obtained by normalizing the square of the cross-correlation R_xp between the weighted synthesized signal AP_L and the input signal X by the autocorrelation R_pp of the weighted synthesized signal is largest. Accordingly, an error-power evaluation unit 10 finds the pitch lag L_opt that satisfies Equation (3). Optimum pitch gain β_opt is given by the following equation:
$$\beta_{\mathrm{opt}} = \frac{R_{xp}}{R_{pp}} \qquad (4)$$
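The search of Equations (2) to (4) can be sketched as follows. The matrix A is applied as a truncated convolution with the impulse response h of the synthesis filter. The function names, the toy impulse response and the candidate vectors are assumptions made for this sketch.

```python
def apply_a(vector, h):
    """Compute A * vector, i.e. convolution with impulse response h truncated to len(vector)."""
    n = len(vector)
    return [sum(h[k] * vector[i - k] for k in range(min(i + 1, len(h)))) for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def search_adaptive_codebook(target, candidates, h):
    """candidates maps lag L -> adaptive code vector P_L. Returns (L_opt, beta_opt)."""
    best_lag, best_score, best_gain = None, -1.0, 0.0
    for lag, p in candidates.items():
        ap = apply_a(p, h)
        r_xp = dot(target, ap)   # cross-correlation between X and A P_L
        r_pp = dot(ap, ap)       # autocorrelation of A P_L
        if r_pp == 0.0:
            continue
        score = r_xp * r_xp / r_pp          # quantity maximized in Equation (3)
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, r_xp / r_pp  # Equation (4)
    return best_lag, best_gain


h = [1.0, 0.5]  # toy impulse response
candidates = {1: [1.0, 0.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0, 0.0], 3: [0.0, 0.0, 1.0, 0.0]}
target = [0.0, 2.0, 1.0, 0.0]  # equals 2.0 * A * P_2, so lag 2 with gain 2.0 is expected
lag_opt, beta_opt = search_adaptive_codebook(target, candidates, h)
```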
Next, the noise component contained in the sound-source signal is quantized using the algebraic codebook 8. The latter is constituted by a plurality of pulses of amplitude 1 or −1. By way of example, FIG. 26 illustrates pulse positions for a case where the subframe length is 40 samples. The algebraic codebook 8 divides the N (=40) sampling points constituting one subframe into a plurality of pulse-system groups 1 to 4 and, for all combinations obtained by extracting one sampling point from each of the pulse-system groups, successively outputs, as noise components, pulsed signals having a +1 or a −1 pulse at each extracted sampling point. In this example, basically four pulses are deployed per subframe. FIG. 27 is a diagram useful in describing the sampling points assigned to each of the pulse-system groups 1 to 4.
(1) Eight sampling points 0, 5, 10, 15, 20, 25, 30, 35 are assigned to the pulse-system group 1;
(2) eight sampling points 1, 6, 11, 16, 21, 26, 31, 36 are assigned to the pulse-system group 2;
(3) eight sampling points 2, 7, 12, 17, 22, 27, 32, 37 are assigned to the pulse-system group 3; and
(4) 16 sampling points 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, 39 are assigned to the pulse-system group 4.
Three bits are required to express the sampling points in each of pulse-system groups 1 to 3 and one bit is required to express the sign of a pulse, for a total of four bits per group. Further, four bits are required to express the sampling points in pulse-system group 4 and one bit is required to express the sign of a pulse, for a total of five bits. Accordingly, 17 bits are necessary to specify a pulsed signal output from the algebraic codebook 8 having the pulse placement of FIG. 26, and 2^17 (=2^4×2^4×2^4×2^5) types of pulsed signals exist.
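The pulse-system groups and the bit count above can be checked with the following sketch. The names TRACKS, bits_per_track and code_vector are hypothetical.

```python
# Sampling points of pulse-system groups 1 to 4, as listed above.
TRACKS = [
    list(range(0, 40, 5)),                                   # group 1: 0, 5, ..., 35
    list(range(1, 40, 5)),                                   # group 2: 1, 6, ..., 36
    list(range(2, 40, 5)),                                   # group 3: 2, 7, ..., 37
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),   # group 4: 3, 4, 8, 9, ..., 38, 39
]


def bits_per_track(track):
    """Bits to select one sampling point in the group, plus one sign bit."""
    return (len(track) - 1).bit_length() + 1


TOTAL_BITS = sum(bits_per_track(t) for t in TRACKS)  # 4 + 4 + 4 + 5


def code_vector(position_indices, signs, length=40):
    """Build a pulsed signal with one +1 or -1 pulse taken from each group."""
    vec = [0] * length
    for track, idx, sign in zip(TRACKS, position_indices, signs):
        vec[track[idx]] = sign
    return vec


example = code_vector((0, 1, 2, 3), (1, -1, 1, -1))  # pulses at points 0, 6, 12 and 9
```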
The pulse positions of each of the pulse systems are limited as illustrated in FIG. 26. In the algebraic codebook search, the combination of pulses that minimizes the error power between the reproduced signal and the input voice is decided from among the combinations of pulse positions of each of the pulse systems. More specifically, with β_opt as the optimum pitch gain found by the adaptive-codebook search, the output P_L of the adaptive codebook is multiplied by β_opt and the product is input to an adder 11. At the same time, the pulsed signals are input successively to the adder 11 from the algebraic codebook 8, and the pulsed signal is specified that will minimize the difference between the input signal X and a reproduced signal obtained by inputting the adder output to the LPC synthesis filter 6. To this end, first a target vector X′ for the algebraic codebook search is generated in accordance with the following equation using the optimum adaptive codebook output P_L and optimum pitch gain β_opt obtained from the input signal X by the adaptive-codebook search:
$$X' = X - \beta_{\mathrm{opt}} A P_L \qquad (5)$$
In this example, pulse position and amplitude (sign) are expressed by 17 bits and therefore 2^17 combinations exist, as mentioned above. Accordingly, letting C_k represent the kth algebraic-code output vector, a code vector C_k that will minimize the evaluation-function error power D in the following equation is found by a search of the algebraic codebook:
$$D = |X' - G_c A C_k|^2 \qquad (6)$$
where G_c represents the gain of the algebraic codebook. Minimizing Equation (6) is equivalent to finding the C_k, i.e., the k, that will maximize the following equation:
$$D' = \frac{(X'^{\top} A C_k)^2}{(A C_k)^{\top}(A C_k)} \qquad (7)$$
Thus, in the algebraic codebook search, the error-power evaluation unit 10 searches for the k that specifies the combination of pulse position and polarity that will afford the largest value obtained by normalizing the square of the cross-correlation between the algebraic synthesis signal AC_k and target signal X′ by the autocorrelation of the algebraic synthesis signal AC_k.
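The criterion of Equation (7) can be illustrated by an exhaustive search over a reduced toy codebook (eight sampling points in two groups) rather than the full 17-bit codebook. The function names, the toy groups and the rule of skipping candidates whose cross-correlation is not positive (so that the implied gain is positive) are assumptions made for this sketch.

```python
from itertools import product


def apply_a(vector, h):
    """Compute A * vector as a convolution with impulse response h, truncated to len(vector)."""
    n = len(vector)
    return [sum(h[k] * vector[i - k] for k in range(min(i + 1, len(h)))) for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def search_algebraic_codebook(target, tracks, h):
    """Return ((positions, signs), D') for the candidate with the largest D' of Equation (7)."""
    best, best_score = None, -1.0
    for positions in product(*tracks):
        for signs in product((1, -1), repeat=len(tracks)):
            c = [0.0] * len(target)
            for pos, sign in zip(positions, signs):
                c[pos] = float(sign)
            ac = apply_a(c, h)
            corr = dot(target, ac)
            energy = dot(ac, ac)
            if corr <= 0.0 or energy == 0.0:   # assumption: keep the implied gain positive
                continue
            score = corr * corr / energy
            if score > best_score:
                best, best_score = (positions, signs), score
    return best, best_score


toy_tracks = [[0, 2, 4, 6], [1, 3, 5, 7]]   # toy groups, not the groups of FIG. 27
h = [1.0, 0.5]
# Target built as 3.0 * A * C with a +1 pulse at point 2 and a -1 pulse at point 5.
target = [0.0, 0.0, 3.0, 1.5, 0.0, -3.0, -1.5, 0.0]
best, score = search_algebraic_codebook(target, toy_tracks, h)
```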
Gain quantization will be described next. With the G.729A system, the algebraic codebook gain is not quantized directly. Rather, the adaptive codebook gain G_a (=β_opt) and a correction coefficient γ of the algebraic codebook gain G_c are vector quantized together. The algebraic codebook gain G_c and the correction coefficient γ are related as follows:
$$G_c = g' \times \gamma$$
where g′ represents the gain of the present frame predicted from the logarithmic gains of four past subframes. A gain quantizer 12 has a gain quantization table (gain codebook), not shown, for which there are prepared 128 (=2^7) combinations of adaptive codebook gain G_a and correction coefficient γ for algebraic codebook gain. The method of the gain codebook search includes (1) extracting one set of table values from the gain quantization table with regard to an output vector from the adaptive codebook 7 and an output vector from the algebraic codebook 8 and setting these values in gain varying units 13, 14, respectively; (2) multiplying these vectors by gains G_a, G_c using the gain varying units 13, 14, respectively, and inputting the products to the LPC synthesis filter 6; and (3) selecting, by way of the error-power evaluation unit 10, the combination for which the error power relative to the input signal X is smallest.
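Steps (1) to (3) of the gain codebook search can be sketched as follows. Because the synthesis filter is linear, the sketch takes the two code vectors already passed through the filter (ap and ac); this shortcut, the four-entry toy table and the function name search_gain_codebook are assumptions made for this sketch.

```python
def search_gain_codebook(target, ap, ac, gain_table, predicted_gain):
    """Return (index, error) of the (G_a, gamma) pair with the smallest error power.

    target         -- input signal X for the subframe
    ap, ac         -- adaptive and algebraic code vectors after the synthesis filter
    predicted_gain -- g', the gain predicted from past subframes
    """
    best_index, best_error = None, float("inf")
    for index, (g_a, gamma) in enumerate(gain_table):
        g_c = predicted_gain * gamma   # G_c = g' x gamma
        error = sum((x - g_a * p - g_c * c) ** 2 for x, p, c in zip(target, ap, ac))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


ap = [1.0, 0.5, 0.0, 0.0]
ac = [0.0, 0.0, 1.0, -1.0]
gain_table = [(0.5, 0.5), (1.0, 0.5), (1.0, 1.5), (0.5, 1.5)]  # toy table; the real table has 128 entries
target = [1.0, 0.5, 3.0, -3.0]  # equals 1.0 * ap + 3.0 * ac
gain_index, gain_error = search_gain_codebook(target, ap, ac, gain_table, predicted_gain=2.0)
```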
A line encoder 15 creates line data by multiplexing (1) an LSP code, which is the quantization index of the LSP, (2) a pitch-lag code Lopt, (3) an algebraic code, which is an algebraic codebook index, and (4) a gain code, which is a quantization index of gain, and sends the line data to the decoder.
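The multiplexing performed by the line encoder 15 can be sketched as a simple concatenation of bit fields. Only the 17-bit algebraic code and the 7-bit gain code widths follow from the description above; the 18-bit and 8-bit widths used for the LSP code and pitch-lag code, the field order and the function names are illustrative assumptions, and a single subframe is shown rather than a complete frame.

```python
def pack_fields(fields):
    """Concatenate (value, width) fields, first field in the most significant bits."""
    word, total = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        word = (word << width) | value
        total += width
    return word, total


def unpack_fields(word, widths):
    """Inverse of pack_fields: split the word back into its fields."""
    values = []
    for width in reversed(widths):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return list(reversed(values))


widths = [18, 8, 17, 7]                 # LSP, pitch lag (assumed widths), algebraic, gain
codes = [0x2ABCD, 0x5A, 0x1F00F, 0x55]  # arbitrary example code values
line_data, total_bits = pack_fields(list(zip(codes, widths)))
recovered = unpack_fields(line_data, widths)
```

The unpack_fields function corresponds to the separation that a line decoder would perform on the receiving side.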
Thus, as described above, the CELP system produces a model of the voice generation process, quantizes the characteristic parameters of this model and transmits the parameters, thereby making it possible to compress voice efficiently.
Decoder Structure and Operation
FIG. 28 is a block diagram illustrating a G.729A-compliant decoder. Line data sent from the encoder side is input to a line decoder 21, which proceeds to output an LSP code, pitch-lag code, algebraic code and gain code. The decoder decodes voice data based upon these codes. The operation of the decoder will now be described, though parts of the description will be redundant because functions of the decoder are included in the encoder.
Upon receiving the LSP code as an input, an LSP dequantizer 22 applies dequantization and outputs an LSP dequantized value. An LSP interpolator 23 interpolates an LSP dequantized value of the first subframe of the present frame from the LSP dequantized value in the second subframe of the present frame and the LSP dequantized value in the second subframe of the previous frame. Next, a parameter reverse converter 24 converts the LSP interpolated value and the LSP dequantized value to LPC synthesis filter coefficients. A G.729A-compliant synthesis filter 25 uses the LPC coefficient converted from the LSP interpolated value in the initial first subframe and uses the LPC coefficient converted from the LSP dequantized value in the ensuing second subframe.
An adaptive codebook 26 outputs a pitch signal of subframe length (=40 samples) from a read-out starting point specified by a pitch-lag code, and a noise codebook 27 outputs a pulse position and pulse polarity from a read-out position that corresponds to an algebraic code. A gain dequantizer 28 calculates an adaptive codebook gain dequantized value and an algebraic codebook gain dequantized value from the gain code applied thereto and sets these values in gain varying units 29, 30, respectively. An adder 31 creates a sound-source signal by adding a signal, which is obtained by multiplying the output of the adaptive codebook by the adaptive codebook gain dequantized value, and a signal obtained by multiplying the output of the algebraic codebook by the algebraic codebook gain dequantized value. The sound-source signal is input to the LPC synthesis filter 25. As a result, reproduced voice can be obtained from the LPC synthesis filter 25.
In the initial state, the content of the adaptive codebook 26 on the decoder side is such that all signals have amplitudes of zero. Operation is such that a subframe length of the oldest signals is discarded subframe by subframe so that the sound-source signal obtained in the present frame will be stored in the adaptive codebook 26. In other words, the adaptive codebook 7 of the encoder and the adaptive codebook 26 of the decoder are always maintained in the identical, latest state.
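The decoder-side reconstruction of one subframe can be sketched as follows. The function name decode_subframe, the toy vectors and the one-pole filter coefficient are assumptions made for this sketch.

```python
def decode_subframe(adaptive_vec, algebraic_vec, g_a, g_c, a, history=()):
    """Rebuild the sound-source signal and pass it through the all-pole synthesis filter.

    a       -- LPC coefficients of orders 1..p
    history -- past output samples of the filter, most recent last
    Returns (excitation, reproduced_voice).
    """
    # Adder 31: gain-scaled adaptive codebook output plus gain-scaled algebraic codebook output.
    excitation = [g_a * p + g_c * c for p, c in zip(adaptive_vec, algebraic_vec)]
    memory = list(history)
    voice = []
    for x in excitation:
        acc = x
        for i, coeff in enumerate(a, start=1):
            if len(memory) >= i:
                acc -= coeff * memory[-i]
        memory.append(acc)
        voice.append(acc)
    return excitation, voice


excitation, voice = decode_subframe(
    adaptive_vec=[1.0, 0.0, 0.0, 0.0],
    algebraic_vec=[0.0, 1.0, 0.0, -1.0],
    g_a=0.5,
    g_c=2.0,
    a=[-0.5],
)
```

The returned excitation is the signal that would be stored in the adaptive codebook 26 for use in the next subframe.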
Difference between G.729A and AMR-Compliant Encoding Methods
The difference between the G.729A voice encoding method and the AMR voice encoding method will be described next. FIG. 29 illustrates results obtained by comparing the main features of the G.729A and AMR voice encoding methods. It should be noted that although there are a total of eight types of AMR encoding modes, the particulars shown in FIG. 29 are common to all encoding modes. The G.729A and AMR voice encoding methods have the same input-signal sampling frequency (=8 kHz), the same subframe length (=5 ms) and the same order of linear prediction (=tenth order). However, as shown in FIG. 30, they have different frame lengths and different numbers of subframes per frame. In the G.729A method, one frame is composed of two subframes, namely 0th and 1st subframes; in the AMR method, one frame is composed of four subframes, namely 0th to 3rd subframes.
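The frame-structure figures above imply the following arithmetic, sketched here with hypothetical names.

```python
SAMPLE_RATE_HZ = 8000   # common sampling frequency
SUBFRAME_MS = 5         # common subframe length

SUBFRAMES_PER_FRAME = {"G.729A": 2, "AMR": 4}


def frame_ms(codec):
    return SUBFRAMES_PER_FRAME[codec] * SUBFRAME_MS


def samples_per_frame(codec):
    return SAMPLE_RATE_HZ * frame_ms(codec) // 1000


# Number of G.729A frames spanned by one AMR frame.
g729a_frames_per_amr_frame = frame_ms("AMR") // frame_ms("G.729A")
```

One AMR frame therefore covers the same time span as two G.729A frames.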
FIG. 31 illustrates the result of comparing the bit assignments of the G.729A and AMR methods. FIG. 31 illustrates a case where the mode for the AMR method is 7.95 kbps, which is nearest to the bit rate of the G.729A method. It is obvious from FIG. 31 that although the numbers of bits (=17) of the algebraic codebook per subframe are the same, the allocations of numbers of bits necessary for other codes differ entirely. Further, with the G.729A method, adaptive codebook gain and algebraic codebook gain are vector quantized collectively and, as a consequence, there is one type of gain code per subframe. With the AMR method, however, there are two types of gain codes, namely adaptive codebook gain and algebraic codebook gain, per subframe.
As described above, a common basic algorithm is used by the G.729A method now employed widely for VoIP in the communication of voice over the Internet and by the AMR method adopted for the next-generation cellular telephone system. However, the frame lengths differ and so do the numbers of bits expressing the codes.
It is believed that the growing popularity of the Internet and cellular telephones will lead to ever increasing voice traffic by Internet users and users of cellular telephone networks. FIG. 32 is a conceptual view illustrating the relationship between networks and users in such case. In a case where a user A of a network (e.g., the Internet) 51 communicates by voice with a user B of a network (e.g., a cellular telephone network) 53, communication between the users cannot take place if a first encoding method used in voice communication by the network 51 and a second encoding method used in voice communication by the network 53 differ.
Accordingly, a voice code converter 55 is provided between the networks, as shown in FIG. 32, and is adapted to convert the voice code that has been encoded by one network to the voice code of the encoding method used in the other network.
FIG. 33 shows an example of the prior art using voice code conversion. This example takes into consideration only a case where voice input to a terminal 52 by user A is sent to a terminal 54 of user B. It is assumed here that the terminal 52 possessed by user A has only an encoder 52a of an encoding method 1 and that the terminal 54 of user B has only a decoder 54a of an encoding method 2.
Voice that has been produced by user A on the transmitting side is input to the encoder 52a of encoding method 1 incorporated in terminal 52. The encoder 52a encodes the input voice signal to a voice code of the encoding method 1 and outputs this code to a transmission path 51′. When the voice code of encoding method 1 enters via the transmission path 51′, a decoder 55a of the voice code converter 55 decodes reproduced voice from the voice code of encoding method 1. An encoder 55b of the voice code converter 55 then converts the reproduced voice signal to voice code of the encoding method 2 and sends this voice code to a transmission path 53′. The voice code of the encoding method 2 is input to the terminal 54 through the transmission path 53′. Upon receiving the voice code of the encoding method 2 as an input, the decoder 54a decodes reproduced voice from the voice code of the encoding method 2. As a result, the user B on the receiving side is capable of hearing the reproduced voice. Processing for decoding voice that has first been encoded and then re-encoding the decoded voice is referred to as “tandem connection”.
Voice (reproduced voice) consisting of information compressed by encoding processing contains a lesser amount of voice information in comparison with the original voice (source) and, hence, the sound quality of reproduced voice is inferior to that of the source. In particular, with recent low-bit-rate voice encoding typified by the G.729A and AMR methods, much information contained in input voice is discarded in the encoding process in order to realize a high compression rate. When a tandem connection in which encoding and decoding are repeated is employed, a problem which arises is a marked decline in the quality of reproduced voice.
An additional problem with tandem processing is delay. It is known that when a delay in excess of 100 ms occurs in two-way communication such as a telephone conversation, the delay is perceived by the communicating parties and is a hindrance to conversation. It is known also that even if real-time processing can be executed in voice encoding in which frame processing is carried out, a delay which is four times the frame length basically is unavoidable. For example, since frame length in the AMR method is 20 ms, the delay is at least 80 ms. With the conventional method of voice code conversion, tandem connection is required in the G.729A and AMR methods. The delay in such case is 160 ms or greater. Such a delay is perceivable by the parties in a telephone conversation and is an impediment to conversation.
As described above, in order for voice communication to be performed between networks employing different voice encoding methods, the conventional practice is to execute tandem processing in which a compressed voice code is decoded into voice and then the voice is re-encoded. Problems arise as a consequence, namely a pronounced decline in the quality of reproduced voice and an impediment to telephone conversation caused by delay.
Another problem is that the prior art does not take the effects of transmission-path error into consideration. More specifically, if wireless communication is performed using a cellular telephone and bit error or burst error occurs owing to the influence of phenomena such as fading, the voice code changes to one different from the original, and there are instances where the voice code of an entire frame is lost. If traffic over the Internet is heavy, transmission delay grows, and the voice code of an entire frame may be lost or frames may arrive out of order. Since code conversion will be performed based upon an incorrect voice code if transmission-path error is a factor, a conversion to the optimum voice code can no longer be achieved. Thus there is a need for a technique that will reduce the effects of transmission-path error.