This invention relates to a speech transcoding method and apparatus. More particularly, the invention relates to a speech transcoding method and apparatus for transcoding speech code, which has been encoded by a speech encoding apparatus used in a network such as the Internet or by a speech encoding apparatus used in a mobile/cellular telephone system, to speech code of another encoding scheme.
There has been an explosive increase in subscribers to cellular telephones in recent years and it is predicted that the number of such users will continue to grow in the future. Speech communication using the Internet (Voice over IP, or VoIP) is coming into increasingly greater use in intracorporate networks (intranets) and for the provision of long-distance telephone service. In such speech communication systems, use is made of speech encoding technology for compressing speech in order to utilize the communication channel effectively. The speech encoding scheme used, however, differs from system to system. For example, with regard to W-CDMA expected to be employed in the next generation of cellular telephone systems, AMR (Adaptive Multi-Rate) has been adopted as the common global speech encoding scheme. With VoIP, on the other hand, a scheme compliant with ITU-T Recommendation G.729A is being used widely as the speech encoding method.
It is believed that the growing popularity of the Internet and cellular telephones will be accompanied in the future by an increase in traffic involving speech communication by Internet and cellular telephone users. However, since the speech encoding schemes for cellular telephone networks differ from those of networks such as the Internet, as mentioned above, communication between networks cannot proceed without transcoding. In the prior art, therefore, it is necessary to transcode speech code encoded in one network to speech code according to the speech encoding scheme used in another network by employing a speech transcoder.
Speech Transcoding
FIG. 15 illustrates the principle of a typical speech transcoding method according to the prior art. This method shall be referred to below as “prior art 1”. In FIG. 15, only a case where speech input to a terminal 1 by user A is sent to a terminal 2 of user B will be considered. It is assumed here that the terminal 1 possessed by user A has only an encoder 1a of an encoding scheme 1 and that the terminal 2 of user B has only a decoder 2a of an encoding scheme 2.
Speech that has been produced by user A on the transmitting side is input to the encoder 1a of encoding scheme 1 incorporated in terminal 1. The encoder 1a encodes the input speech signal to a speech code of the encoding scheme 1 and outputs this code to a transmission line 1b. When the speech code of encoding scheme 1 enters via the transmission line 1b, a decoder 3a of the speech transcoder 3 decodes the speech code of encoding scheme 1 to decoded speech. An encoder 3b of the speech transcoder 3 then encodes the decoded speech signal to speech code of encoding scheme 2 and sends this speech code to a transmission line 2b. The speech code of encoding scheme 2 is input to the terminal 2 through the transmission line 2b. Upon receiving the speech code of encoding scheme 2 as an input, the decoder 2a decodes the speech code of the encoding scheme 2 to decoded speech. As a result, the user B on the receiving side is capable of hearing the decoded speech. Processing for decoding speech that has once been encoded and then re-encoding the decoded speech is referred to as “tandem connection”.
In the configuration of prior art 1, use is made of the tandem connection in which speech code that has been encoded by speech encoding scheme 1 is decoded to decoded speech, after which encoding is performed again by speech encoding scheme 2. As a consequence, problems arise in the form of a marked decline in the quality of the decoded speech and an increase in delay.
An example of a method of solving this problem of the tandem connection has been proposed (see the specification of Japanese Patent Application No. 2001-75427). The proposed method decomposes speech code into parameter code such as LSP code and pitch-lag code and converts each parameter code separately to code of another speech encoding scheme without restoring speech code to a speech signal. The principle of this method is illustrated in FIG. 16. This method shall be referred to below as “prior art 2”.
Encoder 1a of encoding scheme 1 encodes a speech signal produced by user A to a speech code of encoding scheme 1 and sends this speech code to transmission line 1b. A speech transcoding unit 4 transcodes the speech code of encoding scheme 1 that has entered from the transmission line 1b to a speech code of encoding scheme 2 and sends this speech code to transmission line 2b. Decoder 2a in terminal 2 reproduces decoded speech from the speech code of encoding scheme 2 that enters via the transmission line 2b, and user B is capable of hearing the decoded speech.
The encoding scheme 1 encodes a speech signal by (1) a first LSP code obtained by quantizing LSP parameters found from linear prediction coefficients (LPC coefficients) obtained by frame-by-frame linear prediction analysis; (2) a first pitch-lag code, which specifies the output signal of an adaptive codebook that is for outputting a periodic speech-source signal; (3) a first algebraic code (noise code), which specifies the output signal of an algebraic codebook (or noise codebook) that is for outputting a noisy speech-source signal; and (4) a first gain code obtained by quantizing pitch gain, which represents the amplitude of the output signal of the adaptive codebook, and algebraic gain, which represents the amplitude of the output signal of the algebraic codebook. The encoding scheme 2 encodes a speech signal by (1) a second LSP code, (2) a second pitch-lag code, (3) a second algebraic code (noise code) and (4) a second gain code, which are obtained by quantization in accordance with a quantization method different from that of the encoding scheme 1.
The speech transcoding unit 4 has a code demultiplexer 4a, an LSP code converter 4b, a pitch-lag code converter 4c, an algebraic code converter 4d, a gain code converter 4e and a code multiplexer 4f. The code demultiplexer 4a demultiplexes the speech code of the encoding scheme 1, which code enters from the encoder 1a of terminal 1 via the transmission line 1b, into codes of a plurality of components necessary to reconstruct a speech signal, namely (1) LSP code, (2) pitch-lag code, (3) algebraic code and (4) gain code. These codes are input to the code converters 4b, 4c, 4d and 4e, respectively. The latter transcode the entered LSP code, pitch-lag code, algebraic code and gain code of the encoding scheme 1 to LSP code, pitch-lag code, algebraic code and gain code of the encoding scheme 2, respectively, and the code multiplexer 4f multiplexes these codes of the encoding scheme 2 and sends the multiplexed signal to the transmission line 2b.
FIG. 17 is a block diagram illustrating the speech transcoding unit in which the construction of the code converters 4b to 4e is clarified. Components in FIG. 17 identical with those shown in FIG. 16 are designated by like reference characters. The code demultiplexer 4a demultiplexes an LSP code 1, a pitch-lag code 1, an algebraic code 1 and a gain code 1 from the speech code based upon encoding scheme 1 that enters from the transmission line via an input terminal #1, and inputs these codes to the code converters 4b, 4c, 4d and 4e, respectively.
The LSP code converter 4b has an LSP dequantizer 4b1 for dequantizing the LSP code 1 of encoding scheme 1 and outputting an LSP dequantized value, and an LSP quantizer 4b2 for quantizing the LSP dequantized value using an LSP quantization table according to encoding scheme 2 and outputting an LSP code 2. The pitch-lag code converter 4c has a pitch-lag dequantizer 4c1 for dequantizing the pitch-lag code 1 of encoding scheme 1 and outputting a pitch-lag dequantized value, and a pitch-lag quantizer 4c2 for quantizing the pitch-lag dequantized value using a pitch-lag quantization table according to the encoding scheme 2 and outputting a pitch-lag code 2. The algebraic code converter 4d has an algebraic code dequantizer 4d1 for dequantizing the algebraic code 1 of encoding scheme 1 and outputting an algebraic-code dequantized value, and an algebraic code quantizer 4d2 for quantizing the algebraic-code dequantized value using an algebraic code quantization table according to the encoding scheme 2 and outputting an algebraic code 2. The gain code converter 4e has a gain dequantizer 4e1 for dequantizing the gain code 1 of encoding scheme 1 and outputting a gain dequantized value, and a gain quantizer 4e2 for quantizing the gain dequantized value using a gain quantization table according to encoding scheme 2 and outputting a gain code 2.
The code multiplexer 4f multiplexes the LSP code 2, pitch-lag code 2, algebraic code 2 and gain code 2, which are output from the quantizers 4b2, 4c2, 4d2 and 4e2, respectively, thereby creating a speech code based upon encoding scheme 2, and sends this speech code to the transmission line from an output terminal #2.
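The dequantize-then-requantize structure of the code converters 4b to 4e can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the scalar tables `EXAMPLE_TABLES_1` and `EXAMPLE_TABLES_2`, the nearest-value search, and the helper names do not come from either encoding scheme, and real schemes quantize LSP parameters and gains with vector codebooks and prediction rather than a single scalar table.

```python
from typing import Dict, Sequence

# Hypothetical scalar quantization tables (illustrative values only).
EXAMPLE_TABLES_1: Dict[str, Sequence[float]] = {
    "gain": [0.0, 0.25, 0.5, 0.75, 1.0],
    "pitch_lag": [20.0, 40.0, 60.0, 80.0],
}
EXAMPLE_TABLES_2: Dict[str, Sequence[float]] = {
    "gain": [0.0, 0.4, 0.8, 1.1],
    "pitch_lag": [25.0, 50.0, 75.0],
}


def dequantize(code: int, table: Sequence[float]) -> float:
    """Dequantizer (4b1, 4c1, ...): map a scheme-1 index to its value."""
    return table[code]


def quantize(value: float, table: Sequence[float]) -> int:
    """Quantizer (4b2, 4c2, ...): pick the index of the nearest table entry."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


def convert_code(code1: int, table1: Sequence[float],
                 table2: Sequence[float]) -> int:
    """One code converter: scheme-1 code -> dequantized value -> scheme-2 code."""
    return quantize(dequantize(code1, table1), table2)


def transcode_frame(codes1: Dict[str, int],
                    tables1: Dict[str, Sequence[float]],
                    tables2: Dict[str, Sequence[float]]) -> Dict[str, int]:
    """Convert every demultiplexed parameter code independently."""
    return {name: convert_code(code, tables1[name], tables2[name])
            for name, code in codes1.items()}


if __name__ == "__main__":
    print(transcode_frame({"gain": 3, "pitch_lag": 1},
                          EXAMPLE_TABLES_1, EXAMPLE_TABLES_2))
```

No speech waveform is reconstructed at any point in this sketch, which is the property that distinguishes prior art 2 from the tandem connection.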
In the tandem connection scheme (prior art 1) illustrated in FIG. 15, the input to the re-encoding stage is decoded speech obtained by decoding a speech code that has been encoded according to encoding scheme 1; this decoded speech is encoded again and then decoded. Because the speech parameters for re-encoding are extracted from decoded speech whose information content has already been reduced greatly in comparison with the original input speech signal (i.e., by speech-information compression), the speech code obtained thereby is not necessarily the optimum speech code. By contrast, in accordance with the transcoding apparatus according to prior art 2 shown in FIG. 16, the speech code of encoding scheme 1 is transcoded to the speech code of encoding scheme 2 via the process of dequantization and quantization. As a result, it is possible to carry out speech transcoding with much less degradation in comparison with the tandem connection of prior art 1. An additional advantage is that since it is unnecessary to effect decoding into speech even once in order to perform the speech transcoding, there is little of the delay that is a problem with the conventional tandem connection.
Silence Compression
An actual speech communication system generally has a silence compression function for providing a further improvement in the efficiency of information transmission by making effective use of silence segments contained in speech. FIG. 18 is a conceptual view of a silence compression function. Human conversation includes silence segments such as quiet intervals or background-noise intervals that reside between speech activity segments. Because speech information need not be transmitted during silence segments, the communication channel can be utilized more effectively. This is the basic approach taken in silence compression. However, when a segment between speech activity intervals reconstructed on the receiving side becomes completely silent, an acoustically unnatural sensation is produced. Ordinarily, therefore, natural noise (so-called “comfort noise”) that will not give rise to an acoustically unnatural sensation is generated on the receiving side. In order to generate comfort noise that resembles the input signal, it is necessary to send comfort-noise information (referred to below as “CN information”) from the transmitting side. However, the quantity of information in CN information is small in comparison with speech. Moreover, since the nature of silence segments varies only gradually, CN information need not be transmitted at all times. Since this makes it possible to greatly reduce the quantity of transmitted information in comparison with the information in speech activity segments, the overall transmission efficiency of the communication channel can be improved. Such a silence compression function is implemented by a VAD (Voice Activity Detection) unit for detecting speech activity and silence segments, a DTX (Discontinuous Transmission) unit for controlling the generation and transmission of CN information on the transmitting side, and a CNG (Comfort Noise Generator) for generating comfort noise on the receiving side.
The principle of operation of the silence compression function will now be described with reference to FIG. 19.
On the transmitting side, an input signal that has been divided up into fixed-length frames (e.g., 80 samples/10 ms) is applied to a VAD 5a, which detects speech activity segments. The VAD 5a outputs a decision signal vad_flag, which is logical “1” when a speech activity segment is detected and logical “0” when a silence segment is detected. In case of a speech activity segment (vad_flag=1), switches SW1 to SW4 are all switched over to a speech side so that a speech encoder 5b on the transmitting side and a speech decoder 6a on the receiving side respectively encode and decode the speech signal in accordance with an ordinary speech encoding scheme (e.g., G.729A or AMR). In case of a silence segment (vad_flag=0), on the other hand, switches SW1 to SW4 are all switched over to a silence side so that a silence encoder 5c on the transmitting side executes silence-signal encoding processing, i.e., control for generating and transmitting CN information, under the control of a DTX unit (not shown), and so that a silence decoder 6b on the receiving side executes decoding processing, i.e., generates comfort noise, under the control of a CNG unit (not shown).
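The frame-by-frame switching driven by vad_flag can be sketched as follows. The energy-threshold decision and the names `ENERGY_THRESHOLD`, `vad_flag` and `select_encoder` are assumptions chosen only to make the switching visible; the VAD algorithms actually used with G.729A and AMR are considerably more elaborate than a single energy comparison.

```python
from typing import Sequence

FRAME_LENGTH = 80          # 80 samples per 10 ms frame, as in the example above
ENERGY_THRESHOLD = 1e-3    # hypothetical decision threshold (assumption)


def vad_flag(frame: Sequence[float],
             threshold: float = ENERGY_THRESHOLD) -> int:
    """Return 1 for a speech activity frame, 0 for a silence frame.

    Toy decision: mean-square energy compared against a fixed threshold.
    """
    energy = sum(x * x for x in frame) / len(frame)
    return 1 if energy > threshold else 0


def select_encoder(frame: Sequence[float]) -> str:
    """Mimic switches SW1/SW2: route the frame to the speech or silence side."""
    return "speech_encoder" if vad_flag(frame) == 1 else "silence_encoder"


if __name__ == "__main__":
    loud = [0.5] * FRAME_LENGTH
    quiet = [0.0] * FRAME_LENGTH
    print(select_encoder(loud), select_encoder(quiet))
```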
The operation of the silence encoder 5c and silence decoder 6b will be described next. FIG. 20 is a block diagram of this encoder and decoder, and FIGS. 21A, 21B are flowcharts of processing executed by the silence encoder 5c and silence decoder 6b, respectively.
A CN information generator 7a analyzes the input signal frame by frame and calculates a CN parameter for generation of comfort noise in a CNG unit 8a on the receiving side (step S101). Usually, approximate shape information of the frequency characteristic and amplitude information are used as CN parameters. A DTX controller 7b controls a switch 7c so as to control, frame by frame, whether or not the obtained CN information is to be transmitted to the receiving side (step S102). Methods of control include a method of exercising control adaptively in accordance with the nature of a signal and a method of exercising control periodically, i.e., at regular intervals. If transmission of the CN information is necessary (“YES” at step S102), the CN parameter is input to a CN quantizer 7d, which quantizes the CN parameter, generates CN code (step S103) and transmits the code to the receiving side as channel data (step S104). A frame in which CN information is transmitted shall be referred to as an “SID (Silence Insertion Descriptor) frame” below. Frames other than these frames are frames (“non-transmit frames”) in which CN information is not transmitted. If a “NO” decision is rendered at step S102, nothing is transmitted in the frame (step S105).
The CNG unit 8a on the receiving side generates comfort noise based upon the transmitted CN code. More specifically, the CN code transmitted from the transmitting side is input to a CN dequantizer 8b, which dequantizes this CN code to obtain the CN parameter (S111). The CNG unit 8a then uses this CN parameter to generate comfort noise (S112). In the case of a non-transmit frame, namely a frame in which a CN parameter does not arrive, comfort noise is generated using the CN parameter that was received last (S113).
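Steps S101 to S105 and S111 to S113 can be traced with a small sketch. It makes several assumptions that are not in the text: the only CN parameter is a logarithmic frame power, the DTX decision is purely periodic with a hypothetical interval `PERIOD`, the quantizer is a uniform one with a hypothetical step `STEP_DB`, and a non-transmit frame is represented by `None` on the channel.

```python
import math
from typing import List, Optional, Sequence

STEP_DB = 2.0   # hypothetical uniform quantizer step (assumption)
PERIOD = 8      # hypothetical periodic DTX interval in frames (assumption)


def cn_parameter(frame: Sequence[float]) -> float:
    """S101: analyze one frame; here only a log power in dB is computed."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-10)


def silence_encode(frames: Sequence[Sequence[float]]) -> List[Optional[int]]:
    """Transmitting side: emit a CN code in SID frames, None otherwise."""
    channel: List[Optional[int]] = []
    for n, frame in enumerate(frames):
        param = cn_parameter(frame)                  # S101
        if n % PERIOD == 0:                          # S102: transmit this frame?
            channel.append(round(param / STEP_DB))   # S103 quantize, S104 send
        else:
            channel.append(None)                     # S105: nothing transmitted
    return channel


def silence_decode(channel: Sequence[Optional[int]]) -> List[Optional[float]]:
    """Receiving side: return the CN parameter used for each frame."""
    last: Optional[float] = None
    used: List[Optional[float]] = []
    for code in channel:
        if code is not None:
            last = code * STEP_DB    # S111: dequantize the received CN code
        used.append(last)            # S112, or S113 when reusing the last one
    return used


if __name__ == "__main__":
    noise_frames = [[0.01] * 80 for _ in range(10)]
    codes = silence_encode(noise_frames)
    print(codes)
    print(silence_decode(codes))
```

Only two of the ten frames in the usage example carry a CN code, which is the source of the transmission saving described above.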
Thus, in an actual speech communication system, a silence segment in a conversation is discriminated and information for generating acoustically natural noise on the receiving side is transmitted intermittently in this silence segment, thereby making it possible to further improve transmission efficiency. A silence compression function of this kind is adopted in the next-generation cellular telephone network and VoIP network mentioned earlier, in which schemes that differ depending upon the system are employed.
The silence compression functions used in G.729A (VoIP) and AMR (next-generation mobile telephone), which are typical encoding schemes, will now be described.
TABLE 1: COMPARISON OF G.729A AND AMR SILENCE COMPRESSION FUNCTIONS

| ITEM | G.729A | AMR |
|---|---|---|
| PROCESSED FRAME LENGTH | 10 ms (80 SAMPLES) | 20 ms (160 SAMPLES) |
| TRANSMITTED CN INFORMATION | LPC COEFFICIENTS; FRAME SIGNAL POWER | LPC COEFFICIENTS; FRAME SIGNAL POWER |
| METHOD OF GENERATING CN INFORMATION: LPC INFORMATION | AVERAGE LPC COEFFICIENT OVER LAST 6 FRAMES OR LPC COEFFICIENT OF PRESENT FRAME | AVERAGE LPC COEFFICIENT OVER LAST 8 FRAMES (CALCULATED IN LSP DOMAIN) |
| METHOD OF GENERATING CN INFORMATION: FRAME SIGNAL POWER INFORMATION | AVERAGE LOGARITHMIC POWER OVER LAST 0–3 FRAMES (LSP RESIDUAL-SIGNAL DOMAIN) | AVERAGE LOGARITHMIC POWER OVER LAST 8 FRAMES (INPUT SIGNAL DOMAIN) |
| BIT ASSIGNMENT OF CN CODE: LPC INFORMATION | 10 BITS (QUANTIZATION IN LSP DOMAIN) | 29 BITS (QUANTIZATION IN LSP DOMAIN) |
| BIT ASSIGNMENT OF CN CODE: FRAME SIGNAL POWER | 5 BITS | 6 BITS |
| BIT ASSIGNMENT OF CN CODE: TOTAL | 15 BITS | 35 BITS |
| DTX CONTROL METHOD | ADAPTIVE CONTROL (TRANSMISSION AT IRREGULAR INTERVALS IN ACCORDANCE WITH SILENCE SIGNAL) | FIXED CONTROL (TRANSMISSION PERIODICALLY EVERY 8 FRAMES); HANGOVER CONTROL |
LPC coefficients (linear prediction coefficients) and frame signal power are used as CN information in both G.729A and AMR. An LPC coefficient is a parameter that represents the approximate shape of the frequency characteristic of the input signal, and frame signal power is a parameter that represents the amplitude characteristic of the input signal. These parameters are obtained by analyzing the input signal frame by frame. A method of generating the CN information in G.729A and AMR will be described.
In G.729A, the LPC information is found as an average value of LPC coefficients over the last six frames inclusive of the present frame. Either the average value obtained or the LPC coefficient of the present frame is eventually used as the CN information, taking into account signal fluctuation in the vicinity of the SID frame. The decision as to which should be chosen is made by measuring distortion between the average LPC coefficient and the present LPC coefficient. If signal fluctuation (a large distortion) has been detected, the LPC coefficient of the present frame is used. The frame power information is found as a value obtained by averaging logarithmic power of an LPC prediction residual signal over 0 to 3 frames inclusive of the present frame. Here the LPC prediction residual signal is a signal obtained by passing the input signal through an LPC inverse filter frame by frame.
In AMR, the LPC information is found as an average value of LPC coefficients over the last eight frames inclusive of the present frame. The calculation of the average value is performed in a domain in which LPC coefficients have been converted to LSP parameters. Here LSP is a frequency-domain parameter that can be converted to and from LPC coefficients. The frame-signal power information is found as a value obtained by averaging logarithmic power of the input signal over the last eight frames (inclusive of the present frame).
Thus, LPC information and frame-signal power information are used as the CN information in both the G.729A and AMR schemes, though the methods of generation (calculation) differ.
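The two ways of forming the LPC part of the CN information can be contrasted in a short sketch. The squared-error distortion measure, the threshold parameter, and the function names are assumptions for illustration only; the distortion measure and threshold actually used in G.729A are not given in the text, and the LPC-to-LSP conversion is omitted.

```python
from collections import deque
from typing import Deque, List, Sequence

G729A_WINDOW = 6   # frames averaged in G.729A, inclusive of the present frame
AMR_WINDOW = 8     # frames averaged in AMR, inclusive of the present frame


def mean_vector(history: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of the coefficient vectors in the window."""
    n = len(history)
    return [sum(column) / n for column in zip(*history)]


def g729a_style_lpc_info(history: Sequence[Sequence[float]],
                         threshold: float) -> List[float]:
    """Use the window average unless the present frame deviates strongly.

    The squared-error distortion and `threshold` are illustrative assumptions.
    """
    average = mean_vector(history)
    present = list(history[-1])
    distortion = sum((a - p) ** 2 for a, p in zip(average, present))
    return present if distortion > threshold else average


def amr_style_lsp_info(lsp_history: Sequence[Sequence[float]]) -> List[float]:
    """Always use the window average, computed on LSP parameters."""
    return mean_vector(lsp_history)


if __name__ == "__main__":
    window: Deque[List[float]] = deque(maxlen=G729A_WINDOW)
    for _ in range(5):
        window.append([0.0, 0.0])
    window.append([6.0, 0.0])          # sudden change in the present frame
    print(g729a_style_lpc_info(window, threshold=1.0))
```

A bounded `deque` is used so that the oldest frame drops out automatically as each new frame arrives.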
The CN information is quantized to CN code and the CN code is transmitted to a decoder. The bit assignment of the CN code in the G.729A and AMR schemes is indicated in Table 1. In G.729A, the LPC information is quantized at 10 bits and the frame power information is quantized at five bits. In the AMR scheme, on the other hand, the LPC information is quantized at 29 bits and the frame power information is quantized at six bits. Here the LPC information is converted to an LSP parameter and quantized. Thus, bit assignment for quantization in the G.729A scheme differs from that in the AMR scheme. FIGS. 22A and 22B are diagrams illustrating the structure of silence code (CN code) in the G.729A and AMR schemes, respectively.
In G.729A, the size of silence code is 15 bits, as shown in FIG. 22A, and is composed of LSP code I_LSPg (10 bits) and power code I_POWg (5 bits). Each code is constituted by an index (element number) of a codebook possessed by a G.729A quantizer. The details are as follows: (1) The LSP code I_LSPg is composed of codes LG1 (1 bit), LG2 (5 bits) and LG3 (4 bits), in which LG1 is prediction-coefficient changeover information of an LSP quantizer, and LG2, LG3 are indices of codebooks CBG1, CBG2 of the LSP quantizer, and (2) the power code I_POWg is an index of a codebook CBG3 of a power quantizer.
In the AMR scheme, the size of silence code is 35 bits, as shown in FIG. 22B, and is composed of LSP code I_LSPa (29 bits) and power code I_POWa (6 bits). The details are as follows: (1) The LSP code I_LSPa is composed of codes LA1 (3 bits), LA2 (8 bits), LA3 (9 bits) and LA4 (9 bits), in which the codes are indices of codebooks GBA1, GBA2, GBA3, GBA4 of an LSP quantizer, and (2) the power code I_POWa is an index of a codebook GBA5 of a power quantizer.
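The field widths of the two silence codes can be checked with a small bit-packing sketch. The field names and widths are taken from the description above; the ordering of the fields within the packed word (first-listed field in the most significant bits) and the helper names `pack` and `unpack` are assumptions, since the text does not state the transmission order.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, int]   # (field name, width in bits)

G729A_SID_FIELDS: List[Field] = [
    ("LG1", 1), ("LG2", 5), ("LG3", 4),   # LSP code I_LSPg, 10 bits
    ("I_POWg", 5),                         # power code, 5 bits
]
AMR_SID_FIELDS: List[Field] = [
    ("LA1", 3), ("LA2", 8), ("LA3", 9), ("LA4", 9),   # LSP code I_LSPa, 29 bits
    ("I_POWa", 6),                                     # power code, 6 bits
]


def total_bits(fields: List[Field]) -> int:
    """Sum of all field widths."""
    return sum(width for _, width in fields)


def pack(fields: List[Field], values: Dict[str, int]) -> int:
    """Pack codebook indices into one integer, first field most significant."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields: List[Field], word: int) -> Dict[str, int]:
    """Inverse of pack: split the integer back into named indices."""
    values: Dict[str, int] = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


if __name__ == "__main__":
    print(total_bits(G729A_SID_FIELDS), total_bits(AMR_SID_FIELDS))
    code = pack(G729A_SID_FIELDS, {"LG1": 1, "LG2": 17, "LG3": 9, "I_POWg": 30})
    print(unpack(G729A_SID_FIELDS, code))
```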
DTX Control
A DTX control method will be described next. FIG. 23 illustrates the temporal flow of DTX control in G.729A, and FIGS. 24, 25 illustrate the temporal flow of DTX control in AMR.
When a VAD unit detects a change from a speech activity segment (VAD_flag=1) to a silence segment (VAD_flag=0) in the G.729A scheme, the first frame in the silence segment is set as an SID frame. The SID frame is created by generation of CN information and quantization of CN information by the above-described method and is transmitted to the receiving side. In the silence segment, signal fluctuation is observed frame by frame. Only a frame in which fluctuation has been detected is set as an SID frame, and CN information is transmitted again in that SID frame. A frame for which fluctuation has not been detected is set as a non-transmit frame and no information is transmitted in this frame. A limitation is imposed according to which at least two non-transmit frames are included between SID frames. Fluctuation is detected by measuring the amount of change in CN information between the present frame and the SID frame transmitted last. In the G.729A scheme, as mentioned above, the setting of an SID frame is performed adaptively with respect to a fluctuation in the silence signal.
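The adaptive rule described above can be sketched as a small state machine. The scalar CN value, the absolute-difference fluctuation test, the threshold parameter, and the frame labels are assumptions for illustration; only the three rules themselves (first silence frame is an SID frame, a new SID frame on fluctuation, at least two non-transmit frames between SID frames) come from the text.

```python
from typing import List, Optional, Sequence

MIN_NO_TX_BETWEEN_SID = 2   # at least two non-transmit frames between SID frames


def g729a_style_dtx(vad_flags: Sequence[int],
                    cn_values: Sequence[float],
                    threshold: float) -> List[str]:
    """Label each frame as SPEECH, SID or NO_TX.

    `cn_values` holds one scalar CN value per frame; treating CN information
    as a scalar and detecting fluctuation by |difference| > threshold are
    simplifying assumptions.
    """
    types: List[str] = []
    last_sid_value: Optional[float] = None
    no_tx_since_sid = 0
    prev_vad = 1   # assumption: a stream that starts in silence sends an SID
    for vad, value in zip(vad_flags, cn_values):
        if vad == 1:
            types.append("SPEECH")
        elif prev_vad == 1:
            # First frame of the silence segment is always an SID frame.
            types.append("SID")
            last_sid_value, no_tx_since_sid = value, 0
        elif (no_tx_since_sid >= MIN_NO_TX_BETWEEN_SID
              and last_sid_value is not None
              and abs(value - last_sid_value) > threshold):
            # Fluctuation relative to the last transmitted SID frame.
            types.append("SID")
            last_sid_value, no_tx_since_sid = value, 0
        else:
            types.append("NO_TX")
            no_tx_since_sid += 1
        prev_vad = vad
    return types


if __name__ == "__main__":
    print(g729a_style_dtx([1, 0, 0, 0, 0, 0],
                          [0.0, 10.0, 20.0, 20.0, 20.0, 20.0],
                          threshold=5.0))
```

In the usage example the change in CN value occurs at the third frame, but the new SID frame is delayed until two non-transmit frames have passed.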
DTX control in the AMR scheme will be described with reference to FIGS. 24 and 25. In the AMR scheme, the method of setting SID frames is such that basically an SID frame is set periodically every eight frames, as shown in FIG. 24, unlike the adaptive control method in the G.729A scheme. However, hangover control is carried out, as shown in FIG. 25, at a point where there is a change to a silence segment following a long speech activity segment. More specifically, seven frames following the point of change are set as a speech activity segment regardless of the change to the silence segment (VAD_flag=0), and the usual speech encoding processing is executed with regard to these frames. This interval of seven frames is referred to as “hangover”. Hangover is set in a case where the number of frames (P-FRM) that follow the SID frame that was set last is 23 frames or greater. As a result of setting hangover, CN information at the point of change (the point at which the silence segment starts) is prevented from being found from a characteristic parameter of the speech activity segment (the last eight frames), enabling speech quality at the point of change from speech activity to silence to be improved.
The eighth frame is then set as the first SID frame (SID_FIRST frame). In the SID_FIRST frame, however, CN information is not transmitted. The reason for this is that the CN information can be generated from a decoded signal in the hangover interval by a decoder on the receiving side. The third frame after the SID_FIRST frame is set as an SID_UPDATE frame and here CN information is transmitted for the first time. In the silence segment from this point onward, an SID_UPDATE frame is set every eight frames. The SID_UPDATE frame is created by the above-described method and is transmitted to the receiving side. Frames other than these are set as non-transmit frames and CN information is not transmitted in these non-transmit frames.
In a case where the number of frames that follow the SID frame that was set last is less than 23 frames, as shown in FIG. 24, hangover control is not carried out. In this case, the frame at the point of change (the first frame of the silence segment) is set as SID_UPDATE. However, CN information is not calculated and the CN information transmitted last is transmitted again in this frame. As described above, DTX control in the AMR scheme transmits CN information under fixed control without performing adaptive control of the G.729A type, and therefore hangover control is exercised as appropriate taking into consideration the point at which the change from speech activity to silence occurs.
As described above, the basic principle of the silence compression function according to the G.729A scheme is the same as that of the AMR scheme, but the generation and quantization of CN information and the DTX control method differ between the two schemes.
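The fixed control with hangover described for the AMR scheme can be sketched in the same state-machine style. The frame labels, the function name, and the choice to treat a stream with no earlier SID frame as having reached the 23-frame limit are assumptions; the frame counts (seven hangover frames, SID_UPDATE on the third frame after SID_FIRST, then every eight frames, and the 23-frame condition) follow the description above.

```python
from typing import List, Sequence

HANGOVER_FRAMES = 7        # frames still coded as speech after the change point
NO_TX_AFTER_FIRST = 2      # SID_UPDATE is the third frame after SID_FIRST
NO_TX_AFTER_UPDATE = 7     # SID_UPDATE then recurs every eight frames
ELAPSED_LIMIT = 23         # hangover applies if >= 23 frames since last SID


def amr_style_dtx(vad_flags: Sequence[int],
                  frames_since_last_sid: int = ELAPSED_LIMIT) -> List[str]:
    """Label each frame as SPEECH, SID_FIRST, SID_UPDATE or NO_TX.

    Hangover frames are labeled SPEECH because they are speech-encoded.
    The default for `frames_since_last_sid` (no earlier SID frame counts
    as a long gap) is an assumption.
    """
    types: List[str] = []
    since_sid = frames_since_last_sid
    prev_vad = 1
    hangover_left = 0
    first_pending = False
    countdown = 0   # non-transmit frames remaining before the next SID_UPDATE
    for vad in vad_flags:
        if vad == 1:
            frame = "SPEECH"
        else:
            if prev_vad == 1:
                # Change point: decide whether hangover control is applied.
                use_hangover = since_sid >= ELAPSED_LIMIT
                hangover_left = HANGOVER_FRAMES if use_hangover else 0
                first_pending = use_hangover
                countdown = 0
            if hangover_left > 0:
                frame = "SPEECH"
                hangover_left -= 1
            elif first_pending:
                frame = "SID_FIRST"
                first_pending = False
                countdown = NO_TX_AFTER_FIRST
            elif countdown == 0:
                frame = "SID_UPDATE"
                countdown = NO_TX_AFTER_UPDATE
            else:
                frame = "NO_TX"
                countdown -= 1
        since_sid = 0 if frame.startswith("SID") else since_sid + 1
        types.append(frame)
        prev_vad = vad
    return types


if __name__ == "__main__":
    print(amr_style_dtx([1, 1] + [0] * 20))
    print(amr_style_dtx([1, 1, 0, 0], frames_since_last_sid=5))
```

The second call shows the case without hangover, in which the first silence frame is immediately an SID_UPDATE frame; retransmission of the previously sent CN information in that frame is not modeled here.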
FIG. 26 is a block diagram for a case where each of the communication systems has the silence compression function according to prior art 1. In the case of the tandem connection, the structure is such that speech code according to encoding scheme 1 is decoded to a decoded signal and the decoded signal is encoded again in accordance with encoding scheme 2, as described above. In a case where each system has the silence compression function, as shown in FIG. 26, a VAD unit 3c in the speech transcoder 3 renders a speech activity/silence segment decision with regard to the decoded signal obtained by encoding/decoding (information compression) performed according to encoding scheme 1. As a consequence, there are instances where the precision of the speech activity/silence segment decision by the VAD unit 3c declines and problems arise such as muted speech at the beginning of an utterance, which is caused by an erroneous decision. The end result is a decline in speech quality. Though a conceivable countermeasure is to process all segments as speech activity segments in encoding scheme 2, this approach will not allow optimum silence compression to be performed and the originally intended effect of improving transmission efficiency by silence compression will be lost. Furthermore, in a silence segment, CN information according to encoding scheme 2 is obtained from comfort noise generated by the decoder 3a of encoding scheme 1, and this is not necessarily the best CN information for generating noise that resembles the input signal.
Further, though prior art 2 is a speech transcoding method that is superior to prior art 1 (the tandem connection) in terms of diminished degradation of speech quality and transmission delay, a problem with this scheme is that it does not take the silence compression function into consideration. In other words, since prior art 2 assumes that the entered speech code is always code obtained by encoding a speech activity segment, a normal transcoding operation cannot be carried out when an SID frame or non-transmit frame is generated by the silence compression function.