1. Field of the Invention
The present invention generally relates to speech processing and coding and, more particularly, to transcoding of coded speech signals.
2. Background Art
The explosive growth of cellular communications has brought many challenges for the expansion of cellular networks, in particular the need to connect diverse types of cellular devices effectively. More specifically, because different cellular devices may use different standards to encode, compress or packetize speech, a transcoding procedure has to be performed in order for a meaningful connection between cellular devices to be achieved. Typically, voice data encoded according to one standard by a transmitting participant communicating in one network has to be converted to the standard used by the receiving participant communicating under the guidelines of another network. For example, a transmitting participant's speech may be encoded according to EVRC specifications while the receiving participant uses AMR. In order for the data from the transmitting participant to be understood by the receiving participant, the bit-stream from the transmitting participant has to be converted from EVRC format to AMR format.
In conventional transcoding approaches, encoded data from the transmitting participant is decoded according to the coding method used by the transmitting participant. The decoded data is then re-encoded in accordance with the coding method used by the receiving participant. In the re-encoded form, the data is transmitted to the receiving participant. Known transcoding schemes, however, suffer from numerous serious inadequacies. For example, the decoding and re-encoding of the speech signal (a “tandem” process) reduces the quality of the speech. In particular, the tandem operation of the post-filter, common in low bit-rate speech decoders, can generate objectionable spectral distortion and degrade the speech quality significantly.
Another drawback of known transcoding schemes is the undesirable delay resulting from the re-encoding step. Typically, re-encoding of the decoded bit-stream requires that the speech signal characteristics be evaluated. As such, parameters including energy, spectral characteristics and pitch, for example, have to be extracted from the bit-stream and used to re-encode the signal. Often, such evaluation is also performed on a look-ahead portion of the signal, which increases the delay. Furthermore, in addition to delay, the need to extract these parameters as part of the re-encoding step can introduce inaccuracy in the extraction of the parameters and greater complexity to the system.
Today, a specific problem arises in GSM (Global System for Mobile Communications) when transcoding between EFR (Enhanced Full Rate) coded speech and AMR (Adaptive Multi-Rate) coded speech at 12.2 Kbps where Silence Insertion Descriptor (SID) frames are involved. By way of background, when active periods of speech are detected by a voice activity detector (VAD), EFR and AMR (in 12.2 Kbps mode) use 12.2 Kbps to code the active speech. However, when inactive periods of speech are detected by the VAD, EFR and AMR encoders can choose either to send an information update called a silence insertion descriptor (SID) to the decoder, or to send nothing. This technique is named discontinuous transmission (DTX). Completely muting the output during inactive speech segments would create sudden drops in the signal energy level, which are perceptually unpleasant. Therefore, in order to fill these inactive speech segments, a description of the background noise (i.e. the SID) is sent from the EFR or AMR encoder to the decoder. Using the SID, the decoder generates an output signal that is perceptually equivalent to the background noise at the encoder. Such a signal is commonly called comfort noise, and it is generated by a comfort noise generator (CNG) within the decoder.
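By way of illustration only, the DTX behavior described above can be sketched as a per-frame decision driven by the VAD output. The names FrameType and dtx_frame_types, and the sid_update_interval parameter with its default value, are hypothetical and chosen for this sketch; they are not taken from the EFR or AMR specifications, and details such as hangover periods are deliberately omitted.

```python
from enum import Enum


class FrameType(Enum):
    SPEECH = "speech"    # active speech, coded at 12.2 Kbps
    SID = "sid"          # silence insertion descriptor (background noise update)
    NO_DATA = "no_data"  # nothing is transmitted for this frame


def dtx_frame_types(vad_flags, sid_update_interval=8):
    """Map per-frame VAD decisions to the type of frame the encoder sends.

    Assumption for this sketch: a SID is sent at the start of each inactive
    period and then every `sid_update_interval` inactive frames; all other
    inactive frames are not transmitted.
    """
    frame_types = []
    inactive_run = 0  # number of consecutive inactive frames seen so far
    for active in vad_flags:
        if active:
            inactive_run = 0
            frame_types.append(FrameType.SPEECH)
            continue
        if inactive_run % sid_update_interval == 0:
            frame_types.append(FrameType.SID)
        else:
            frame_types.append(FrameType.NO_DATA)
        inactive_run += 1
    return frame_types
```

In such a sketch, the decoder would run its comfort noise generator for every SID and NO_DATA frame, using the most recently received SID as the noise description.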
Although EFR and AMR bitstreams for coded active speech at 12.2 Kbps are similar and compatible in all aspects, the two bitstreams differ for the SID frames, which represent inactive speech. For example, the AMR specification defines a 39-bit SID frame for 2G and 3G networks, whereas the EFR specification defines a 244-bit SID frame for 2G networks and a 43-bit SID frame for 3G networks. The undesirable effects of this incompatibility are explained below with reference to FIG. 1.
FIG. 1 illustrates conventional communication system 100, which includes first gateway (or GW1) 120 and second gateway (or GW2) 130. The gateways may operate in a Tandem Free Operation (or TFO) network, which is described in 3GPP TS 28.062 V6.3.0 (2006-09), entitled “Inband Tandem Free Operation (TFO) of Speech Codecs,” which is hereby incorporated by reference in its entirety in the present application. Communication system 100 also includes first mobile codec 110 and second mobile codec 140 in communication via GW1 120 and GW2 130. In a TFO network, assuming first mobile codec 110 is operating in EFR 12.2 Kbps mode, the EFR 12.2 Kbps encoder generates a coded-speech input bitstream 112, which is transmitted by first mobile codec 110 to GW1 120. Within GW1 120, EFR 12.2 Kbps decoder 122 decodes stream in 112 and generates decoded speech 123, which is provided to G.711 encoder 126 to generate G.711 encoded speech 127. Bit stealing module 124 receives G.711 encoded speech 127 and also receives stream in 112 from first mobile codec 110. Bit stealing module 124 alters G.711 encoded speech 127 by allocating a few bits from each sample of G.711 encoded speech 127, such as two bits per sample, for transmission of bits from stream in 112, thereby generating TDM speech+stream 125. TDM speech+stream 125, which includes both altered G.711 encoded speech 127 and bits from stream in 112, is transmitted from GW1 120 to GW2 130.
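The bit stealing operation described above can be illustrated with a minimal sketch. It assumes, for illustration only, that each G.711 sample is one 8-bit value and that the allocated bits are the two least significant bits of each sample; the function name steal_bits and its parameters are hypothetical and do not reproduce the framing defined in 3GPP TS 28.062.

```python
def steal_bits(g711_samples: bytes, stream_bits: list[int],
               bits_per_sample: int = 2) -> bytes:
    """Overwrite the low bits of each G.711 sample with coded-stream bits.

    Assumption for this sketch: the stolen bits are the least significant
    bits of each 8-bit sample, filled most significant bit first. When the
    stream bits run out, the remaining stolen positions are set to zero.
    """
    mask = (1 << bits_per_sample) - 1
    output = bytearray()
    position = 0  # index of the next stream bit to embed
    for sample in g711_samples:
        payload = 0
        for _ in range(bits_per_sample):
            bit = stream_bits[position] if position < len(stream_bits) else 0
            payload = (payload << 1) | bit
            position += 1
        # Keep the upper bits of the speech sample, replace the low bits.
        output.append((sample & ~mask & 0xFF) | payload)
    return bytes(output)
```

Because only the low-order bits of each sample are replaced, the altered samples remain decodable as G.711 speech with slightly reduced fidelity, which is why a gateway without TFO support can still use the signal.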
At the other end of the TDM network, upon receipt of TDM speech+stream 125 by GW2 130, the allocated bits, which represent stream in 112, are provided to stream extractor 134 to generate stream 111. The other bits, which represent the altered G.711 encoded speech 127, are decoded by G.711 decoder 128 to generate decoded G.711 speech 129, which is provided to AMR 12.2 Kbps encoder 132 for encoding according to AMR 12.2 Kbps specifications to generate stream 131. TFO switch 135 can choose to send either stream 131 or stream 111 as stream out 136, which is then decoded by the AMR 12.2 Kbps decoder in mobile codec 140. Sending stream 111 provides better speech quality at the output of mobile codec 140, since it does not involve the tandem decoding and encoding in GW1 120 and GW2 130. The advantage of this TFO configuration is that if GW2 130 does not implement the TFO functionality, it can still receive TDM speech+stream 125 and operate with mobile codec 140, which means that GW1 120 can communicate with TFO-enabled gateways as well as with gateways that do not support TFO. However, when SID frames are utilized, there is no compatibility between EFR 12.2 Kbps coded speech and AMR 12.2 Kbps coded speech. As a result, the only way for communication system 100 to perform properly is for TFO switch 135 to send stream 131 as stream out 136, which introduces tandem coding, as well as considerable delay and overhead, into communication system 100. Moreover, Transcoder Free Operation (or TrFO), in which stream in 112 is transmitted directly to stream out 136 over a packet network, cannot be used at all when SID frames are utilized. TrFO is described in 3GPP TS 23.153 V7.2.0 (2007-03), entitled “Out of Band Transcoder Control,” which is hereby incorporated by reference in its entirety in the present application.
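The receiving side described above, namely the stream extractor and the TFO switch, can be sketched in the same spirit. The sketch again assumes, for illustration only, that the allocated bits are the two least significant bits of each 8-bit sample; the function names extract_stream_bits and tfo_switch and the flag sid_frames_in_use are hypothetical and are not part of any specification.

```python
from typing import Optional


def extract_stream_bits(tdm_samples: bytes, bits_per_sample: int = 2) -> list[int]:
    """Recover the embedded coded-stream bits from the low bits of each sample.

    Assumption for this sketch: the bits were embedded in the least
    significant bits of each 8-bit sample, most significant bit first.
    """
    bits = []
    for sample in tdm_samples:
        for shift in range(bits_per_sample - 1, -1, -1):
            bits.append((sample >> shift) & 1)
    return bits


def tfo_switch(stream_111: Optional[bytes], stream_131: bytes,
               sid_frames_in_use: bool) -> bytes:
    """Select what is sent as the output stream in the conventional system.

    The extracted stream avoids tandem coding and is preferred, but in the
    conventional system it cannot be used when SID frames are utilized, so
    the re-encoded stream is sent instead.
    """
    if stream_111 is not None and not sid_frames_in_use:
        return stream_111
    return stream_131
```

The fallback branch corresponds to the shortcoming described above: whenever SID frames are in use, the conventional system is forced onto the re-encoded path and incurs tandem coding.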
Thus, there is an intense need in the art for an efficient transcoding method, and related system, which can overcome the shortcomings in the art relating to EFR 12.2 Kbps and AMR 12.2 Kbps coded speech.