The present invention relates generally to processing of telecommunication signals. More particularly, the present invention relates to a method and apparatus for transcoding a bitstream encoded by a first voice speech coding format into a bitstream encoded by a second variable-rate voice coding format. Merely by way of example, the invention has been applied to variable-rate voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
Telecommunication techniques have progressed through the years. One of the major desires of speech coding development is high quality output speech at a low average data rate. One approach is to employ a variable bit-rate scheme, whereby the transmission rate is not only determined by the network traffic but also from the characteristics of the input speech signal. For example, when the signal is highly voiced, a high bit rate may be chosen; if the signal is weak, a low bit rate is chosen; and if the signal has mostly silence or background noise, a lower bit rate is chosen. This often provides efficient allocation of the available bandwidth, without sacrificing output voice quality. Such variable-rate coders include the TIA IS-127 Enhanced Variable Rate Codec (EVRC), and 3rd generation partnership project 2 (3GPP2) Selectable Mode Vocoder (SMV). These coders use Rate Set 1 of the Code Division Multiple Access (CDMA) communication standards IS-95 and cdma2000, which include rates of 8.55 kbit/s (Rate 1 or full Rate), 4.0 kbit/s (half-rate), 2.0 kbit/s (quarter-rate) and 0.8 kbit/s (eighth rate). SMV selects the bit rate based on the input speech characteristics and operates in one of six network controlled modes, which limit the bit rate during high traffic. Depending on the mode of operation, different thresholds may be set to determine the rate usage percentages.
To accurately decide the desired transmission rate, and obtain high quality output speech at that rate, input speech frames are categorized into various classes. For example, in SMV, these classes include silence, unvoiced, onset, plosive, non-stationary voiced and stationary voiced speech. It is known that certain coding techniques are better suited for certain classes of sounds. Also, some types of sounds, for example, voice onsets or unvoiced-to-voiced transition regions, have higher perceptual significance and thus generally require higher coding accuracy than other classes of sounds, such as unvoiced speech. Thus, the speech frame classification may be used, not only to decide the most efficient transmission rate, but also the best-suited coding algorithm.
Accurate classification of input speech frames is desired to fully exploit the signal redundancies and perceptual importance. Typical frame classification techniques include voice activity detection, measuring the amount of noise in the signal, measuring the level of voicing, detecting speech onsets, and measuring the energy in a number of frequency bands. These measures generally require the calculation of numerous parameters, such as maximum correlation values, line spectral frequencies, and frequency transformations.
While coders such as SMV achieve much better quality at lower average data rate than existing speech codecs at similar bit rates, the frame classification and rate determination algorithms are complex. In the case of a tandem connection of two speech vocoders, however, many of the measurements performed for frame classification have already been calculated in the source codec. This can be capitalized on in a transcoding framework. In transcoding from the bitstream format of one CELP codec to the bitstream format of another CELP codec, rather than fully decoding to PCM and re-encoding the speech signal, smart interpolation methods may be applied directly in the CELP parameter space. Hence the parameters, such as pitch lag, pitch gain, fixed codebook gain, line spectral frequencies and the source codec bit rate are available to the destination codec. This allows frame classification and rate determination of the destination voice codec to be performed in a fast manner.
The simplest method of transcoding is a brute-force approach called tandem transcoding, shown in FIG. 1. This method performs a full decode 110 of the incoming compressed bits to produce synthesized speech 112. The synthesized speech is then encoded 114 for the target standard. This method is undesirable because of the huge amount of computation performed in re-encoding the signal, as well as quality degradations introduced by pre- and post-filtering of the speech waveform, and the potential delays introduced by the look-ahead-requirements of the encoder.
Methods for “smart” transcoding similar to that illustrated in FIG. 2 have appeared in the literature. These methods essentially reconstruct the speech signal and then perform significant work to extract the various CELP parameters such as line spectral frequencies and pitch. That is, these methods still operate in the speech signal space. In particular, the excitation signal which has already been optimally matched to the original speech by the far-end source encoder (encoder that produced the compressed speech according to a compression format) is often only used for the generation of the synthesized speech. The synthesized speech is then used to compute a new optimal excitation. Due to the requirement of incorporating impulse response filtering operations in closed-loop searches of the excitation parameters, this becomes a very computationally intensive operation.
Further, these transcoding methods do not cover the transcoding between variable-rate voice coders which determine the bit rate based on the characteristics of the input speech and, in some cases, external commands. During the transcoding process, the frame classification and rate decision of the destination voice codec in transcoding are still computed through the speech signal domain. The transcoder thus includes the equivalent amount of computational resources as the destination codec to classify frame types and to determine the bit rates. The smart transcoding of previous methods may lose part of their computational advantage, as the classification algorithms require parameters from intermediate stages of functions that have been omitted. For example, recalculation of the line spectral frequencies is often not performed in transcoding, however, the LPC prediction gain, LPC prediction error, autocorrelation function and reflection coefficients are often required in the classification and rate determination process.
From the above, it is seen that improved telecommunication techniques are desired.