The present invention relates generally to processing of telecommunication signals. More particularly, the invention provides a method and apparatus for classifying speech signals and determining a desired (e.g., efficient) transmission rate to code the speech signal with one encoding method when provided with the parameters of another encoding method. Merely by way of example, the invention has been applied to voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
An important goal of speech coding development is to provide high quality output speech at a low average data rate. To achieve this, one approach adapts the transmission rate based on the network traffic. This is the approach adopted by the Adaptive Multi-Rate (AMR) codec used for Global System for Mobile Communications (GSM). In AMR, one of eight data rates is selected by the network, and can be changed on a frame basis. Another approach is to employ a variable bit-rate scheme. Such a variable bit-rate scheme uses a transmission rate determined from the characteristics of the input speech signal. For example, when the signal is highly voiced, a high bit rate may be chosen, and when the signal is mostly silence or background noise, a low bit rate is chosen. This scheme often provides efficient allocation of the available bandwidth without sacrificing output voice quality. Such variable-rate coders include the TIA IS-127 Enhanced Variable Rate Codec (EVRC) and the 3rd Generation Partnership Project 2 (3GPP2) Selectable Mode Vocoder (SMV). These coders use Rate Set 1 of the Code Division Multiple Access (CDMA) communication standards IS-95 and cdma2000, which consists of the rates 8.55 kbit/s (Rate 1 or full rate), 4.0 kbit/s (half rate), 2.0 kbit/s (quarter rate) and 0.8 kbit/s (eighth rate). SMV combines both adaptive rate approaches by selecting the bit rate based on the input speech characteristics as well as operating in one of six network controlled modes, which limit the bit rate during high traffic. Depending on the mode of operation, different thresholds may be set to determine the rate usage percentages.
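As a worked illustration of how rate usage percentages translate into an average data rate, the following minimal sketch weights the four Rate Set 1 rates listed above by the fraction of frames coded at each rate. The usage fractions in the example are purely illustrative assumptions and are not taken from any codec specification or operating mode; the function and variable names are likewise hypothetical.

```python
# Rate Set 1 rates in kbit/s, as listed in the text above.
RATE_SET_1_KBPS = {"full": 8.55, "half": 4.0, "quarter": 2.0, "eighth": 0.8}


def average_rate_kbps(usage):
    """Return the average data rate for a given rate usage distribution.

    `usage` maps a rate name to the fraction of frames coded at that rate.
    The fractions must sum to 1.
    """
    total = sum(usage.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"usage fractions sum to {total}, expected 1.0")
    return sum(RATE_SET_1_KBPS[name] * frac for name, frac in usage.items())


# Hypothetical usage split (an assumption for illustration only):
# 40% full rate, 10% half rate, no quarter rate, 50% eighth rate.
example_usage = {"full": 0.4, "half": 0.1, "quarter": 0.0, "eighth": 0.5}
print(f"{average_rate_kbps(example_usage):.2f} kbit/s")  # prints "4.22 kbit/s"
```

With this assumed split, the average is 0.4 × 8.55 + 0.1 × 4.0 + 0.5 × 0.8 = 4.22 kbit/s, roughly half the full rate, which shows why shifting silence and background noise frames to the eighth rate lowers the average so markedly.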
To accurately decide the best transmission rate, and obtain high quality output speech at that rate, input speech frames are categorized into various classes. For example, in SMV, these classes include silence, unvoiced, onset, plosive, non-stationary voiced and stationary voiced speech. It is generally known that certain coding techniques are often better suited for certain classes of sounds. Also, certain types of sounds, for example, voice onsets or unvoiced-to-voiced transition regions, have higher perceptual significance and thus require higher coding accuracy than other classes of sounds, such as unvoiced speech. Thus, the speech frame classification may be used to decide not only the most efficient transmission rate, but also the best-suited coding algorithm.
Accurate classification of input speech frames is typically required to fully exploit the signal redundancies and perceptual importance. Typical frame classification techniques include voice activity detection, measuring the amount of noise in the signal, measuring the level of voicing, detecting speech onsets, and measuring the energy in a number of frequency bands. These measures require the calculation of numerous parameters, such as maximum correlation values, line spectral frequencies, and frequency transformations.
While coders such as SMV achieve much better quality at a lower average data rate than existing speech codecs at similar bit rates, the frame classification and rate determination algorithms are generally complex. However, in the case of a tandem connection of two speech vocoders, many of the measurements needed to perform frame classification have already been calculated in the source codec. This can be capitalized on in a transcoding framework. In transcoding from the bitstream format of one Code Excited Linear Prediction (CELP) codec to the bitstream format of another CELP codec, rather than fully decoding to PCM and re-encoding the speech signal, smart interpolation methods may be applied directly in the CELP parameter space. Here, the term “smart” carries the meaning commonly understood by one of ordinary skill in the art. Hence, parameters such as pitch lag, pitch gain, fixed codebook gain, line spectral frequencies, and the source codec bit rate are available to the destination codec. This allows frame classification and rate determination of the destination voice codec to be performed in a fast manner. Depending upon the application, many limitations can exist in one or more of the techniques described above.
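Merely by way of example, the idea of classifying a frame and choosing a destination rate directly from source CELP parameters can be sketched as follows. This is a minimal sketch under stated assumptions, not the classification or rate determination algorithm of SMV, EVRC, or any other codec: the field names, the three thresholds, the reduced set of four classes, and the class-to-rate mapping are all hypothetical placeholders chosen only to make the example concrete.

```python
from dataclasses import dataclass

# Rate Set 1 rates in kbit/s, as listed in the text above.
RATE_SET_1_KBPS = {"full": 8.55, "half": 4.0, "quarter": 2.0, "eighth": 0.8}

# Hypothetical thresholds; real values would have to be tuned per codec pair.
SILENCE_GAIN = 10.0       # fixed codebook gain below this suggests silence
VOICED_PITCH_GAIN = 0.6   # pitch gain at or above this suggests voicing
ONSET_RATIO = 4.0         # gain jump relative to previous frame for an onset

# Hypothetical class-to-rate mapping (not taken from any standard).
CLASS_TO_RATE = {
    "silence": "eighth",
    "unvoiced": "quarter",
    "onset": "full",
    "voiced": "full",
}


@dataclass
class CelpFrameParams:
    """Parameters unpacked from one source-codec frame (names are assumed)."""
    pitch_gain: float
    fixed_codebook_gain: float


def classify_frame(params, prev_fixed_gain):
    """Classify a frame using only already-decoded CELP parameters."""
    if (params.fixed_codebook_gain < SILENCE_GAIN
            and params.pitch_gain < VOICED_PITCH_GAIN):
        return "silence"
    # A sharp rise in excitation gain is treated as a speech onset.
    if params.fixed_codebook_gain >= ONSET_RATIO * max(prev_fixed_gain, 1.0):
        return "onset"
    if params.pitch_gain >= VOICED_PITCH_GAIN:
        return "voiced"
    return "unvoiced"


def select_rate_kbps(params, prev_fixed_gain):
    """Map the frame class to a destination transmission rate."""
    frame_class = classify_frame(params, prev_fixed_gain)
    return RATE_SET_1_KBPS[CLASS_TO_RATE[frame_class]]
```

Because every input to `classify_frame` is read from the source bitstream, no correlation search or frequency transformation is repeated at the destination side, which is the source of the speed advantage described above.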
Although there has been much improvement in techniques for voice transcoding, it would be desirable to have improved ways of processing telecommunication signals.