1. Field of the Invention
The present invention relates generally to the field of speech coding and, more particularly, to conversion schemes for use between discontinuous transmission silence description systems and continuous transmission silence description systems.
2. Related Art
Speech communication systems typically include an encoder, a communication channel and a decoder. A digitized speech signal is input to the encoder at one end of the communication link, and the encoder converts the speech signal into a bit stream. The bit stream is then transmitted across the communication channel to the decoder, which processes the bit stream to reconstruct the original speech signal. As part of the encoding process, the speech signal can be compressed in order to reduce the amount of data that needs to be sent through the communication channel. The goal of compression is to minimize the amount of data needed to represent the speech signal, while still maintaining a high quality reconstructed speech signal. Various speech coding techniques are known in the art including, for example, linear predictive coding based methods that can achieve compression ratios of between 12 and 16. Accordingly, the amount of data that has to be sent across the communication channel is significantly lowered, which translates to greater system efficiency. For example, more efficient use of available bandwidth is possible since less data is transmitted.
A refinement of typical speech encoding techniques involves multi-mode encoding. With multi-mode encoding, different portions of a speech signal are encoded at different rates, depending on various factors, such as system resources, quality requirements and the characteristics of the speech signal. For example, a Selectable Mode Vocoder (“SMV”) can continually select optimal encoding rates, thereby providing enhanced speech quality while making it possible to increase system capacity.
Discontinuous transmission (“DTX”) is another method for reducing the amount of data that has to be transmitted across a communication channel. DTX takes advantage of the fact that only about 50% of a typical two-way conversation comprises actual speech activity, while the remaining 50% is silence or non-speech. Accordingly, DTX suspends speech-data transmission when a pause in the conversation is detected. Typically, devices operating in DTX mode require a Voice Activity Detector (“VAD”) configured to determine where pauses occur in the speech signal and to power on the transmitter only when voice activity is detected. DTX can operate in conjunction with multi-mode encoding to further reduce the amount of data needed to represent a speech signal and is thereby an effective means for increasing system capacity and conserving power resources. DTX is supported by various packet-based communication systems, including certain Voice-over-IP (“VoIP”) systems. For example, G.729 and G.723.1 are well-known Recommendations of the International Telecommunication Union (“ITU”), which support VoIP DTX-based speech coding schemes. In particular, the G.729 Recommendation provides for speech coding at a single rate of 8 Kbps, and the G.723.1 Recommendation provides for a single rate of either 6.3 Kbps or 5.3 Kbps.
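The role of the VAD in a DTX transmitter can be illustrated with a minimal sketch. The energy-threshold detector below is only an assumption chosen for brevity; it is not the VAD of G.729, G.723.1 or any other standard, and the names `frame_energy`, `is_voice_active`, `dtx_filter` and the threshold value are hypothetical.

```python
from typing import List, Optional, Sequence


def frame_energy(samples: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def is_voice_active(samples: Sequence[int], threshold: float = 1.0e4) -> bool:
    """Hypothetical VAD: a frame counts as speech if its energy exceeds a threshold."""
    return frame_energy(samples) > threshold


def dtx_filter(
    frames: Sequence[Sequence[int]], threshold: float = 1.0e4
) -> List[Optional[List[int]]]:
    """Keep frames with voice activity; replace the rest with None.

    None stands for a frame that a DTX transmitter would not send.
    """
    return [
        list(frame) if is_voice_active(frame, threshold) else None
        for frame in frames
    ]


# Example: one loud frame, one near-silent frame, one loud frame.
loud = [1000] * 160   # 160 samples, i.e. 20 ms at 8 kHz (assumed frame size)
quiet = [1] * 160
transmitted = dtx_filter([loud, quiet, loud])
```

In this sketch the near-silent middle frame is dropped, which is the source of the empty or non-transmittal frames that a receiving device must be able to handle.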
It is known, however, that not all current communications systems support DTX. For example, current Code Division Multiple Access (“CDMA”) systems require mobile units to be in continuous contact with a base station in order to receive and transmit various control signals. As such, discontinuous transmission is not supported since transmission cannot be powered off even when, for example, pauses occur in a conversation carried by the mobile unit.
As a result, problems can arise when a device configured to operate as part of a DTX-enabled communication system (i.e. a DTX-enabled device) communicates with a device configured to operate as part of a communication system that does not support DTX (i.e. a non-DTX device). For example, a speech signal encoded by a DTX-enabled device and transmitted to a non-DTX device may comprise empty or non-transmittal frames representing pauses in a conversation. These empty or non-transmittal frames, and thus the signal as a whole, may not be properly processed by the non-DTX device since it does not support DTX and is therefore not able to “fill up” the dropped frames it receives. When an encoded speech signal is transmitted from a non-DTX device to a DTX-enabled device, on the other hand, the advantages afforded by discontinuous transmission are diminished because the non-DTX device encodes every frame of the signal. In other words, the non-DTX device is not configured to drop any frames and consequently, every frame has to be transmitted across the communication channel, whether it contains actual speech activity or not.
Thus, there is a strong need in the art for a conversion method that can facilitate communication between DTX-enabled devices and non-DTX devices.
In accordance with the purpose of the present invention as broadly described herein, there are provided methods and systems for converting a speech signal in a speech communication system between a device operating in DTX mode and a device not operating in DTX mode. In one aspect, a frame of a first speech signal comprising a plurality of frames encoded at a plurality of first rates, including a first non-speech rate, is received. Thereafter, the particular rate of the received frame corresponding to one of the plurality of first rates is determined. Subsequently, if it is determined that the received frame is encoded at the first non-speech rate, then the received frame is re-encoded at either a second or third non-speech rate to generate a frame of a second speech signal. In one aspect, a decision is made as to whether the received frame encoded originally at the first non-speech rate is re-encoded at the second or the third non-speech rate. For example, the decision can be based on the characteristics of the received frame. In one aspect, the first non-speech rate is 0.0 Kbps, the second non-speech rate is 0.0 Kbps, and the third non-speech rate is 0.8 Kbps. In another aspect, the first non-speech rate is 0.8 Kbps, the second non-speech rate is 0.0 Kbps, and the third non-speech rate is 0.8 Kbps.
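The per-frame conversion described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the `Frame` type, the function names, the pass-through of frames at other rates, the 20 ms frame duration and the placeholder re-encoder are all hypothetical, and the decision rule is supplied by the caller because the text only says it can depend on characteristics of the received frame.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Frame:
    rate_kbps: float  # encoding rate of this frame
    payload: bytes    # encoded bits (empty for a 0.0 Kbps frame)


def placeholder_reencode(frame: Frame, target_rate_kbps: float) -> Frame:
    """Hypothetical re-encoder: emits a zero-filled payload of the right size.

    Assumes 20 ms frames, so 0.8 Kbps corresponds to 16 bits (2 bytes).
    """
    n_bits = round(target_rate_kbps * 20)  # Kbps * ms = bits
    return Frame(target_rate_kbps, bytes(n_bits // 8))


def convert_frame(
    frame: Frame,
    first_ns_rate: float,
    second_ns_rate: float,
    third_ns_rate: float,
    use_third: Callable[[Frame], bool],
    reencode: Callable[[Frame, float], Frame] = placeholder_reencode,
) -> Frame:
    """Convert one received frame of the first speech signal.

    1. Determine the rate of the received frame.
    2. If it is the first non-speech rate, decide between the second and
       third non-speech rates and re-encode at the chosen rate.
    Frames at any other rate are passed through unchanged (an assumption
    of this sketch).
    """
    if frame.rate_kbps != first_ns_rate:
        return frame
    target = third_ns_rate if use_third(frame) else second_ns_rate
    return reencode(frame, target)


# One aspect from the text: first = 0.0, second = 0.0, third = 0.8 Kbps.
empty = Frame(0.0, b"")
filled = convert_frame(empty, 0.0, 0.0, 0.8, use_third=lambda f: True)

# Another aspect: first = 0.8, second = 0.0, third = 0.8 Kbps.
silence = Frame(0.8, b"\x00\x00")
dropped = convert_frame(silence, 0.8, 0.0, 0.8, use_third=lambda f: False)
```

Passing the decision rule as a callable keeps the rate-mapping logic separate from whatever frame characteristics a real converter would inspect.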
Moreover, a system for converting a first speech signal to a second speech signal comprises a receiver for receiving a frame of the first speech signal, the first speech signal comprising a plurality of frames encoded at a plurality of first rates, including a first non-speech rate. The system further comprises a processor capable of determining the encoding rate of the received frame and capable of encoding the received frame at either a second or third non-speech rate if the processor determines that the received frame was originally encoded at the first non-speech rate.
These and other aspects of the present invention will become apparent with further reference to the drawings and specification, which follow. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.