This invention relates generally to methods and systems for communication of real-time audio, video, and data signals over a packet-switched data network, and more particularly to a method and system for minimizing delay induced by DTMF processing.
FIG. 1 is a diagram of the general topology of a packet telephony system 12. The packet telephony system 12 includes multiple telephone handsets 14 connected to a packet network 18 through gateways 16. The gateways 16 each include a codec for converting audio signals into audio packets and converting the audio packets back into audio signals.
The handsets 14 are traditional telephones or any other device capable of transmitting and/or receiving DTMF signals. Gateways 16 and the codecs used by the gateways 16 are any of a wide variety of currently commercially available devices used for connecting the handsets 14 to the packet network 18. For example, the gateways 16 can be Voice Over Internet Protocol (VoIP) telephones or personal computers that include a digital signal processor (DSP) and software for encoding audio signals into audio packets. The gateways 16 operate as a transmitting gateway when encoding audio signals into audio packets and transmitting the audio packets over the packet network 18 to a receiving endpoint. The gateways 16 operate as a receiving gateway when receiving audio packets over the packet network 18 and decoding the audio packets back into audio signals. Since packet telephony gateways 16 and codecs are well known, they are not described in further detail.
A conventional packet telephony gateway transmit path is shown in the transmitting gateway in FIG. 2. The transmitting packet gateway 20 includes a voice encoder 22, a packetizer 24, and a transmitter 26. Voice encoder 22 implements the compression half of a codec. Packetizer 24 accepts compressed voice data from encoder 22 and formats the data into packets for transmission. Transmitter 26 places the audio packets from packetizer 24 onto packet network 18.
A receiving packet gateway 24 is shown in FIG. 3. The receiving gateway 24 reverses the process utilized by the transmitting gateway 20 (FIG. 2). A depacketizer 30 accepts packets from packet network 18. A jitter buffer 32 buffers data frames and outputs them to voice decoder 34 in an orderly manner. Voice decoder 34 implements the decompression half of the codec employed by voice encoder 22 (FIG. 2).
Low bit-rate codecs 22, 34 typically model the bandpass filter arrangement of the human auditory system, including the frequency dependence of auditory perception, in allocating bits to different portions of a signal. In essence, low bit-rate encoding often involves many decisions to discard or ignore actual information not typically represented in human speech.
Because it is optimized for human speech, voice encoding can produce undesirable effects if the audio signal being encoded is not speech. Computer modem and facsimile audio signals are examples of such signals; both can be badly distorted by voice encoding. Modems and facsimile machines employ in-band signaling, i.e., they utilize the audio channel of a telephony connection to convey data to a non-human receiver. However, modem and facsimile traffic does not “share” a voice line with a human speaker. Packet telephony systems can therefore detect such in-band traffic during call connection and switch it to a higher bandwidth, non-voice encoding channel.
Other types of in-band signals share a voice channel with a human speaker. Most common among these are the DTMF (dual-tone multi-frequency) in-band signals generated by a common 12-button telephone keypad. Voice mail, paging, automated information retrieval, and remote control systems are among the wide variety of automated telephony receivers that rely on DTMF in-band control signals keyed in by a human speaker.
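For illustration only, the following minimal Python sketch shows how a keypad DTMF signal might be generated and recognized. The row and column frequencies are the standard DTMF keypad frequencies; the 8 kHz sample rate, the 40 ms window, the use of the Goertzel algorithm, and all function names are assumptions made for this sketch and are not taken from the system described here.

```python
import math

SAMPLE_RATE = 8000                 # Hz; assumed narrowband telephony rate
ROWS = (697, 770, 852, 941)        # standard DTMF low-group (row) tones, Hz
COLS = (1209, 1336, 1477)          # standard DTMF high-group (column) tones, Hz
KEYS = ("123", "456", "789", "*0#")  # 12-button keypad layout


def dtmf_tone(key, duration_ms=40):
    """Return samples of the two-tone signal for one keypad key."""
    for r, row in enumerate(KEYS):
        if len(key) == 1 and key in row:
            f_lo, f_hi = ROWS[r], COLS[row.index(key)]
            break
    else:
        raise ValueError(f"not a keypad key: {key!r}")
    n = SAMPLE_RATE * duration_ms // 1000
    return [
        0.5 * math.sin(2 * math.pi * f_lo * i / SAMPLE_RATE)
        + 0.5 * math.sin(2 * math.pi * f_hi * i / SAMPLE_RATE)
        for i in range(n)
    ]


def goertzel_power(samples, freq):
    """Signal power at a single frequency (Goertzel recurrence)."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples):
    """Pick the strongest row and column tone.

    Assumes the samples are already known to contain a DTMF tone; a
    practical detector would also apply energy and twist thresholds
    so that speech or silence is not reported as a key.
    """
    row = max(range(len(ROWS)), key=lambda r: goertzel_power(samples, ROWS[r]))
    col = max(range(len(COLS)), key=lambda c: goertzel_power(samples, COLS[c]))
    return KEYS[row][col]
```

Under these assumptions, `detect_key(dtmf_tone("9"))` recovers the key from the uncompressed samples. A low bit-rate voice codec that distorts either of the two tones can defeat this kind of frequency-based detection at the receiving end.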
Because the signal is carried “in-band” as part of the encoded voice stream, DTMF is poorly encoded by the system shown in FIG. 2 if a low bit-rate coder is used. The reconstructed DTMF signals may be unrecognizable to an automated DTMF receiver. One popular low bit-rate coder, G.723.1, is widely recognized to have very poor DTMF fidelity. Other low bit-rate codecs also have marginal DTMF fidelity upon decode and are therefore unsuitable without modification for many telephony applications, such as Interactive Voice Response (IVR).
In order to avoid these fidelity problems, more sophisticated packet telephony systems are capable of detecting DTMF in the transmitting gateway in parallel with voice encoding. FIG. 4 depicts a parallel voice-encoding/DTMF detector transmitting packet gateway 38. Transmitting gateway 38 operates a DTMF in-band signal detector 40 on an uncompressed audio data stream 20, in parallel with voice encoder 22. If speech is present in the data stream 20, packetizer 24 will be supplied with a voice-encoded signal from encoder 22. If a DTMF signal appears in the data stream, the DTMF signal, rather than the voice-encoded signal, is supplied separately to packetizer 24. This system allows DTMF signals to effectively bypass the voice codec 22, thereby avoiding DTMF signal distortion. FIG. 4 depicts one of several different schemes where the suppression of the voice is done before packetization.
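The selection step performed ahead of the packetizer can be sketched as follows. This is a hypothetical illustration in Python, not the implementation of gateway 38: the `Packet` type, the `transmit_frame` function, and the detector and encoder callables are all names assumed for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Packet:
    kind: str      # "voice" or "dtmf" (labels assumed for this sketch)
    payload: Any   # encoded voice data, or the detected DTMF digit


def transmit_frame(
    frame: Any,
    detect_dtmf: Callable[[Any], Optional[str]],
    encode_voice: Callable[[Any], Any],
) -> Packet:
    """Choose what the packetizer is supplied with for one audio frame.

    Both stages see the same uncompressed frame. The encoder runs on
    every frame, mirroring the parallel arrangement, but its output is
    discarded whenever the detector reports a digit.
    """
    digit = detect_dtmf(frame)
    encoded = encode_voice(frame)
    if digit is not None:
        return Packet("dtmf", digit)   # DTMF bypasses the voice codec
    return Packet("voice", encoded)
```

With stub callables, for example a detector that returns `"9"` for a frame labelled as a tone and `None` otherwise, the function returns a `"dtmf"` packet for tone frames and a `"voice"` packet for everything else. The sketch ignores timing entirely; the latency cost of waiting for a valid detection is a separate matter.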
Although a parallel voice-encoding/DTMF detector packet telephony transmitter 38 can avoid DTMF fidelity problems, this capability comes at the price of higher latency. International Telecommunication Union (ITU) standards specify that a valid DTMF signal be at least 40 milliseconds (ms.) in duration. During the 40 ms. duration of a DTMF pulse, the voice encoder 22 is not allowed to ship frames containing voice-compressed DTMF. Otherwise, the receiver could garble the DTMF signal or identify two signals: first the voice-encoded signal and then the DTMF detector-generated signal.
To avoid this problem, voice encoder 22 delays all speech output by a fixed delay of at least 40 ms. to allow the DTMF detector 40 to detect valid DTMF samples. This delay allows the transmitter to switch smoothly from voice-encoding to DTMF transmission without causing confusion at the receiving packet gateway 24 (FIG. 3). Unfortunately, this same delay adds to the call latency perceived by voice callers utilizing the packet voice connection.
The consequence for end-to-end delay in packet telephony system 12 (FIG. 1) is that all speech must be delayed by a minimum of 40 ms. in the transmitting gateway 38. If this is not done, the receiving gateway would first receive 40 ms. of speech which is actually DTMF, followed after an unpredictable interval by the true DTMF packets. The receiving gateway then plays out one or the other or both, resulting in either garbled DTMF or a duplicated input, such as two “9s” rather than one.
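The effect of the fixed delay can be illustrated with a small simulation. The sketch below is a simplified model written for this explanation, not the gateway's actual logic: it assumes 10 ms frames, represents each frame as either `None` (speech) or a digit string (a frame of that DTMF tone), and uses a hypothetical `transmit` function that holds encoded frames in a delay line until the detector has seen 40 ms of the same tone.

```python
from collections import deque

FRAME_MS = 10      # assumed frame size for this sketch
MIN_DTMF_MS = 40   # minimum valid DTMF duration cited in the text


def transmit(frames, delay_ms):
    """Simulate a transmitter whose voice output is delayed by delay_ms.

    frames: one entry per frame, None for speech or a digit for a tone.
    Returns the packets sent, as ("voice", frame) or ("dtmf", digit).
    A ("voice", digit) entry is a voice-encoded tone that leaked out.
    """
    hold = deque()                  # delay line of frames awaiting shipment
    depth = delay_ms // FRAME_MS    # frames the delay line may hold back
    sent = []
    run_digit, run_len, reported = None, 0, False

    for digit in frames:
        # Track how long the current tone has lasted.
        if digit is None:
            run_digit, run_len, reported = None, 0, False
        elif digit == run_digit:
            run_len += 1
        else:
            run_digit, run_len, reported = digit, 1, False

        if reported:
            continue                # tone already sent as DTMF; suppress voice

        hold.append(digit)
        while len(hold) > depth:    # ship frames older than the delay
            sent.append(("voice", hold.popleft()))

        if digit is not None and run_len * FRAME_MS >= MIN_DTMF_MS:
            # Valid tone: discard any of its frames still held back,
            # then send the digit once as a DTMF packet.
            for _ in range(min(run_len, len(hold))):
                hold.pop()
            sent.append(("dtmf", digit))
            reported = True

    while hold:                     # flush remaining frames at end of input
        sent.append(("voice", hold.popleft()))
    return sent
```

In this model, running a 60 ms “9” tone surrounded by speech through `transmit(frames, 0)` ships four voice-encoded tone frames (40 ms) ahead of the DTMF packet, which is the duplicate or garbled case described above. With `delay_ms=40` no tone frame is shipped as voice, and with `delay_ms=30` one still leaks, which is consistent with the stated minimum of 40 ms. The cost is that every speech frame is also held back by the same amount.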
Accordingly, a need remains for accurately detecting and transmitting DTMF without adding end-to-end delay to the packet network.