In certain packet telephony systems, a terminal only transmits when voice activity is present. Such discontinuous transmission (DTX) packet telephony systems allow for greater system capacity, as compared with systems in which a channel is allocated to a transmitting terminal for the duration of the call, or session.
In DTX systems, at the start of each talkspurt, the transmitting device, typically a wireless handset, requests a transmission channel from the base station. The base station, which uses statistical multiplexing for allocating channels, establishes a path via a network and/or intermediate switches to connect to the remote receiving device, which may be another handset, a conventional land-line phone, or the like.
The principal functions of the transmitting device and the base station in a DTX system are discussed below. A speaker's voice is received by an audio input port (AIP) where the voice signal is digitally sampled at some frequency fs, typically fs=8 kHz. The sampled signal is usually divided into frames of length 10 msec or so (i.e., 80 samples) prior to further processing. The frames are input to a voice activity detector (VAD) and a speech encoder. As is known to those skilled in the art, in some devices, the VAD is integrated into the speech encoder, although this is not a requirement in prior art systems. In any event, the VAD determines whether or not speech is present and, if so, sends an active signal to the handset's control interface. The handset's control interface sends a traffic channel request over the control channel to the traffic channel manager resident in the base station. In response to the request, the traffic channel manager eventually sends back a traffic channel grant to the handset's control interface, using the control channel. Upon receiving the traffic channel grant, the handset's control interface notifies the VAD, the speech encoder and/or the handset's bit-stream transmitter that a traffic channel has been allocated for transmitting voice data. When this happens, the speech encoder encodes the speech frames and sends the encoded speech signal to the handset's bit-stream transmitter for transmission over the traffic channel to the appropriate bit-stream receiver associated with the base station. In some devices, the speech encoder prepares frames for transmission and sends these to the bit-stream transmitter, whether or not there is voice information to be transmitted. In such case, the transmitter does not transmit until it receives a signal indicating that the traffic channel is available.
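By way of illustration only, the framing, voice activity detection, and request/grant sequence described above may be sketched as follows. The sketch is not taken from any particular handset: the energy-threshold VAD, the names `ControlInterface`, `vad_is_active`, and `transmit_flags`, the fixed grant delay of six frames, and the release of the channel on the first inactive frame are all assumptions made for the example. Only the 8 kHz sampling rate and the 10 msec frame length come from the description.

```python
from dataclasses import dataclass
from enum import Enum, auto

SAMPLE_RATE_HZ = 8000   # fs, as given in the description
FRAME_MS = 10           # frame length, as given in the description
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 80 samples


def split_into_frames(samples):
    """Divide a sampled signal into whole frames, dropping any partial tail."""
    count = len(samples) // SAMPLES_PER_FRAME
    return [samples[i * SAMPLES_PER_FRAME:(i + 1) * SAMPLES_PER_FRAME]
            for i in range(count)]


def vad_is_active(frame, threshold=1000.0):
    """Hypothetical VAD: mean squared amplitude above an assumed threshold."""
    return sum(s * s for s in frame) / len(frame) > threshold


class State(Enum):
    IDLE = auto()        # no traffic channel held or requested
    REQUESTED = auto()   # request sent, grant not yet received
    GRANTED = auto()     # traffic channel allocated


@dataclass
class ControlInterface:
    """Handset-side handshake; the grant delay is an assumed constant."""
    grant_delay_frames: int = 6
    state: State = State.IDLE
    frames_waiting: int = 0

    def on_frame(self, active):
        """Process one frame; return True if it may be transmitted."""
        if not active:
            # Assumption: the channel is released as soon as speech stops.
            self.state = State.IDLE
            self.frames_waiting = 0
            return False
        if self.state is State.IDLE:
            # Start of a talkspurt: send the traffic channel request.
            self.state = State.REQUESTED
            return False
        if self.state is State.REQUESTED:
            self.frames_waiting += 1
            if self.frames_waiting < self.grant_delay_frames:
                return False
            self.state = State.GRANTED   # grant received
        return True


def transmit_flags(samples, ctrl):
    """Return, frame by frame, whether the transmitter is allowed to send."""
    return [ctrl.on_frame(vad_is_active(f)) for f in split_into_frames(samples)]
```

With the assumed six-frame grant delay, the first six active frames of a talkspurt (60 msec of speech) are produced before the transmitter is permitted to send, which is the situation the delay discussion in the following paragraph addresses.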
In the above-described conventional system, there is a delay between the time that frames emerge from the audio input port and the time that the bit-stream transmitter begins to transmit voice data. The overall delay includes a first delay, the "VAD delay", associated with the time that it takes the VAD to detect that voice activity is present and notify the handset's control interface prior to the traffic channel request, and a second delay, the "channel access delay", associated with the time between the traffic channel request and the traffic channel grant. The length of the VAD delay is fixed for a given handset, and depends on such things as the frame length being used. The length of the channel access delay, however, varies from talkspurt to talkspurt and depends on such factors as the system architecture and the system load. For example, in the wireless voice over EDGE (Enhanced Data for GSM Evolution) system, the channel access delay is approximately 60 msec, and possibly more. Conventionally, mitigating any type of access delay entails either a) buffering the voice bit-stream until permission is granted, and thereby retarding transmission by that amount of time, b) throwing away speech at the beginning of each utterance (i.e., "front-end clipping") until permission is granted, or c) a combination of the two approaches. The buffering option introduces delay, which is detrimental to the dynamics of interactive conversations. Indeed, adding 120 msec of round-trip delay (approximately 60 msec in each direction) just for access delay can break the overall delay budget for the system. The front-end clipping option often cuts off the initial consonant of each utterance, and thus hurts intelligibility. Finally, combining the two options such that less clipping occurs at the expense of delay is less than satisfactory because such an approach suffers from the disadvantages of both.
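The trade-off among the three conventional options can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: it models only the channel access delay (the VAD delay is ignored), it assumes that both directions of a conversation incur the same access delay, and the function names `mitigation_tradeoff` and `round_trip_added_ms` are hypothetical. The 60 msec figure is the EDGE example given above.

```python
ACCESS_DELAY_MS = 60   # approximate EDGE channel access delay from the text


def mitigation_tradeoff(access_delay_ms, buffered_ms):
    """Return (added one-way delay, clipped speech), both in msec.

    `buffered_ms` is the part of the access delay absorbed by buffering;
    the remainder is discarded as front-end clipping.
    """
    if not 0 <= buffered_ms <= access_delay_ms:
        raise ValueError("buffered_ms must lie between 0 and the access delay")
    return buffered_ms, access_delay_ms - buffered_ms


def round_trip_added_ms(buffered_ms):
    """Added round-trip delay, assuming the same buffering in each direction."""
    return 2 * buffered_ms


if __name__ == "__main__":
    options = [("a) buffering only", ACCESS_DELAY_MS),
               ("b) clipping only", 0),
               ("c) combination", ACCESS_DELAY_MS // 2)]
    for label, buffered in options:
        delay, clipped = mitigation_tradeoff(ACCESS_DELAY_MS, buffered)
        print(f"{label}: +{delay} msec one way, "
              f"+{round_trip_added_ms(buffered)} msec round trip, "
              f"{clipped} msec of speech clipped")
```

Under these assumptions, buffering alone adds 60 msec one way (120 msec round trip) and clips nothing, clipping alone adds no delay but discards 60 msec of speech, and an even split adds 30 msec one way while still discarding 30 msec, so every choice along this line carries some delay, some clipping, or both.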