In certain packet telephony systems, a terminal only transmits when voice activity is present. Such discontinuous transmission (DTX) packet telephony systems allow for greater system capacity, as compared with systems in which a channel is allocated to a transmitting terminal for the duration of the call, or session.
With reference to FIG. 1, in DTX systems, at the start of each talkspurt, the transmitting device 102, typically a wireless handset, requests a transmission channel from the base station 104. The base station 104, which uses statistical multiplexing for allocating channels, establishes a path via a network 106 and/or intermediate switches 108 to connect to the remote receiving device 110, which may be another handset, conventional land-line phone, or the like.
FIG. 2 presents a block diagram of the principal functions of the transmitting device 102 and the base station 104 in a DTX system. A speaker's voice is received by an audio input port (AIP) 122 where the voice signal is digitally sampled at some frequency fs, typically fs=8 kHz. The sampled signal is usually divided into frames of length 10 msec or so (i.e., 80 samples) prior to further processing. The frames are input to a voice activity detector (VAD) 124 and a speech encoder 126. As is known to those skilled in the art, in some devices, the VAD 124 is integrated into the speech encoder 126, although this is not a requirement in prior art systems. In any event, the VAD 124 determines whether or not speech is present and, if so, sends an active signal to the handset's control interface 128. The handset's control interface 128 sends a traffic channel request over the control channel 130 to the traffic channel manager 132 resident in the base station 104. In response to the request, the traffic channel manager 132 eventually sends back a traffic channel grant to the handset=s control interface 128, using the control channel 130. Upon receiving the traffic channel grant, the handset's control interface notifies the VAD 124, the speech encoder 126 and/or the handset's bit-stream transmitter 134 that a traffic channel 136 has been allocated for transmitting voice data. When this happens, the speech encoder 126 encodes the speech frames and sends the encoded speech signal to the handset's bit-stream transmitter 134 for transmission over the traffic channel 136 to the appropriate bit-stream receiver 138 associated with the base station 104. In some devices, the speech encoder 126 prepares frames for transmission and sends these to the bit-stream transmitter, whether or not there is voice information to be transmitted. In such case, the transmitter does not transmit until it receives a signal indicating that the traffic channel 136 is available.
In the above-described conventional system, there is delay between the time that frames emerge from the audio input port and the bit-stream transmitter 134 begins to transmit voice data. The overall delay includes a first delay associated with the time that it takes the VAD to detect that voice activity is present and notify the handset's control interface prior to the traffic channel request, the VAD delay, and a second delay associated, with the time between the traffic channel request and the traffic channel grant, the channel access delay. The length of the VAD delay is fixed for a given handset, and depends on such things as the frame length being used. The length of the channel access delay, however, varies from talkspurt to talkspurt and depends on such factors as the system architecture and the system load. For example, in the wireless voice over EDGE (Enhanced Data for GSM Evolution) system, the channel access delay is approximately 60 msec, and possibly more. Conventionally, mitigating any type of access delay entails either a) buffering the voice bit-stream until permission is granted, and thereby retarding transmission by that amount of time, b) throwing away speech at the beginning of each utterance ((i.e., front-end clipping until permission is granted, or c) a combination of the two approaches. The buffering option introduces delay, which is detrimental to the dynamics of interactive conversations. Indeed, adding 120 msec of round trip delay just for access delay can break the overall delay budget for the system. The front-end clipping option often cuts off the initial consonant of each utterance, and thus hurts intelligibility. Finally, combining the two options such that less clipping occurs at the expense of delay is less than satisfactory because such an approach suffers from the disadvantages of both.