When a speech call is performed over a cellular network, the speech data that is transferred is typically coded into audio frames according to a voice coding algorithm such as one of the coding modes of the Adaptive Multi-Rate (AMR) codec or the Wideband AMR (AMR-WB) codec, the GSM Enhanced Full Rate (EFR) algorithm, or the like. As a result, each of the resulting communication frames transmitted over the wireless link can be seen as a data packet containing a highly compressed representation of the audio for a given time interval.
FIG. 1 provides a simplified schematic diagram of those functional elements of a conventional cellular phone 100 that are generally involved in a speech call, including microphone 50, speaker 60, modem circuits 110, and audio circuits 150. Here, the audio that is captured by microphone 50 is pre-processed in audio pre-processing circuits 180 (which may include, for example, audio processing functions such as filtering, digital sampling, echo cancellation, or the like) and then encoded into a series of audio frames by audio encoder 160, which may implement for example, a standards-based encoding algorithm such as one of the AMR coding modes. The encoded audio frames are then passed to the transmitter (TX) baseband processing circuit 130, which typically performs various standards-based processing tasks (e.g., ciphering, channel coding, multiplexing, modulation, and the like) before transmitting the encoded audio data to a cellular base station via radio frequency (RF) front-end circuits 120.
For audio received from the cellular base station, the modem circuits 110 receive the radio signal from the base station via the RF front-end circuits 120, and demodulates and decodes the received signals with receiver (RX) baseband processing circuits 140. The resulting encoded audio frames produced by the modem circuits 110 are then processed by audio decoder 170 and audio post-processing circuits 190, and the resulting audio signal is passed to the loud speaker 60.
An audio frame typically corresponds to a fixed time interval, such as 20 milliseconds. (Audio frames are transmitted/received on average every 20 milliseconds for all voice call scenarios defined in current versions of the WCDMA and GSM specifications). This means that audio circuits 150 produce one encoded audio frame (for transmission to the network) and consume another (received from the network) every 20 milliseconds, on average, assuming a bi-directional audio link. Typically, these encoded audio frames are transmitted to and received from the communication network at exactly the same rate, although not always. In some cases, for example, two encoded audio frames might be combined to form a single communication frame for transmission over the radio link. In addition, the timing references used to drive the modem circuitry and the audio circuitry may differ, in some situations, in which case a synchronization technique may be needed keep the average rates the same, thus avoiding overflow or underflow of buffers. Several such synchronization techniques are disclosed in U.S. Patent Application Publications 2009/0135976 A1 and 2006/0285557 A1, by Ramakrishnan et al. and Anderton et al., respectively. Furthermore, the exact timing relationship between transmission and reception of the communication frames is generally not fixed, at least at the cellular phone end of the link.
The audio and radio processing pictured in FIG. 1 contribute delays in both directions of audio data transmission—i.e., from the microphone to the remote base station as well as from the remote base station to the speaker. Reducing these delays is an important objective of communications network and device designers.