1. Field
The disclosed embodiments relate to wireless communications. More particularly, the disclosed embodiments relate to a novel and improved method and apparatus for interoperability between dissimilar voice transmission systems during speech inactivity.
2. Background
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve a speech quality of conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and re-synthesis at the receiver, a significant reduction in the data rate can be achieved. Interoperability of such coding schemes for various types of speech is necessary for communications between different transmission systems. Active speech and non-active speech signals are fundamental types of generated signals. Active speech represents vocalization, while speech inactivity, or non-active speech, typically comprises silence and background noise.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Hereinafter, the terms “frame” and “packet” are inter-changeable. Speech coders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the. incoming speech frame to extract certain relevant gain and spectral parameters, and then, quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, de-quantizes them to produce the parameters, and then re-synthesizes the frames using the de-quantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits N and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr=Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis, and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) sub-frames) at a time. For each sub-frame, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho & R. M. Gray, Vector Quantization and Signal Compression (1992). Different types of speech within a given transmission system may be coded using different implementations of speech coders, and different transmission systems may implement coding of given speech types differently.
For coding at lower bit rates, various methods of spectral, or frequency-domain, coding of speech have been developed, in which the speech signal is analyzed as a time-varying evolution of spectra. See, e.g., R. J. McAulay & T. F. Quatieri, Sinusoidal Coding, in Speech Coding and Synthesis ch. 4 (W. B. Kleijn & K. K. Paliwal eds., 1995). In spectral coders, the objective is to model, or predict, the short-term speech spectrum of each input frame of speech with a set of spectral parameters, rather than to precisely mimic the time-varying speech waveform. The spectral parameters are then encoded and an output frame of speech is created with the decoded parameters. The resulting synthesized speech does not match the original input speech waveform, but offers similar perceived quality. Examples of frequency-domain coders that are well known in the art include multiband excitation coders (MBEs), sinusoidal transform coders (STCs), and harmonic coders (HCs). Such frequency-domain coders offer a high-quality parametric model having a compact set of parameters that can be accurately quantized with the low number of bits available at low bit rates.
In wireless voice communication systems where lower bit rates are desired it is typically also desirable to reduce the level of transmitted power so as to reduce be co-channel interference and to prolong battery life of portable units. Reducing the overall transmitted data rate also serves to reduce the power level of transmitted data. A typical telephone conversation contains approximately 40 per cent speech bursts, and 60 percent silence and background acoustic noise. Background noise carries less perceptual information than speech. Because it is desirable to transmit silence and background noise at the lowest possible bit rate, using the active speech coding-rate during speech inactivity periods is inefficient.
A common approach for exploiting the low voice activity in conversational speech is to use a Voice Activity Detector. (VAD) unit that discriminates between voice and non-voice signals in order to transmit silence, or background noise at reduced data rates. However, coding schemes used by different types of transmission systems, such as Continuous Transmission (CTX) systems and Discontinuous Transmission (DTX) systems are not compatible during transmissions of silence or background noise. In a CTX system, data frames are continuously transmitted, even during periods of speech inactivity. When speech is not present in a DTX system, transmission is discontinued to reduce the overall transmission power. Discontinuous transmission for Global System for Mobile Communications (GSM) systems has been standardized in the European Telecommunications Standard Institute proposals to the International Telecommunications Union (ITU) entitled “Digital Cellular Telecommunication System (Phase 2+); Discontinuous Transmission (DTX) for Enhanced Full Rate (EFR) Speech Traffic Channels”, and “Digital Cellular Telecommunication System (Phase 2+); Discontinuous Transmission (DTX) for Adaptive Multi-Rate (AMR) Speech Traffic Channels”.
CTX systems require a continuous mode of transmission for system synchronization and channel quality monitoring. Thus, when speech is absent, a lower rate coding mode is used to continuously encode the background noise. Code Division Multiple Access (CDMA)-based systems use this approach for variable rate transmission of voice calls. In a CDMA system, eighth rate frames are transmitted during periods of non-activity. 800 bits per second (bps), or 16 bits in every 20 millisecond (ms) frame time, are used to transmit non-active speech. A CTX system, such as CDMA, transmits noise information during voice inactivity for listener comfort as well as synchronization and channel quality measurements. At the receiver side of a CTX communications system, ambient background noise is continuously present during periods of speech non-activity.
In DTX systems, it is not necessary to transmit bits in every 20 ms frame during non-activity. GSM, Wideband CDMA, Voice Over IP systems, and certain satellite systems are DTX systems. In such DTX systems, the transmitter is switched off during periods of speech non-activity. However, at the receiver side of DTX systems, no continuous signal is received during periods of speech non-activity, which causes background noise to be present during active speech, but disappear during periods of silence. The alternating presence and absence of background noise is annoying and objectionable to listeners. To fill the gaps between speech bursts, a synthetic noise known as “comfort noise”, is generated at the receiver side using transmitted noise information. A periodic update of the noise statistics is transmitted using what are known as Silence Insertion Descriptor (SID) frames. Comfort Noise for GSM systems has been standardized in the European Telecommunications Standard Institute proposals to the International Telecommunications Union (ITU) entitled “Digital Cellular Telecommunication System (Phase 2+); Comfort Noise Aspects for Enhanced Full Rate (EFR) Speech Traffic Channels”, and “Digital Cellular Telecommunication System (Phase 2+); Comfort Noise Aspects for Adaptive Multi-Rate (AMR) Speech Traffic Channels”. Comfort noise especially improves listening quality at the receiver when the transmitter is located in noisy environments such as a street, a shopping mall, or a car, etc.
DTX systems compensate for the absence of continuously transmitted noise by generating synthetic comfort noise during periods of inactive speech at the receiver using a noise synthesis model. To generate synthetic comfort noise in DTX systems, one SD frame carrying noise information is transmitted periodically. A periodic DTX representative noise frame; or SID) frame, is typically transmitted once every 20 frame times when the VAD indicates silence.
A model common to both CTX and DTX systems for generating comfort noise at a decoder uses a spectral shaping filter. A random (white) excitation is multiplied by gains and shaped by a spectral shaping filter using received gain and spectral parameters to produce synthetic comfort noise. Excitation gains and spectral information representing spectral shaping are transmitted parameters. In CTX systems, the gain and spectral parameters are encoded at eighth rate and transmitted every frame. In DTX systems, SID frames containing averaged/quantized gain and spectral values are transmitted each period. These differences in coding and transmission schemes for comfort noise cause incompatibility between CTX and DTX transmission systems during periods of non-active speech. Thus, there is a need for interoperability between CTX and DTX voice communications systems that transmit non-voice information.