I. Field
The present invention relates to communications. More particularly, the present invention relates to a novel and improved method and apparatus for performing variable rate code excited linear predictive (CELP) coding.
II. Description of the Related Art
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over the channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of 64 kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices which employ techniques to compress voiced speech by extracting parameters that relate to a model of human speech generation are typically called vocoders. Such devices are composed of an encoder, which analyzes the incoming speech to extract the relevant parameters, and a decoder, which resynthesizes the speech using the parameters which it receives over the transmission channel. In order to be accurate, the model must be constantly changing. Thus the speech is divided into blocks of time, or analysis frames, during which the parameters are calculated. The parameters are then updated for each new frame.
Among the various classes of speech coders, Code Excited Linear Predictive Coding (CELP), Stochastic Coding, and Vector Excited Speech Coding belong to one class. An example of a coding algorithm of this particular class is described in the paper "A 4.8 kbps Code Excited Linear Predictive Coder" by Thomas E. Tremain et al., Proceedings of the Mobile Satellite Conference, 1988.
The function of the vocoder is to compress the digitized speech signal into a low bit rate signal by removing all of the natural redundancies inherent in speech. Speech typically has short term redundancies due primarily to the filtering operation of the vocal tract, and long term redundancies due to the excitation of the vocal tract by the vocal cords. In a CELP coder, these operations are modeled by two filters, a short term formant filter and a long term pitch filter. Once these redundancies are removed, the resulting residual signal can be modeled as white Gaussian noise, which also must be encoded. The basis of this technique is to compute the parameters of a filter, called the LPC filter, which performs short-term prediction of the speech waveform using a model of the human vocal tract. In addition, long-term effects, related to the pitch of the speech, are modeled by computing the parameters of a pitch filter, which essentially models the human vocal cords. Finally, these filters must be excited, and this is done by determining which one of a number of random excitation waveforms in a codebook results in the closest approximation to the original speech when the waveform excites the two filters mentioned above. Thus the transmitted parameters relate to three items: (1) the LPC filter, (2) the pitch filter and (3) the codebook excitation.
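By way of illustration only, the codebook search described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the coder of any cited reference: the function names synthesize and search_codebook, the 40-sample vector length, the 16-entry codebook, the filter coefficients, and the plain squared-error criterion are all hypothetical choices made for the example.

```python
import random

def synthesize(excitation, lpc, pitch_lag, pitch_gain):
    """Pass an excitation through a pitch filter, then an all-pole formant filter."""
    # Long-term (pitch) filter: p[n] = e[n] + pitch_gain * p[n - pitch_lag]
    pitched = []
    for n, e in enumerate(excitation):
        past = pitched[n - pitch_lag] if n >= pitch_lag else 0.0
        pitched.append(e + pitch_gain * past)
    # Short-term (formant) filter: s[n] = p[n] + sum_k lpc[k] * s[n - 1 - k]
    out = []
    for n, p in enumerate(pitched):
        acc = p
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, lpc, pitch_lag, pitch_gain):
    """Return (index, gain, error) of the codevector that best matches target."""
    best = None
    for index, vector in enumerate(codebook):
        y = synthesize(vector, lpc, pitch_lag, pitch_gain)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares gain for this candidate, then its squared error.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Hypothetical data: a random Gaussian codebook and a target built from entry 5.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
lpc = [0.9, -0.2]  # assumed stable coefficients (poles at 0.5 and 0.4)
target = [2.5 * v for v in synthesize(codebook[5], lpc, 20, 0.5)]
index, gain, error = search_codebook(target, codebook, lpc, 20, 0.5)
```

In this sketch the decoder would need only the codebook index, the gain, and the two sets of filter parameters to regenerate the waveform, which corresponds to the three groups of transmitted parameters listed above.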
Although the use of vocoding techniques furthers the objective of reducing the amount of information sent over the channel while maintaining high quality reconstructed speech, other techniques need to be employed to achieve further reduction. One technique previously used to reduce the amount of information sent is voice activity gating. In this technique no information is transmitted during pauses in speech. Although this technique achieves the desired data reduction, it suffers from several deficiencies.
In many cases, the quality of speech is reduced due to clipping of the initial parts of words. Another problem with gating the channel off during inactivity is that the system users perceive the lack of the background noise which normally accompanies speech and rate the quality of the channel as lower than a normal telephone call. A further problem with activity gating is that occasional sudden noises in the background may trigger the transmitter when no speech occurs, resulting in annoying bursts of noise at the receiver.
In an attempt to improve the quality of the synthesized speech in voice activity gating systems, synthesized comfort noise is added during the decoding process. Although some improvement in quality is achieved from adding comfort noise, it does not substantially improve the overall quality since the comfort noise does not model the actual background noise at the encoder.
A preferred technique to accomplish data compression, so as to result in a reduction of information that needs to be sent, is to perform variable rate vocoding. Since speech inherently contains periods of silence, i.e. pauses, the amount of data required to represent these periods can be reduced. Variable rate vocoding most effectively exploits this fact by reducing the data rate for these periods of silence. A reduction in the data rate, as opposed to a complete halt in data transmission, for periods of silence overcomes the problems associated with voice activity gating while facilitating a reduction in transmitted information.
Copending U.S. Pat. No. 5,414,796, issued May 9, 1995, entitled "Variable Rate Vocoder", assigned to the assignee of the present invention and incorporated by reference herein, details a vocoding algorithm of the previously mentioned class of speech coders, Code Excited Linear Predictive Coding (CELP), Stochastic Coding or Vector Excited Speech Coding. The CELP technique by itself does provide a significant reduction in the amount of data necessary to represent speech in a manner that upon resynthesis results in high quality speech. As mentioned previously, the vocoder parameters are updated for each frame. The vocoder detailed in the above-mentioned patent provides a variable output data rate by changing the frequency and precision of the model parameters.
The vocoding algorithm of the above-mentioned patent differs most markedly from the prior CELP techniques by producing a variable output data rate based on speech activity. The structure is defined so that the parameters are updated less often, or with less precision, during pauses in speech. This technique allows for an even greater decrease in the amount of information to be transmitted. The phenomenon which is exploited to reduce the data rate is the voice activity factor, which is the average percentage of time a given speaker is actually talking during a conversation. For typical two-way telephone conversations, the average data rate is reduced by a factor of 2 or more. During pauses in speech, only background noise is being coded by the vocoder. At these times, some of the parameters relating to the human vocal tract model need not be transmitted.
As mentioned previously, a prior approach to limiting the amount of information transmitted during silence is called voice activity gating, a technique in which no information is transmitted during moments of silence. On the receiving side the period may be filled in with synthesized "comfort noise". In contrast, a variable rate vocoder is continuously transmitting data which, in the exemplary embodiment of the above-mentioned patent, is at rates which range between approximately 8 kbps and 1 kbps. A vocoder which provides a continuous transmission of data eliminates the need for synthesized "comfort noise", with the coding of the background noise providing a more natural quality to the synthesized speech. The invention of the aforementioned patent therefore provides a significant improvement in synthesized speech quality over that of voice activity gating by allowing a smooth transition between speech and background.
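As a rough worked example of how the voice activity factor translates into an average data rate, the following sketch weights the approximate 8 kbps and 1 kbps rates mentioned above by the fraction of time the speaker is talking. The function name average_rate_kbps and the activity factor of 0.4 are assumptions chosen for illustration and are not taken from the above-mentioned patent.

```python
def average_rate_kbps(activity_factor, active_kbps=8.0, pause_kbps=1.0):
    """Average rate when active frames and pause frames use two fixed rates."""
    if not 0.0 <= activity_factor <= 1.0:
        raise ValueError("activity_factor must lie between 0 and 1")
    return activity_factor * active_kbps + (1.0 - activity_factor) * pause_kbps

# Assumed voice activity factor of 0.4 (hypothetical, for illustration only).
avg = average_rate_kbps(0.4)
reduction = 8.0 / avg  # reduction relative to always sending at 8 kbps
```

With these assumed numbers the average is about 3.8 kbps, a reduction slightly greater than a factor of 2 relative to a constant 8 kbps.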
Because the vocoding algorithm of the above-mentioned patent enables short pauses in speech to be detected, a decrease in the effective voice activity factor is realized. Rate decisions can be made on a frame by frame basis with no hangover, so the data rate may be lowered for pauses in speech as short as the frame duration, typically 20 msec. Therefore pauses such as those between syllables may be captured. This technique decreases the voice activity factor beyond what has traditionally been considered, as not only long duration pauses between phrases, but also shorter pauses can be encoded at lower rates.
Since rate decisions are made on a frame basis, there is no clipping of the initial part of the word, such as in a voice activity gating system. Clipping of this nature occurs in a voice activity gating system due to a delay between detection of the speech and a restart in transmission of data. Use of a rate decision based upon each frame results in speech where all transitions have a natural sound.
With the vocoder always transmitting, the speaker's ambient background noise will continually be heard on the receiving end, thereby yielding a more natural sound during speech pauses. The present invention thus provides a smooth transition to background noise. What the listener hears in the background during speech will not suddenly change to a synthesized comfort noise during pauses as in a voice activity gating system.
Since background noise is continually vocoded for transmission, interesting events in the background can be sent with full clarity. In certain cases the interesting background noise may even be coded at the highest rate. Maximum rate coding may occur, for example, when there is someone talking loudly in the background, or if an ambulance drives by a user standing on a street corner. Constant or slowly varying background noise will, however, be encoded at low rates.
The use of variable rate vocoding has the promise of increasing the capacity of a Code Division Multiple Access (CDMA) based digital cellular telephone system by more than a factor of two. CDMA and variable rate vocoding are uniquely matched, since, with CDMA, the interference between channels drops automatically as the rate of data transmission over any channel decreases. In contrast, consider systems in which transmission slots are assigned, such as TDMA or FDMA. In order for such a system to take advantage of any drop in the rate of data transmission, external intervention is required to coordinate the reassignment of unused slots to other users. The inherent delay in such a scheme implies that the channel may be reassigned only during long speech pauses. Therefore, full advantage cannot be taken of the voice activity factor. However, with external coordination, variable rate vocoding is useful in systems other than CDMA because of the other mentioned reasons.
In a CDMA system speech quality can be slightly degraded at times when extra system capacity is desired. Abstractly speaking, the vocoder can be thought of as multiple vocoders all operating at different rates with different resultant speech qualities. Therefore the speech qualities can be mixed in order to further reduce the average rate of data transmission. Initial experiments show that by mixing full and half rate vocoded speech, e.g. the maximum allowable data rate is varied on a frame by frame basis between 8 kbps and 4 kbps, the resulting speech has a quality which is better than half rate variable, 4 kbps maximum, but not as good as full rate variable, 8 kbps maximum.
It is well known that in most telephone conversations, only one person talks at a time. As an additional function for full-duplex telephone links, a rate interlock may be provided. If one direction of the link is transmitting at the highest transmission rate, then the other direction of the link is forced to transmit at the lowest rate. An interlock between the two directions of the link can guarantee no greater than 50% average utilization of each direction of the link. However, when the channel is gated off, as is the case for a rate interlock in activity gating, there is no way for a listener to interrupt the talker to take over the talker role in the conversation. The vocoding method of the above-mentioned patent readily provides the capability of an adaptive rate interlock by control signals which set the vocoding rate.
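The interlock rule stated above can be sketched as a small function. This is only an illustrative sketch: the rate labels, the function name apply_interlock, and the decision to leave the requested rate untouched whenever the opposite direction is below the highest rate are assumptions, since the text specifies only the case in which one direction is at the highest rate.

```python
# Assumed rate labels, ordered from lowest to highest.
RATES = ("eighth", "quarter", "half", "full")

def apply_interlock(opposite_rate, requested_rate):
    """Force the lowest rate while the opposite direction sends at the highest."""
    if opposite_rate not in RATES or requested_rate not in RATES:
        raise ValueError("unknown rate label")
    if opposite_rate == RATES[-1]:
        return RATES[0]
    return requested_rate

forced = apply_interlock("full", "full")   # opposite side at highest rate
free = apply_interlock("half", "full")     # opposite side below highest rate
```

Because the forced direction still transmits at the lowest rate rather than being gated off, the listener's channel remains open in this sketch.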
In the above-mentioned patent the vocoder operates at either full rate when speech is present or eighth rate when speech is not present. The operation of the vocoding algorithm at half and quarter rates is reserved for special conditions of impacted capacity or when other data is to be transmitted in parallel with speech data.
U.S. Pat. No. 5,857,147, issued Jan. 5, 1999, entitled "Method and Apparatus for Determining the Transmission Data Rate in a Multi-User Communication System", assigned to the assignee of the present invention and incorporated by reference herein, details a method by which a communication system, in accordance with system capacity measurements, limits the average data rate of frames encoded by a variable rate vocoder. The system reduces the average data rate by forcing predetermined frames in a string of full rate frames to be coded at a lower rate, i.e. half rate. The problem with reducing the encoding rate for active speech frames in this fashion is that the limiting does not correspond to any characteristics of the input speech and so is not optimized for speech compression quality.
Also, U.S. Pat. No. 5,341,456, issued Aug. 23, 1994, entitled "Improved Method for Determining Speech Encoding Rate in a Variable Rate Vocoder", assigned to the assignee of the present invention and incorporated by reference herein, discloses a method for distinguishing unvoiced speech from voiced speech. The disclosed method examines the energy of the speech and the spectral tilt of the speech and uses the spectral tilt to distinguish unvoiced speech from background noise.
Variable rate vocoders that vary the encoding rate based entirely on the voice activity of the input speech fail to realize the compression efficiency of a variable rate coder that varies the encoding rate based on the complexity or information content that is dynamically varying during active speech. By matching the encoding rates to the complexity of the input waveform more efficient speech coders can be built. Furthermore, systems that seek to dynamically adjust the output data rate of the variable rate vocoders should vary the data rates in accordance with characteristics of the input speech to attain an optimal voice quality for a desired average data rate.
The present invention is a novel and improved method and apparatus for encoding active speech frames at a reduced data rate by encoding speech frames at rates between a predetermined maximum rate and a predetermined minimum rate. The present invention designates a set of active speech operation modes. In the exemplary embodiment of the present invention, there are four active speech operation modes: full rate speech, half rate speech, quarter rate unvoiced speech and quarter rate voiced speech.
It is an objective of the present invention to provide an optimized method for selecting an encoding mode that provides rate efficient coding of the input speech. It is a second objective of the present invention to identify a set of parameters ideally suited for this operational mode selection and to provide a means for generating this set of parameters. Third, it is an objective of the present invention to provide identification of two separate conditions that allow low rate coding with minimal sacrifice to quality. The two conditions are the presence of unvoiced speech and the presence of temporally masked speech. It is a fourth objective of the present invention to provide a method for dynamically adjusting the average output data rate of the speech coder with minimal impact on speech quality.
The present invention provides a set of rate decision criteria referred to as mode measures. A first mode measure is the target matching signal to noise ratio (TMSNR) from the previous encoding frame, which provides information on how well the synthesized speech matches the input speech or, in other words, how well the encoding model is performing. A second mode measure is the normalized autocorrelation function (NACF), which measures periodicity in the speech frame. A third mode measure is the zero crossings (ZC) parameter which is a computationally inexpensive method for measuring high frequency content in an input speech frame. A fourth measure is the prediction gain differential (PGD) which determines if the LPC model is maintaining its prediction efficiency. The fifth measure is the energy differential (ED) which compares the energy in the current frame to an average frame energy.
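To make three of these measures concrete, the following sketch computes a zero crossing count, a normalized autocorrelation maximum, and an energy differential for a single frame. The formulas are common textbook forms adopted here as assumptions, since this section does not give the exact definitions; the function names, the lag range of 20 to 120 samples, the decibel scaling, and the 160-sample frame (20 msec at an assumed 8 kHz sampling rate) are likewise hypothetical.

```python
import math
import random

def zero_crossings(frame):
    """Count sign changes between consecutive samples (assumed ZC definition)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))

def nacf(frame, min_lag=20, max_lag=120):
    """Largest normalized autocorrelation over an assumed range of pitch lags."""
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        head, tail = frame[lag:], frame[:-lag]
        num = sum(x * y for x, y in zip(head, tail))
        den = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        if den > 0.0:
            best = max(best, num / den)
    return best

def energy_differential_db(frame, average_energy):
    """Frame energy relative to an average frame energy, in dB (assumed scaling)."""
    energy = sum(x * x for x in frame)
    return 10.0 * math.log10(max(energy, 1e-12) / max(average_energy, 1e-12))

# Hypothetical test frames: a periodic tone (voiced-like) and Gaussian noise.
voiced = [math.sin(2.0 * math.pi * (n + 0.5) / 40.0) for n in range(160)]
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(160)]

voiced_energy = sum(x * x for x in voiced)
ed_voiced = energy_differential_db(voiced, voiced_energy / 10.0)
```

In this sketch the periodic frame yields an NACF close to one and few zero crossings, while the noise frame yields a lower NACF and many more zero crossings, which is the contrast the NACF and ZC measures are intended to expose.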
The exemplary embodiment of the vocoding algorithm of the present invention uses the five mode measures enumerated above to select an encoding mode for an active speech frame. The rate determination logic of the present invention compares the NACF against a first threshold value and the ZC against a second threshold value to determine if the speech should be coded as unvoiced quarter rate speech.
If it is determined that the active speech frame contains voiced speech, then the vocoder examines the parameter ED to determine if the speech frame should be coded as quarter rate voiced speech. If it is determined that the speech is not to be coded at quarter rate, then the vocoder tests if the speech can be coded at half rate. The vocoder tests the values of TMSNR, PGD and NACF to determine if the speech frame can be coded at half rate. If it is determined that the active speech frame cannot be coded at quarter or half rates, then the frame is coded at full rate.
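The order of the tests described above can be sketched as a short decision function. This is a hedged illustration only: the threshold values, the direction of each comparison, the field names, and the mode labels are all assumptions, because this section states which measures are tested at each step but not the numeric thresholds or inequalities.

```python
from dataclasses import dataclass

@dataclass
class Measures:
    tmsnr: float  # target matching signal to noise ratio, previous frame
    nacf: float   # normalized autocorrelation function
    zc: int       # zero crossings
    pgd: float    # prediction gain differential
    ed: float     # energy differential

@dataclass
class Thresholds:
    # All values below are hypothetical placeholders.
    nacf_unvoiced: float = 0.4
    zc_unvoiced: int = 60
    ed_quarter: float = -14.0
    tmsnr_half: float = 10.0
    pgd_half: float = -5.0
    nacf_half: float = 0.5

def select_mode(m, t):
    """Pick an encoding mode for one active speech frame."""
    # Step 1: weak periodicity with many zero crossings is treated as unvoiced.
    if m.nacf < t.nacf_unvoiced and m.zc > t.zc_unvoiced:
        return "quarter_unvoiced"
    # Step 2: voiced frame with energy well below average (assumed direction).
    if m.ed < t.ed_quarter:
        return "quarter_voiced"
    # Step 3: half rate if the model is matching well on all three measures.
    if m.tmsnr > t.tmsnr_half and m.pgd > t.pgd_half and m.nacf > t.nacf_half:
        return "half"
    # Step 4: otherwise fall back to full rate.
    return "full"

t = Thresholds()
mode_a = select_mode(Measures(tmsnr=5.0, nacf=0.2, zc=90, pgd=0.0, ed=0.0), t)
mode_b = select_mode(Measures(tmsnr=5.0, nacf=0.8, zc=10, pgd=0.0, ed=-20.0), t)
mode_c = select_mode(Measures(tmsnr=15.0, nacf=0.8, zc=10, pgd=0.0, ed=0.0), t)
mode_d = select_mode(Measures(tmsnr=5.0, nacf=0.8, zc=10, pgd=0.0, ed=0.0), t)
```

Ordering the tests from the lowest rate to the highest means that a frame is coded at full rate only after every lower-rate mode has been ruled out.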
It is a further objective to provide a method for dynamically changing threshold values in order to accommodate rate requirements. By varying one or more of the mode selection thresholds, it is possible to increase or decrease the average data transmission rate. Thus, by dynamically adjusting the threshold values, the output rate can be adjusted.
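One way such a threshold adjustment could work is sketched below. The feedback rule, the step size, the use of a single TMSNR threshold that sends a frame to half rate when exceeded, and the sample TMSNR values are all assumptions made for illustration; the text states only that varying thresholds raises or lowers the average rate.

```python
def average_rate(tmsnr_values, threshold, full_kbps=8.0, half_kbps=4.0):
    """Average rate when frames at or above the threshold are coded at half rate."""
    rates = [half_kbps if v >= threshold else full_kbps for v in tmsnr_values]
    return sum(rates) / len(rates)

def adapt_threshold(threshold, measured_kbps, target_kbps,
                    step=0.5, lowest=0.0, highest=30.0):
    """Lower the threshold when the rate is too high, raise it when too low."""
    if measured_kbps > target_kbps:
        threshold -= step   # more frames qualify for half rate
    elif measured_kbps < target_kbps:
        threshold += step   # fewer frames qualify for half rate
    return min(highest, max(lowest, threshold))

# Hypothetical per-frame TMSNR values and a hypothetical 6 kbps target.
tmsnr_values = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
threshold = 20.0
target_kbps = 6.0
for _ in range(50):
    measured = average_rate(tmsnr_values, threshold)
    threshold = adapt_threshold(threshold, measured, target_kbps)
final_rate = average_rate(tmsnr_values, threshold)
```

In this sketch the threshold settles once half of the sample frames qualify for half rate, at which point the average rate equals the assumed target.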