1. Technical Field
This invention relates to speech communication systems and, more particularly, to systems for digital speech coding.
2. Related Art
One prevalent mode of human communication is through communication systems. Communication systems include both wireline and wireless radio based systems. Wireless communication systems are electrically connected with the wireline based systems and communicate with the mobile communication devices using radio frequency (RF) communication. Currently, the radio frequencies available for communication in cellular systems, for example, are in the cellular frequency range centered around 900 MHz and in the personal communication services (PCS) frequency range centered around 1900 MHz. Data and voice transmissions within the wireless system have a bandwidth that consumes a portion of the radio frequency. Due to increased traffic caused by the expanding popularity of wireless communication devices, such as cellular telephones, it is desirable to reduce the bandwidth of transmissions within the wireless systems.
Digital transmission in wireless radio communications is increasingly applied to both voice and data due to noise immunity, reliability, compactness of equipment and the ability to implement sophisticated signal processing functions using digital techniques. Digital transmission of speech signals involves the steps of: sampling an analog speech waveform with an analog-to-digital converter, speech compression (encoding), transmission, speech decompression (decoding), digital-to-analog conversion, and playback into an earpiece or a loudspeaker. The sampling of the analog speech waveform with the analog-to-digital converter creates a digital signal. However, the number of bits used in the digital signal to represent the analog speech waveform creates a relatively large bandwidth. For example, a speech signal that is sampled at a rate of 8000 Hz (once every 0.125 ms), where each sample is represented by 16 bits, will result in a bit rate of 128,000 (16 × 8000) bits per second, or 128 Kbps (Kilobits per second).
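The arithmetic in the example above can be checked with a minimal sketch. The function names are illustrative only and are not part of the described system.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed bit rate, in bits per second, of a sampled signal."""
    return sample_rate_hz * bits_per_sample


def sample_period_ms(sample_rate_hz: int) -> float:
    """Time between consecutive samples, in milliseconds."""
    return 1000.0 / sample_rate_hz


if __name__ == "__main__":
    print(pcm_bit_rate(8000, 16))   # 128000 bits per second (128 Kbps)
    print(sample_period_ms(8000))   # 0.125 ms between samples
```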
Speech compression may be used to reduce the number of bits that represent the speech signal thereby reducing the bandwidth needed for transmission. However, speech compression may result in degradation of the quality of decompressed speech. In general, a higher bit rate will result in higher quality, while a lower bit rate will result in lower quality. However, modern speech compression techniques, such as coding techniques, can produce decompressed speech of relatively high quality at relatively low bit rates. In general, modern coding techniques attempt to represent the perceptually important features of the speech signal, without preserving the actual speech waveform.
One coding technique used to lower the bit rate involves varying the degree of speech compression (i.e. varying the bit rate) depending on the part of the speech signal being compressed. Typically, parts of the speech signal for which adequate perceptual representation is more difficult (such as voiced speech, plosives, or voiced onsets) are coded and transmitted using a higher number of bits. Conversely, parts of the speech signal for which adequate perceptual representation is less difficult (such as unvoiced speech, or the silence between words) are coded with a lower number of bits. The resulting average bit rate for the speech signal will be lower than would be the case for a fixed bit rate that provides decompressed speech of similar quality.
Speech compression systems, commonly called codecs, include an encoder and a decoder and may be used to reduce the bit rate of digital speech signals. Numerous algorithms have been developed for speech codecs that reduce the number of bits required to digitally encode the original speech while attempting to maintain high quality reconstructed speech. Code-Excited Linear Predictive (CELP) coding techniques, as discussed in the article entitled "Code-Excited Linear Prediction: High-Quality Speech at Very Low Rates," by M. R. Schroeder and B. S. Atal, Proc. ICASSP-85, pages 937-940, 1985, provide one effective speech coding algorithm. An example of a variable rate CELP based speech coder is the TIA (Telecommunications Industry Association) IS-127 standard that is designed for CDMA (Code Division Multiple Access) applications. The CELP coding technique utilizes several prediction techniques to remove the redundancy from the speech signal. The CELP coding approach is frame-based in the sense that it stores sampled input speech signals into blocks of samples called frames. The frames of data may then be processed to create a compressed speech signal in digital form.
The CELP coding approach uses two types of predictors, a short-term predictor and a long-term predictor. The short-term predictor typically is applied before the long-term predictor. A prediction error derived from the short-term predictor is commonly called short-term residual, and a prediction error derived from the long-term predictor is commonly called long-term residual. The long-term residual may be coded using a fixed codebook that includes a plurality of fixed codebook entries or vectors. One of the entries may be selected and multiplied by a fixed codebook gain to represent the long-term residual. The short-term predictor also can be referred to as an LPC (Linear Prediction Coding) or a spectral representation, and typically comprises 10 prediction parameters. The long-term predictor also can be referred to as a pitch predictor or an adaptive codebook and typically comprises a lag parameter and a long-term predictor gain parameter. Each lag parameter also can be called a pitch lag, and each long-term predictor gain parameter can also be called an adaptive codebook gain. The lag parameter defines an entry or a vector in the adaptive codebook.
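The two prediction stages described above can be sketched as follows. This is a simplified open-loop illustration, not the coder's actual procedure: the coefficients, lag and gain are passed in rather than estimated, samples before the start of the signal are assumed to be zero, and all function names are hypothetical.

```python
from typing import List, Sequence


def short_term_residual(speech: Sequence[float], lpc: Sequence[float]) -> List[float]:
    """Short-term residual e[n] = s[n] - sum_k a_k * s[n-k].

    `lpc` holds the prediction parameters a_1..a_p (typically p = 10).
    Samples before the start of `speech` are assumed to be zero.
    """
    residual = []
    for n in range(len(speech)):
        prediction = sum(
            a * speech[n - k] for k, a in enumerate(lpc, start=1) if n - k >= 0
        )
        residual.append(speech[n] - prediction)
    return residual


def long_term_residual(st_residual: Sequence[float], lag: int, gain: float) -> List[float]:
    """Long-term residual r[n] = e[n] - gain * e[n - lag] (zero before the start)."""
    return [
        e - (gain * st_residual[n - lag] if n >= lag else 0.0)
        for n, e in enumerate(st_residual)
    ]


if __name__ == "__main__":
    # A decaying signal that a one-tap predictor (a_1 = 0.5) predicts exactly.
    print(short_term_residual([1.0, 0.5, 0.25, 0.125], [0.5]))  # [1.0, 0.0, 0.0, 0.0]
    # A pulse train with period 3 is removed by a pitch predictor with lag 3.
    print(long_term_residual([1.0, 0.0, 0.0, 1.0, 0.0, 0.0], lag=3, gain=1.0))
```

In both examples only the first sample survives, which shows how each predictor removes the redundancy it is designed for and leaves less information to be coded by the fixed codebook.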
The CELP encoder performs an LPC analysis to determine the short-term predictor parameters. Following the LPC analysis, the long-term predictor parameters may be determined. In addition, determination of the fixed codebook entry and the fixed codebook gain that best represent the long-term residual occurs. The powerful concept of analysis-by-synthesis (ABS) is employed in CELP coding. In the ABS approach, the best contribution from the fixed codebook, the best fixed codebook gain, and the best long-term predictor parameters may be found by synthesizing them using an inverse prediction filter and applying a perceptual weighting measure. The short-term (LPC) prediction coefficients, the fixed-codebook gain, as well as the lag parameter and the long-term gain parameter may then be quantized. The quantization indices, as well as the fixed codebook indices, may be sent from the encoder to the decoder.
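A minimal sketch of an analysis-by-synthesis search over a fixed codebook is given below. It makes several simplifying assumptions that are not taken from the description above: the perceptual weighting is omitted (a plain squared error is used), the filter starts from a zero state, the gain is left unquantized, and the adaptive codebook contribution is ignored. The function names and the toy codebook are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def synthesize(excitation: Sequence[float], lpc: Sequence[float]) -> List[float]:
    """All-pole synthesis y[n] = x[n] + sum_k a_k * y[n-k], zero initial state."""
    y: List[float] = []
    for n, x in enumerate(excitation):
        y.append(x + sum(a * y[n - k] for k, a in enumerate(lpc, start=1) if n - k >= 0))
    return y


def search_fixed_codebook(
    target: Sequence[float],
    codebook: Sequence[Sequence[float]],
    lpc: Sequence[float],
) -> Optional[Tuple[int, float, float]]:
    """Return (index, gain, squared error) of the entry that best matches `target`.

    Every candidate is synthesized first, and the error is measured on the
    synthesized signal rather than on the excitation itself.
    """
    best: Optional[Tuple[int, float, float]] = None
    for index, vector in enumerate(codebook):
        y = synthesize(vector, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero entry cannot be scaled to match anything
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    lpc = [0.5]
    codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    target = synthesize([0.0, 2.0, 0.0, 0.0], lpc)  # entry 1 scaled by 2
    print(search_fixed_codebook(target, codebook, lpc))  # (1, 2.0, 0.0)
```

The indices and gains found this way are the values that would subsequently be quantized and sent to the decoder.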
The CELP decoder uses the fixed codebook indices to extract a vector from the fixed codebook. The vector may be multiplied by the fixed-codebook gain, to create a long-term excitation also known as a fixed codebook contribution. A long-term predictor contribution may be added to the long-term excitation to create a short-term excitation that commonly is referred to simply as an excitation. The long-term predictor contribution comprises the short-term excitation from the past multiplied by the long-term predictor gain. The addition of the long-term predictor contribution alternatively can be viewed as an adaptive codebook contribution or as a long-term (pitch) filtering. The short-term excitation may be passed through a short-term inverse prediction filter (LPC) that uses the short-term (LPC) prediction coefficients quantized by the encoder to generate synthesized speech. The synthesized speech may then be passed through a post-filter that reduces perceptual coding noise.
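The decoder steps above can be sketched as follows, under stated assumptions: gains and coefficients are taken as already dequantized, the synthesis filter starts from a zero state, the post-filter is omitted, and the lag is an integer no longer than the stored excitation history. The function names are illustrative only.

```python
from typing import List, Sequence


def build_excitation(
    fixed_vector: Sequence[float],
    fixed_gain: float,
    past_excitation: Sequence[float],
    lag: int,
    pitch_gain: float,
) -> List[float]:
    """Short-term excitation = fixed codebook contribution + long-term contribution.

    The long-term contribution is the excitation from `lag` samples earlier,
    scaled by the long-term predictor gain.
    """
    if not 1 <= lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the length of the history")
    history = list(past_excitation)
    start = len(history)
    for c in fixed_vector:
        adaptive = pitch_gain * history[-lag]  # excitation sample `lag` samples back
        history.append(fixed_gain * c + adaptive)
    return history[start:]


def synthesize(excitation: Sequence[float], lpc: Sequence[float]) -> List[float]:
    """Short-term inverse prediction filter y[n] = x[n] + sum_k a_k * y[n-k]."""
    y: List[float] = []
    for n, x in enumerate(excitation):
        y.append(x + sum(a * y[n - k] for k, a in enumerate(lpc, start=1) if n - k >= 0))
    return y


if __name__ == "__main__":
    excitation = build_excitation(
        fixed_vector=[1.0, 0.0, 0.0, 0.0],
        fixed_gain=2.0,
        past_excitation=[0.0, 0.0],
        lag=2,
        pitch_gain=0.5,
    )
    print(excitation)                     # [2.0, 0.0, 1.0, 0.0]
    print(synthesize(excitation, [0.5]))  # [2.0, 1.0, 1.5, 0.75]
```

The single codebook pulse reappears two samples later at half amplitude, which is the long-term (pitch) filtering view of the adaptive codebook contribution.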
These speech compression techniques have resulted in lowering the amount of bandwidth used to transmit a speech signal. However, further reduction in bandwidth is particularly important in a communication system that has to allocate its resources to a large number of users. Accordingly, there is a need for systems and methods of speech coding that are capable of minimizing the average bit rate needed for speech representation, while providing high quality decompressed speech.
This invention provides systems for encoding and decoding speech signals. The embodiments may use the CELP coding technique and prediction based coding as a framework to employ signal-processing functions using waveform matching and perceptual related techniques. These techniques allow the generation of synthesized speech that closely resembles the original speech by including perceptual features while maintaining a relatively low bit rate. One application of the embodiments is in wireless communication systems. In this application, the encoding of original speech, or the decoding to generate synthesized speech, may occur at mobile communication devices. In addition, encoding and decoding may occur within wireline-based systems or within other wireless communication systems to provide interfaces to wireline-based systems.
One embodiment of a speech compression system includes a full-rate codec, a half-rate codec, a quarter-rate codec and an eighth-rate codec each capable of encoding and decoding speech signals. The full-rate, half-rate, quarter-rate and eighth-rate codecs encode the speech signals at bit rates of 8.5 Kbps, 4 Kbps, 2 Kbps and 0.8 Kbps, respectively. The speech compression system performs a rate selection on a frame of a speech signal to select one of the codecs. The rate selection is performed on a frame-by-frame basis. Frames are created by dividing the speech signal into segments of a finite length of time. Since each frame may be coded with a different bit rate, the speech compression system is a variable-rate speech compression system that codes the speech at an average bit rate.
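As an illustration of how the average bit rate follows from frame-by-frame rate selection, the sketch below averages the four codec rates over a sequence of frames. It assumes that all frames have the same duration, and the particular mix of frames is invented for the example rather than taken from any measurement.

```python
from typing import Sequence

# Codec bit rates in Kbps, as listed above.
RATES_KBPS = {"full": 8.5, "half": 4.0, "quarter": 2.0, "eighth": 0.8}


def average_bit_rate(frame_rates: Sequence[str]) -> float:
    """Average bit rate in Kbps over equal-length frames."""
    if not frame_rates:
        raise ValueError("at least one frame is required")
    return sum(RATES_KBPS[rate] for rate in frame_rates) / len(frame_rates)


if __name__ == "__main__":
    # Hypothetical selection for ten consecutive frames.
    frames = ["full"] * 4 + ["half"] * 3 + ["quarter"] * 1 + ["eighth"] * 2
    print(f"{average_bit_rate(frames):.2f} Kbps")  # 4.96 Kbps
```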
The rate selection is determined by characterization of each frame of the speech signal based on the portion of the speech signal contained in the particular frame. For example, frames may be characterized as stationary voiced, non-stationary voiced, unvoiced, background noise, silence etc. In addition, the rate selection is based on a Mode that the speech compression system is operating within. The different Modes indicate the desired average bit rate. The codecs are designed for optimized coding within the different characterizations of the speech signals. Optimal coding balances the desire to provide synthesized speech of the highest perceptual quality while maintaining the desired average bit rate, thereby maximizing use of the available bandwidth. During operation, the speech compression system selectively activates the codecs based on the Mode as well as characterization of the frame in an attempt to optimize the perceptual quality of the synthesized speech.
Once the full or the half-rate codec is selected by the rate selection, a type classification of the speech signal occurs to further optimize coding. The type classification may be a first type (i.e. a Type One) for frames containing a harmonic structure and a formant structure that do not change rapidly or a second type (i.e. a Type Zero) for all other frames. The bit allocation of the full-rate and half-rate codecs may be adjusted in response to the type classification to further optimize the coding of the frame. The adjustment of the bit allocation provides improved perceptual quality of the reconstructed speech signal by emphasizing different aspects of the speech signal within each frame.
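The relationship between rate selection and type classification can be sketched as a small data model. The sketch only captures the structure stated above, namely that the type classification applies to frames coded by the full-rate or half-rate codec. How a frame is judged to have a slowly changing harmonic and formant structure is not specified here, so that decision is passed in as a boolean, and all names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Rate(Enum):
    FULL = "full"
    HALF = "half"
    QUARTER = "quarter"
    EIGHTH = "eighth"


class FrameType(Enum):
    TYPE_ZERO = 0  # all other frames
    TYPE_ONE = 1   # harmonic and formant structure that do not change rapidly


def type_classification(rate: Rate, stable_harmonic_and_formant: bool) -> Optional[FrameType]:
    """Return the frame type, or None when no type classification applies.

    `stable_harmonic_and_formant` stands in for an unspecified signal analysis.
    """
    if rate not in (Rate.FULL, Rate.HALF):
        return None
    return FrameType.TYPE_ONE if stable_harmonic_and_formant else FrameType.TYPE_ZERO


if __name__ == "__main__":
    print(type_classification(Rate.FULL, True))
    print(type_classification(Rate.HALF, False))
    print(type_classification(Rate.EIGHTH, True))
```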
Accordingly, the speech coder is capable of selectively activating the codecs to maximize the overall quality of a reconstructed speech signal while maintaining the desired average bit rate. Other systems, methods, features and advantages of the invention will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.