I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to methods and apparatus for reducing sensitivity to frame error conditions in predictive speech coders.
II. Background
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and resynthesizes the speech frames using the unquantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr=Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on: (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
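As a worked illustration of the compression factor Cr=Ni/No, the following sketch computes Ni and No for a single frame. The 20 ms frame length, 8 kHz sampling rate, 16-bit samples, and 4 kbps coded rate are assumed values chosen only for illustration; none of them is specified above, and the function names are hypothetical.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits carried by one frame at the given bit rate."""
    return bit_rate_bps * frame_ms // 1000


def compression_factor(n_i, n_o):
    """Compression factor Cr = Ni / No."""
    if n_o <= 0:
        raise ValueError("No must be positive")
    return n_i / n_o


# Assumed values, for illustration only.
FRAME_MS = 20                  # frame length in milliseconds
INPUT_RATE_BPS = 8000 * 16     # 8 kHz sampling, 16 bits per sample
CODED_RATE_BPS = 4000          # 4 kbps coded rate

n_i = bits_per_frame(INPUT_RATE_BPS, FRAME_MS)   # 2560 bits in the input frame
n_o = bits_per_frame(CODED_RATE_BPS, FRAME_MS)   # 80 bits in the data packet
cr = compression_factor(n_i, n_o)                # 32.0
```

Under these assumptions, each 2560-bit input frame is represented by an 80-bit packet, for a compression factor of 32.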
Perhaps most important in the design of a speech coder is the search for a good set of parameters (including vectors) to describe the speech signal. A good set of parameters requires a low system bandwidth for the reconstruction of a perceptually accurate speech signal. Pitch, signal power, spectral envelope (or formants), amplitude and phase spectra are examples of the speech coding parameters.
Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho and R. M. Gray, Vector Quantization and Signal Compression (1992).
A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L. B. Rabiner and R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residue. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, No, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
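The short-term LP analysis step described above can be sketched as follows. This is a minimal illustration, not the analysis used by any particular CELP coder: it assumes the autocorrelation method with a Levinson-Durbin recursion, omits windowing, quantization, the long-term predictor, and the codebook search, and uses a synthetic 160-sample frame and a prediction order of 2 chosen only to keep the example small. All function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that
    x[n] is approximated by sum_k a[k] * x[n - k]."""
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]                # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        # Reflection coefficient for order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lp_residue(frame, coeffs):
    """Subtract the short-term prediction from each sample
    (samples before the start of the frame are taken as zero)."""
    residue = []
    for n, sample in enumerate(frame):
        prediction = sum(c * frame[n - k]
                         for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residue.append(sample - prediction)
    return residue


def energy(signal):
    return sum(v * v for v in signal)


# Synthetic frame (assumed): a second-order resonance excited by a
# pulse every 40 samples, loosely imitating voiced speech.
frame = []
for n in range(160):
    excitation = 1.0 if n % 40 == 0 else 0.0
    y1 = frame[n - 1] if n >= 1 else 0.0
    y2 = frame[n - 2] if n >= 2 else 0.0
    frame.append(1.3 * y1 - 0.8 * y2 + excitation)

coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
residue = lp_residue(frame, coeffs)
```

Because the short-term correlation has been removed, the residue carries less energy than the frame itself, which is what makes it cheaper to model and quantize in the later stages.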
Time-domain coders such as the CELP coder typically rely upon a high number of bits, No, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits, No, per frame is relatively large (e.g., at bit rates of 8 kbps or above). However, at low bit rates (4 kbps and below), time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications. Hence, despite improvements over time, many CELP coding systems operating at low bit rates suffer from perceptually significant distortion typically characterized as noise.
There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions. An exemplary low-rate speech coder is the prototype pitch period (PPP) speech coder described in U.S. application Ser. No. 09/217,341, entitled VARIABLE RATE SPEECH CODING, filed Dec. 21, 1998, assigned to the assignee of the present invention, and fully incorporated herein by reference.
In conventional predictive speech coders such as the CELP coder, the PPP coder, and the waveform interpolation (WI) coder, the coding scheme relies heavily upon past output. Hence, if a frame error or a frame erasure is received at the decoder, the decoder must create its own best replacement for the frame in question. The decoder typically uses an intelligent frame repeat of the previous output. Because the decoder must create its own replacement, the decoder and the encoder lose synchronization with each other. Therefore, when the next frame arrives at the decoder, if that frame is predictively coded, the decoder refers to different previous output than the encoder used. This causes a reduction in voice quality or speech coder performance. The more heavily the speech coder relies on predictive coding techniques (i.e., the more frames the speech coder encodes predictively), the greater the reduction in performance. Thus, there is a need for a method of reducing sensitivity to frame error conditions in a predictive speech coder.
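The loss of synchronization described above can be illustrated with a deliberately simplified model. In this sketch, each frame is reduced to a single integer parameter, the predictive coding mode is modeled as sending the difference from the previous frame, and the decoder handles an erased frame by repeating its previous output. These simplifications, the sample values, and the function names are assumptions made for illustration and do not represent any actual coder.

```python
def encode_predictive(values):
    """Send each frame as the difference from the previous frame."""
    prev = 0
    packets = []
    for v in values:
        packets.append(v - prev)
        prev = v               # the encoder's record of past output
    return packets


def decode(packets, erased):
    """Rebuild frames; an erased frame is replaced by a frame repeat."""
    prev = 0
    out = []
    for i, delta in enumerate(packets):
        if i in erased:
            cur = prev         # decoder's own replacement for the lost frame
        else:
            cur = prev + delta # prediction from the decoder's past output
        out.append(cur)
        prev = cur
    return out


values = [10, 12, 15, 15, 14, 18]
packets = encode_predictive(values)

clean = decode(packets, erased=set())   # no frame errors
lossy = decode(packets, erased={2})     # frame 2 is erased

# Difference between what the encoder coded and what the decoder produced.
errors = [v - d for v, d in zip(values, lossy)]   # [0, 0, 3, 3, 3, 3]
```

In this model a single erased frame leaves the decoder offset from the encoder in every later frame, because each subsequent predictively coded frame is built on the decoder's mismatched past output.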
The present invention is directed to a method of reducing sensitivity to frame error conditions in a predictive speech coder. Accordingly, in one aspect of the invention, a speech coder is provided. The speech coder advantageously includes at least one predictive coding mode; at least one less-predictive coding mode; and a processor coupled to the at least one predictive coding mode and to the at least one less-predictive coding mode, the processor being configured to cause successive speech frames to be coded by selected coding modes in accordance with a pattern of coded speech frames, the pattern including at least one speech frame coded with the less-predictive coding mode.
In another aspect of the invention, a method of coding speech frames is provided. The method advantageously includes the steps of coding a predefined number of successive speech frames with a predictive coding mode; coding at least one speech frame with a less-predictive coding mode after performing the step of coding a predefined number of successive speech frames with a predictive coding mode; and repeating the two coding steps in order to generate a plurality of speech frames coded in accordance with a pattern.
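The repeating pattern recited in this method can be sketched as a simple mode schedule. The mode labels, the function names, and the choice of three predictive frames per less-predictive frame are assumptions made for illustration; the description above leaves the predefined number open.

```python
from itertools import cycle, islice

PREDICTIVE = "P"
LESS_PREDICTIVE = "L"


def mode_pattern(num_predictive, num_less_predictive=1):
    """One period of the pattern: predictive frames first, then
    at least one less-predictive frame."""
    if num_predictive < 1 or num_less_predictive < 1:
        raise ValueError("each mode must appear at least once per period")
    return ([PREDICTIVE] * num_predictive
            + [LESS_PREDICTIVE] * num_less_predictive)


def assign_modes(num_frames, num_predictive, num_less_predictive=1):
    """Repeat the pattern over a run of successive speech frames."""
    period = mode_pattern(num_predictive, num_less_predictive)
    return list(islice(cycle(period), num_frames))


modes = assign_modes(num_frames=8, num_predictive=3)
schedule = "".join(modes)   # "PPPLPPPL"
```

With this schedule, no more than three successive frames are coded predictively before a less-predictive frame occurs.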
In another aspect of the invention, a speech coder is provided. The speech coder advantageously includes means for coding a predefined number of successive speech frames with a predictive coding mode; means for coding at least one speech frame with a less-predictive coding mode after the predefined number of successive speech frames have been coded with the predictive coding mode; and means for generating a plurality of speech frames coded in accordance with a pattern, the pattern including at least one speech frame coded with a less-predictive coding mode.
In another aspect of the invention, a method of coding speech frames is provided. The method advantageously includes the step of coding a plurality of speech frames in a pattern, the pattern including at least one predictively coded speech frame and at least one less-predictively coded speech frame.
In another aspect of the invention, a method of coding speech frames is provided. The method advantageously includes the step of coding a plurality of speech frames in a pattern, the pattern including at least one heavily predictively coded speech frame and at least one mildly predictively coded speech frame.
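The effect of mixing heavily and mildly predictively coded frames can be sketched with a toy model. Here each frame is a single number, a heavily predictive frame is modeled as relying on the previous output with a weight of 1.0, and a mildly predictive frame with a weight of 0.25. The weights, the pattern, the frame-repeat erasure handling, and all names are assumptions made for illustration; the description above does not specify how the modes differ internally.

```python
# Assumed prediction weights: how strongly each mode relies on past output.
WEIGHT = {"H": 1.0, "M": 0.25}   # H = heavily, M = mildly predictive


def encode(values, modes):
    """Send, for each frame, the part not predicted from the previous frame."""
    prev = 0.0
    packets = []
    for v, mode in zip(values, modes):
        packets.append((mode, v - WEIGHT[mode] * prev))
        prev = v
    return packets


def decode(packets, erased):
    """Rebuild frames; an erased frame is replaced by a frame repeat."""
    prev = 0.0
    out = []
    for i, (mode, payload) in enumerate(packets):
        if i in erased:
            cur = prev
        else:
            cur = WEIGHT[mode] * prev + payload
        out.append(cur)
        prev = cur
    return out


values = [10, 12, 15, 15, 14, 18, 17, 16]
modes = list("HHHMHHHM")          # three heavy frames, then one mild frame

decoded = decode(encode(values, modes), erased={1})
errors = [v - d for v, d in zip(values, decoded)]
# errors == [0.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.125]
```

In this model the mismatch introduced by the erased frame persists unchanged through the heavily predictive frames and is reduced by a factor of four at each mildly predictive frame, so the pattern bounds how long a single frame error continues to affect the output.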