I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to methods and apparatus for compensating for frame erasures in variable-rate speech coders.
II. Background
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices for compressing speech find use in many fields of telecommunications. An exemplary field is wireless communications. The field of wireless communications has many applications including, e.g., cordless telephones, paging, wireless local loops, wireless telephony such as cellular and PCS telephone systems, mobile Internet Protocol (IP) telephony, and satellite communication systems. A particularly important application is wireless telephony for mobile subscribers.
Various over-the-air interfaces have been developed for wireless communication systems including, e.g., frequency division multiple access (FDMA), time division multiple access (TDMA), and code division multiple access (CDMA). In connection therewith, various domestic and international standards have been established including, e.g., Advanced Mobile Phone Service (AMPS), Global System for Mobile Communications (GSM), and Interim Standard 95 (IS-95). An exemplary wireless telephony communication system is a code division multiple access (CDMA) system. The IS-95 standard and its derivatives, IS-95A, ANSI J-STD-008, IS-95B, proposed third generation standards IS-95C and IS-2000, etc. (referred to collectively herein as IS-95), are promulgated by the Telecommunication Industry Association (TIA) and other well known standards bodies to specify the use of a CDMA over-the-air interface for cellular or PCS telephony communication systems. Exemplary wireless communication systems configured substantially in accordance with the use of the IS-95 standard are described in U.S. Pat. Nos. 5,103,459 and 4,901,307, which are assigned to the assignee of the present invention and fully incorporated herein by reference.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and resynthesizes the speech frames using the unquantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr=Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
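The compression factor defined above can be illustrated with a short worked example. The following is a minimal sketch only; the 20 ms frame length and the 8 kbps coder rate are assumed figures chosen for illustration, and the function name is hypothetical.

```python
def compression_factor(bits_in: int, bits_out: int) -> float:
    """Return Cr = Ni / No for a single frame."""
    if bits_out <= 0:
        raise ValueError("bits_out must be positive")
    return bits_in / bits_out


# Assumed figures (not taken from the text): a 20 ms frame of speech
# digitized at 64 kbps, encoded by a coder running at 8 kbps.
FRAME_MS = 20
bits_in = 64_000 * FRAME_MS // 1000   # Ni: bits in the input frame
bits_out = 8_000 * FRAME_MS // 1000   # No: bits in the data packet
cr = compression_factor(bits_in, bits_out)
```

Under these assumptions the input frame holds 1280 bits, the packet holds 160 bits, and the compression factor is 8.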
Perhaps most important in the design of a speech coder is the search for a good set of parameters (including vectors) to describe the speech signal. A good set of parameters requires a low system bandwidth for the reconstruction of a perceptually accurate speech signal. Pitch, signal power, spectral envelope (or formants), amplitude spectra, and phase spectra are examples of the speech coding parameters.
Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which attempt to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in A. Gersho and R. M. Gray, Vector Quantization and Signal Compression (1992).
A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder described in L. B. Rabiner and R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residue. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, No, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable rate CELP coder is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
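The LP analysis step described above can be sketched as follows. This is an illustrative sketch, not the analysis procedure of any particular CELP coder: it assumes the autocorrelation method with the Levinson-Durbin recursion, and all function names and the test signal are hypothetical.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - k] for i in range(k, n))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break  # degenerate frame (e.g., all zeros)
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(frame, coeffs):
    """LP residue: the frame minus its short-term prediction."""
    res = []
    for n, x in enumerate(frame):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        res.append(x - pred)
    return res


# Assumed test signal: a decaying exponential, which a first-order
# predictor with coefficient 0.9 models almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
residual = lp_residual(frame, coeffs)
```

For this test signal the recovered coefficients are close to 0.9 and 0, and the residue carries far less energy than the frame itself, which is the redundancy removal that the remaining CELP stages then model and quantize.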
Time-domain coders such as the CELP coder typically rely upon a high number of bits, No, per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality provided the number of bits, No, per frame is relatively large (e.g., 8 kbps or above). However, at low bit rates (4 kbps and below), time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications. Hence, despite improvements over time, many CELP coding systems operating at low bit rates suffer from perceptually significant distortion typically characterized as noise.
There is presently a surge of research interest and strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions.
One effective technique to encode speech efficiently at low bit rates is multimode coding. An exemplary multimode coding technique is described in U.S. application Ser. No. 09/217,341, entitled VARIABLE RATE SPEECH CODING, filed Dec. 21, 1998, assigned to the assignee of the present invention, and fully incorporated herein by reference. Conventional multimode coders apply different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to optimally represent a certain type of speech segment, such as voiced speech, unvoiced speech, transition speech (e.g., between voiced and unvoiced), and background noise (silence, or nonspeech) in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation.
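An open-loop mode decision of the kind described above can be sketched as follows. This is a simplified illustration only: the two features (frame energy and zero-crossing rate), the thresholds, and the function names are assumptions chosen for the example, and transition-speech detection is omitted.

```python
import math
import random


def frame_features(frame):
    """Return (mean energy, zero-crossing rate) for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)


def choose_mode(frame, energy_floor=1e-4, zcr_split=0.25):
    """Pick a coding mode from the features; thresholds are hypothetical."""
    energy, zcr = frame_features(frame)
    if energy < energy_floor:
        return "background"
    return "unvoiced" if zcr > zcr_split else "voiced"


# Assumed test frames: 160 samples each (20 ms at 8 kHz).
voiced = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(160)]
rng = random.Random(0)
unvoiced = [rng.uniform(-0.3, 0.3) for _ in range(160)]
silence = [0.0] * 160

modes = [choose_mode(f) for f in (voiced, unvoiced, silence)]
```

A practical mode decision would typically evaluate more parameters and track the previous frame's mode; the sketch shows only the extract, evaluate, and decide structure.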
Coding systems that operate at rates on the order of 2.4 kbps are generally parametric in nature. That is, such coding systems operate by transmitting parameters describing the pitch-period and the spectral envelope (or formants) of the speech signal at regular intervals. Illustrative of these so-called parametric coders is the LP vocoder system.
LP vocoders model a voiced speech signal with a single pulse per pitch period. This basic technique may be augmented to include the transmission of information about the spectral envelope, among other things. Although LP vocoders provide reasonable performance generally, they may introduce perceptually significant distortion, typically characterized as buzz.
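The single-pulse-per-pitch-period model can be sketched as an impulse train driving an all-pole synthesis filter. The following is a minimal sketch under assumed values; the pitch period, the single filter coefficient, and the function names are hypothetical.

```python
def pulse_train(num_samples, pitch_period):
    """Excitation with one unit pulse per pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0
            for n in range(num_samples)]


def synthesize(excitation, lp_coeffs):
    """All-pole synthesis: s[n] = e[n] + sum_k a[k] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a * out[n - k]
                   for k, a in enumerate(lp_coeffs, start=1) if n - k >= 0)
        out.append(e + past)
    return out


# Assumed values: pitch period of 4 samples, one-tap filter.
excitation = pulse_train(8, 4)
speech = synthesize(excitation, [0.5])
```

Because every pitch period is excited by an identical pulse, the output is strictly periodic, which is one way to understand the buzzy character attributed to such coders.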
In recent years, coders have emerged that are hybrids of both waveform coders and parametric coders. Illustrative of these so-called hybrid coders is the prototype-waveform interpolation (PWI) speech coding system. The PWI coding system may also be known as a prototype pitch period (PPP) speech coder. A PWI coding system provides an efficient method for coding voiced speech. The basic concept of PWI is to extract a representative pitch cycle (the prototype waveform) at fixed intervals, to transmit its description, and to reconstruct the speech signal by interpolating between the prototype waveforms. The PWI method may operate either on the LP residual signal or on the speech signal. An exemplary PWI, or PPP, speech coder is described in U.S. application Ser. No. 09/217,494, entitled PERIODIC SPEECH CODING, filed Dec. 21, 1998, now U.S. Pat. No. 6,456,964 issued Oct. 24, 2002, assigned to the assignee of the present invention, and fully incorporated herein by reference. Other PWI, or PPP, speech coders are described in U.S. Pat. No. 5,884,253 and W. Bastiaan Kleijn and Wolfgang Granzow, "Methods for Waveform Interpolation in Speech Coding," in 1 Digital Signal Processing 215-230 (1991).
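The interpolation step of PWI can be sketched as follows. This is a deliberately simplified illustration: it assumes that the two prototypes have already been aligned and brought to the same length, and it uses plain linear interpolation, whereas practical PWI coders must also handle a changing pitch period. The function name is hypothetical.

```python
def interpolate_prototypes(prev_proto, curr_proto, num_cycles):
    """Rebuild num_cycles pitch cycles that evolve linearly from the
    previous prototype toward the current one."""
    if len(prev_proto) != len(curr_proto):
        raise ValueError("prototypes must have equal length in this sketch")
    out = []
    for c in range(1, num_cycles + 1):
        w = c / num_cycles  # weight of the current prototype
        out.extend((1.0 - w) * p + w * q
                   for p, q in zip(prev_proto, curr_proto))
    return out


# Assumed toy prototypes of two samples each.
cycles = interpolate_prototypes([0.0, 0.0], [1.0, 2.0], num_cycles=2)
```

Here the first reconstructed cycle lies halfway between the two prototypes and the last cycle equals the current prototype, so only the prototypes themselves need to be transmitted.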
In most conventional speech coders, the parameters of a given pitch prototype, or of a given frame, are each individually quantized and transmitted by the encoder. In addition, a difference value is transmitted for each parameter. The difference value specifies the difference between the parameter value for the current frame or prototype and the parameter value for the previous frame or prototype. However, quantizing the parameter values and the difference values requires using bits (and hence bandwidth). In a low-bit-rate speech coder, it is advantageous to transmit the least number of bits possible to maintain satisfactory voice quality. For this reason, in conventional low-bit-rate speech coders, only the absolute parameter values are quantized and transmitted. It would be desirable to decrease the number of bits transmitted without decreasing the informational value. Accordingly, a quantization scheme that quantizes the difference between a weighted sum of the parameter values for previous frames and the parameter value for the current frame is described in a related U.S. application Ser. No. 09/557,282, filed Apr. 24, 2000, entitled "METHOD AND APPARATUS FOR PREDICTIVELY QUANTIZING VOICED SPEECH," assigned to the assignee of the present invention, and fully incorporated herein by reference.
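The idea of quantizing the difference between a weighted sum of previous parameter values and the current parameter value can be sketched as follows. This is not the scheme of the referenced application; the uniform scalar quantizer, the weights, the step size, and the function names are all assumptions made for illustration.

```python
def predict(history, weights):
    """Weighted sum of previously decoded values, most recent first."""
    return sum(w * h for w, h in zip(weights, history))


def encode(value, history, weights, step):
    """Quantize only the prediction residual to an integer index."""
    return round((value - predict(history, weights)) / step)


def decode(index, history, weights, step):
    """Rebuild the value from the same prediction plus the residual."""
    return predict(history, weights) + index * step


# Assumed example: a parameter with two previously decoded values.
history = [50.0, 48.0]
weights = [0.75, 0.25]
step = 0.5

index = encode(51.0, history, weights, step)
decoded = decode(index, history, weights, step)
```

The prediction here is 49.5, so only the small residual of 1.5 is quantized (index 3). The same dependence on history is what makes such a scheme sensitive to frame erasures: if a previous frame is lost, the decoder's history no longer matches the encoder's.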
Speech coders experience frame erasure, or packet loss, due to poor channel conditions. One solution used in conventional speech coders was to have the decoder simply repeat the previous frame in the event of a frame erasure. An improvement is found in the use of an adaptive codebook, which dynamically adjusts the frame immediately following a frame erasure. A further refinement, the enhanced variable rate coder (EVRC), is standardized in the Telecommunication Industry Association Interim Standard EIA/TIA IS-127. The EVRC coder relies upon a correctly received, low-predictively encoded frame to alter in the coder memory the frame that was not received, and thereby improve the quality of the correctly received frame.
A problem with the EVRC coder, however, is that discontinuities between a frame erasure and a subsequent adjusted good frame may arise. For example, pitch pulses may be placed too close, or too far apart, as compared to their relative locations in the event no frame erasure had occurred. Such discontinuities may cause an audible click.
In general, speech coders involving low predictability (such as those described in the paragraph above) perform better under frame erasure conditions. However, as discussed, such speech coders require relatively higher bit rates. Conversely, a highly predictive speech coder can achieve a good quality of synthesized speech output (particularly for highly periodic speech such as voiced speech), but performs worse under frame erasure conditions. It would be desirable to combine the qualities of both types of speech coder. It would further be advantageous to provide a method of smoothing discontinuities between frame erasures and subsequent altered good frames. Thus, there is a need for a frame erasure compensation method that improves predictive coder performance in the event of frame erasures and smoothes discontinuities between frame erasures and subsequent good frames.
The present invention is directed to a frame erasure compensation method that improves predictive coder performance in the event of frame erasures and smoothes discontinuities between frame erasures and subsequent good frames. Accordingly, in one aspect of the invention, a method of compensating for a frame erasure in a speech coder is provided. The method advantageously includes quantizing a pitch lag value and a delta value for a current frame processed after an erased frame is declared, the delta value being equal to the difference between the pitch lag value for the current frame and a pitch lag value for a frame immediately preceding the current frame; quantizing a delta value for at least one frame prior to the current frame and after the frame erasure, wherein the delta value is equal to the difference between a pitch lag value for the at least one frame and a pitch lag value for a frame immediately preceding the at least one frame; and subtracting each delta value from the pitch lag value for the current frame to generate a pitch lag value for the erased frame.
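The pitch lag recovery described in this aspect can be sketched as follows. This is an illustrative sketch only: the function name and the numeric pitch lag values are hypothetical, and quantization of the values is not modeled.

```python
def recover_erased_pitch_lag(current_lag, deltas):
    """Subtract each delta value from the current frame's pitch lag.

    deltas holds the delta value of the current frame and of every
    frame between the erased frame and the current frame; each delta is
    that frame's pitch lag minus the pitch lag of the frame before it.
    """
    lag = current_lag
    for delta in deltas:
        lag -= delta
    return lag


# Assumed example: the erased frame had pitch lag 40. The next frame
# (lag 42) carried only its delta of +2. The current frame carries its
# pitch lag, 45, and its delta of +3.
erased_lag = recover_erased_pitch_lag(45, [3, 2])

# With no intervening frame, only the current frame's delta is used.
previous_lag = recover_erased_pitch_lag(45, [3])
```

In the first case the recovered value is 45 - 3 - 2 = 40, matching the pitch lag the erased frame would have carried. The order of subtraction does not matter, since the result depends only on the sum of the delta values.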
In another aspect of the invention, a speech coder configured to compensate for a frame erasure is provided. The speech coder advantageously includes means for quantizing a pitch lag value and a delta value for a current frame processed after an erased frame is declared, the delta value being equal to the difference between the pitch lag value for the current frame and a pitch lag value for a frame immediately preceding the current frame; means for quantizing a delta value for at least one frame prior to the current frame and after the frame erasure, wherein the delta value is equal to the difference between a pitch lag value for the at least one frame and a pitch lag value for a frame immediately preceding the at least one frame; and means for subtracting each delta value from the pitch lag value for the current frame to generate a pitch lag value for the erased frame.
In another aspect of the invention, a subscriber unit configured to compensate for a frame erasure is provided. The subscriber unit advantageously includes a first speech coder configured to quantize a pitch lag value and a delta value for a current frame processed after an erased frame is declared, the delta value being equal to the difference between the pitch lag value for the current frame and a pitch lag value for a frame immediately preceding the current frame; a second speech coder configured to quantize a delta value for at least one frame prior to the current frame and after the frame erasure, wherein the delta value is equal to the difference between a pitch lag value for the at least one frame and a pitch lag value for a frame immediately preceding the at least one frame; and a control processor coupled to the first and second speech coders and configured to subtract each delta value from the pitch lag value for the current frame to generate a pitch lag value for the erased frame.
In another aspect of the invention, an infrastructure element configured to compensate for a frame erasure is provided. The infrastructure element advantageously includes a processor; and a storage medium coupled to the processor and containing a set of instructions executable by the processor to quantize a pitch lag value and a delta value for a current frame processed after an erased frame is declared, the delta value being equal to the difference between the pitch lag value for the current frame and a pitch lag value for a frame immediately preceding the current frame, quantize a delta value for at least one frame prior to the current frame and after the frame erasure, wherein the delta value is equal to the difference between a pitch lag value for the at least one frame and a pitch lag value for a frame immediately preceding the at least one frame, and subtract each delta value from the pitch lag value for the current frame to generate a pitch lag value for the erased frame.