The present invention relates to a system and method for performing forward error correction in the transmission of audio information, and more particularly, to a system and method for performing forward error correction in packet-based transmission of speech-coded information.
The shortcomings of state-of-the-art forward error correction (FEC) techniques can best be appreciated by an introductory discussion of some conventional speech coding concepts.
1.1 Code-Excited Linear Predictive (CELP) Coding
FIG. 1 shows a conventional code-excited linear predictive (CELP) analysis-by-synthesis encoder 100. The encoder 100 includes functional units designated as framing module 104, linear prediction coding (LPC) analysis module 106, difference calculating module 118, error weighting module 114, error minimization module 116, and decoder module 102. The decoder module 102, in turn, includes a fixed codebook 112, a long-term predictor (LTP) filter 110, and a linear predictor coding (LPC) filter 108 connected together in cascaded relationship to produce a synthesized signal ŝ(n). The LPC filter 108 models the short-term correlation in the speech attributed to the vocal tract, corresponding to the spectral envelope of the speech signal. It can be represented by:
$$\frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}, \qquad \text{(Eq. 1)}$$
where p denotes the filter order and $a_i$ denotes the filter coefficients. The LTP filter 110, on the other hand, models the long-term correlation of the speech attributed to the vocal cords, corresponding to the fine periodic-like spectral structure of the speech signal. For example, it can have the form given by:
$$\frac{1}{P(z)} = \frac{1}{1 - \sum_{i=-1}^{1} b_i z^{-(D+i)}}, \qquad \text{(Eq. 2)}$$
where D generally corresponds to the pitch period of the long-term correlation, and $b_i$ pertains to the filter's long-term gain coefficients. The fixed codebook 112 stores a series of excitation input sequences. The sequences provide excitation signals to the LTP filter 110 and LPC filter 108, and are useful in modeling characteristics of the speech signal which cannot be predicted with deterministic methods using the LTP filter 110 and LPC filter 108, such as audio components within music, to some degree.
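By way of illustration only, the cascade of the LTP filter (Eq. 2) and the LPC filter (Eq. 1) can be sketched in Python as two recursive difference equations. The function names `ltp_filter`, `lpc_filter` and `synthesize`, and the requirement that the delay be at least two samples, are assumptions made for this sketch and are not taken from any referenced standard.

```python
def ltp_filter(excitation, delay, taps):
    """Apply 1/P(z) of Eq. 2: y(n) = x(n) + sum_{i=-1..1} b_i * y(n - (D + i))."""
    if delay < 2:
        raise ValueError("this sketch assumes a delay D of at least 2 samples")
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, b in zip((-1, 0, 1), taps):  # taps = (b_-1, b_0, b_1)
            k = n - (delay + i)
            if k >= 0:
                acc += b * out[k]
        out.append(acc)
    return out


def lpc_filter(signal, coeffs):
    """Apply 1/A(z) of Eq. 1: y(n) = x(n) + sum_{i=1..p} a_i * y(n - i)."""
    out = []
    for n, x in enumerate(signal):
        acc = x
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def synthesize(excitation, delay, ltp_taps, lpc_coeffs):
    """Drive the cascaded LTP and LPC filters with one excitation sequence."""
    return lpc_filter(ltp_filter(excitation, delay, ltp_taps), lpc_coeffs)
```

In this sketch, an excitation sequence from the fixed codebook would be passed as `excitation`, and filter memories are assumed to be zero at the start of the sequence.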
In operation, the framing module 104 receives an input speech signal and divides it into successive frames (e.g., 20 ms in duration). Then, the LPC analysis module 106 receives and analyzes a frame to generate a set of LPC coefficients. These coefficients are used by the LPC filter 108 to model the short-term characteristics of the speech signal corresponding to its spectral envelope. An LPC residual can then be formed by feeding the input speech signal through an inverse filter including the calculated LPC coefficients. This residual, shown in FIG. 2, represents a component of the original speech signal that remains after removal of the short-term redundancy by linear predictive analysis. The distance between two pitch pulses is denoted “L” and is called the lag. The encoder 100 can then use the residual to predict the long-term coefficients. These long-term coefficients are used by the LTP filter 110 to model the fine spectral structure of the speech signal (such as pitch delay and pitch gain). Taken together, the LTP filter 110 and the LPC filter 108 form a cascaded filter which models the long-term and short-term characteristics of the speech signal. When driven by an excitation sequence from the fixed codebook 112, the cascaded filter generates the synthetic speech signal ŝ(n) which represents a reconstructed version of the original speech signal s(n).
The encoder 100 selects an optimum excitation sequence by successively generating a series of synthetic speech signals ŝ(n), successively comparing the synthetic speech signals ŝ(n) with the original speech signals s(n), and successively adjusting the operational parameters of the decoder module 102 to minimize the difference between ŝ(n) and s(n). More specifically, the difference calculating module 118 forms the difference (i.e., the error signal e(n)) between the original speech signal s(n) and the synthetic speech signal ŝ(n). An error weighting module 114 receives the error signal e(n) and generates a weighted error signal $e_w(n)$ based on perceptual weighting factors. The error minimization module 116 uses a search procedure to adjust the operational parameters of the decoder module 102 such that it produces a synthesized signal ŝ(n) which is as close as possible to the original signal s(n).
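The analysis-by-synthesis loop described above can be sketched, purely as an illustration, as an exhaustive search over codebook entries. The sketch assumes a plain squared error in place of the perceptually weighted error, and a hypothetical one-pole synthesis filter `one_pole` standing in for the cascaded filter; neither simplification is drawn from the source.

```python
def one_pole(excitation, a=0.5):
    """Hypothetical stand-in for the synthesis filter: y(n) = x(n) + a * y(n-1)."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        out.append(prev)
    return out


def search_codebook(target, codebook, synthesis):
    """Return (index, error) of the entry whose synthesis is closest to target."""
    best_index, best_error = None, float("inf")
    for index, excitation in enumerate(codebook):
        synthetic = synthesis(excitation)
        # Unweighted squared error e(n)^2 summed over the frame (assumption:
        # perceptual weighting is omitted to keep the sketch short).
        error = sum((s - y) ** 2 for s, y in zip(target, synthetic))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = one_pole([0.0, 1.0, 0.0])
best = search_codebook(target, codebook, one_pole)
```

Only the winning index, rather than the waveform itself, would then need to be conveyed to a decoder holding the same codebook.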
Upon arriving at an optimum synthesized signal ŝ(n), relevant encoder parameters are transferred over a transmission medium (not shown) to a decoder site (not shown). A decoder at the decoder site includes an identical construction to the decoder module 102 of the encoder 100. The decoder uses the transferred parameters to reproduce the optimized synthesized signal ŝ(n) calculated in the encoder 100. For instance, the encoder 100 can transfer codebook indices representing the location of the optimal excitation signal in the fixed codebook 112, together with relevant filter parameters or coefficients (e.g., the LPC and LTP parameters). The transfer of the parameters in lieu of a more direct representation of the input speech signal provides notable reduction in the bandwidth required to transmit speech information.
FIG. 3 shows a modification of the analysis-by-synthesis encoder 100 shown in FIG. 1. The encoder 300 shown in FIG. 3 includes a framing module 304, LPC analysis module 306, LPC filter 308, difference calculating module 318, error weighting module 314, error minimization module 316, and fixed codebook 312. Each of these units generally corresponds to the like-named parts shown in FIG. 1. In FIG. 3, however, the LTP filter 110 is replaced by the adaptive codebook 320. Further, an adder module 322 adds the excitation signals output from the adaptive codebook 320 and the fixed codebook 312.
The encoder 300 functions basically in the same manner as the encoder 100 of FIG. 1. In the encoder 300, however, the adaptive codebook 320 models the long-term characteristics of the speech signal. Further, the excitation signal applied to the LPC filter 308 represents a summation of an adaptive codebook 320 entry and a fixed codebook 312 entry.
1.2 GSM Enhanced Full Rate Coding (GSM-EFR)
The prior art provides numerous specific implementations of the above-described CELP design. One such implementation is the GSM Enhanced Full Rate (GSM-EFR) speech transcoding standard described in the European Telecommunications Standards Institute's (ETSI) “Global System for Mobile Communications: Digital Cellular Telecommunications Systems: Enhanced Full Rate (EFR) Speech Transcoding (GSM 06.60),” November 1996, which is incorporated by reference herein in its entirety.
The GSM-EFR standard models the short-term properties of the speech signal using:
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}}, \qquad \text{(Eq. 3)}$$
where $\hat{a}_i$ represents the quantized linear prediction parameters. The standard models the long-term features of the speech signal with:
$$\frac{1}{B(z)} = \frac{1}{1 - g_p z^{-T}}, \qquad \text{(Eq. 4)}$$
where T pertains to the pitch delay and $g_p$ pertains to the pitch gain. An adaptive codebook implements the pitch synthesis. Further, the GSM-EFR standard uses a perceptual weighting filter defined by:
$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad \text{(Eq. 5)}$$
where A(z) defines the unquantized LPC filter, and $\gamma_1$ and $\gamma_2$ represent perceptual weighting factors. Finally, the GSM-EFR standard uses adaptive and fixed (innovative) codebooks to provide an excitation signal. In particular, the fixed codebook forms an algebraic codebook structured based on an interleaved single-pulse permutation (ISPP) design. The excitation vectors consist of a fixed number of mathematically calculated nonzero pulses. An excitation is specified by selected pulse positions and signs within the codebook.
In operation, the GSM-EFR encoder divides the input speech signal into 20 ms frames, which, in turn, are divided into four 5 ms subframes. The encoder then performs LPC analysis twice per frame. More specifically, the GSM-EFR encoder uses an auto-correlation approach with 30 ms asymmetric windows to calculate the short-term parameters. No look-ahead is employed in the LPC analysis. Look-ahead refers to the use of samples from a future frame in performing analysis.
The LP coefficients are then converted to a Line Spectral Pair (LSP) representation for quantization and interpolation using an LSP predictor. LSP analysis maps the filter coefficients onto a unit circle in the range of −π to π to produce Line Spectral Frequency (LSF) values. The use of LSF values provides better robustness and stability against bit errors compared to the use of LPC values. Further, the use of LSF values enables a more efficient quantization of information compared to the use of LPC values. GSM-EFR specifically uses the following predictor equation to calculate a residual that is then quantized:
$$\mathrm{LSF}_{\mathrm{res}} = \mathrm{LSF} - \mathrm{LSF}_{\mathrm{mean}} - \mathrm{predFactor} \cdot \mathrm{LSF}_{\mathrm{prev,res}}. \qquad \text{(Eq. 6)}$$
The term $\mathrm{LSF}_{\mathrm{res}}$ refers to an LSF residual vector for a frame n. The quantity $(\mathrm{LSF} - \mathrm{LSF}_{\mathrm{mean}})$ defines a mean-removed LSF vector at frame n. The term $(\mathrm{predFactor} \cdot \mathrm{LSF}_{\mathrm{prev,res}})$ refers to a predicted LSF vector at frame n, wherein predFactor refers to a prediction factor constant and $\mathrm{LSF}_{\mathrm{prev,res}}$ refers to a second residual vector from the past frame (i.e., frame n−1). The decoder uses the inverse process, as per Eq. 7 below:
$$\mathrm{LSF} = \mathrm{LSF}_{\mathrm{res}} + \mathrm{LSF}_{\mathrm{mean}} + \mathrm{predFactor} \cdot \mathrm{LSF}_{\mathrm{prev,res}}. \qquad \text{(Eq. 7)}$$
To achieve the predicted result, the previous residual $\mathrm{LSF}_{\mathrm{prev,res}}$ in the decoder must have the correct value. After reconstruction, the coefficients are converted into direct filter form, and used when synthesizing the speech.
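The dependence of Eq. 7 on a correct previous residual can be illustrated with a short Python sketch. The function names and the numeric values, including the prediction factor of 0.5, are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def lsf_encode(lsf, lsf_mean, prev_res, pred_factor):
    """Eq. 6: residual = LSF - mean - predFactor * previous residual."""
    return [x - m - pred_factor * r for x, m, r in zip(lsf, lsf_mean, prev_res)]


def lsf_decode(lsf_res, lsf_mean, prev_res, pred_factor):
    """Eq. 7: LSF = residual + mean + predFactor * previous residual."""
    return [r + m + pred_factor * p for r, m, p in zip(lsf_res, lsf_mean, prev_res)]


# Illustrative two-element vectors (real LSF vectors are longer).
lsf = [0.30, 0.90]
lsf_mean = [0.25, 1.00]
prev_res = [0.02, -0.04]
pred_factor = 0.5  # hypothetical value for this sketch

residual = lsf_encode(lsf, lsf_mean, prev_res, pred_factor)
# Decoder with the correct history recovers the original LSF vector.
good = lsf_decode(residual, lsf_mean, prev_res, pred_factor)
# Decoder whose history was lost (here assumed reset to zero) does not.
stale = lsf_decode(residual, lsf_mean, [0.0, 0.0], pred_factor)
```

With a stale history the decoded vector is offset by predFactor times the error in the stored residual, which is one way in which a lost frame can degrade frames that arrive intact.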
The encoder then executes so-called open-loop pitch analysis to estimate the pitch lag in each half of the frame (every 10 ms) based on the perceptually weighted speech signal. Thereafter, the encoder performs a number of operations on each subframe. More specifically, the encoder computes a target signal x(n) by subtracting the zero input response of the weighted synthesis filter W(z)H(z) from the weighted speech signal. Then the encoder computes an impulse response h(n) of the weighted synthesis filter. The encoder uses the impulse response h(n) to perform so-called closed-loop analysis to find pitch lag and gain. Closed-loop search analysis involves minimizing the mean-square weighted error between the original and synthesized speech. The closed-loop search uses the open-loop lag computation as an initial estimate. Thereafter, the encoder updates the target signal x(n) by removing the adaptive codebook contribution, and the encoder uses the resultant target to find an optimum innovation vector within the algebraic codebook. The relevant parameters of the codebooks are then scalar quantized using a codebook predictor, and the filter memories are updated using the determined excitation signal for finding the target signal in the next subframe.
The encoder transmits two sets of LSP coefficients (comprising 38 bits), pitch delay parameters (comprising 30 bits), pitch gain parameters (comprising 16 bits), algebraic code parameters (comprising 140 bits), and codebook gain parameters (comprising 20 bits). The decoder receives these parameters and reconstructs the synthesized speech by duplicating the encoder conditions represented by the transmitted parameters.
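As a quick check on the figures above, the per-frame bit allocation can be totaled and converted to a bit rate using the 20 ms frame length stated earlier. The dictionary keys and function names below are labels chosen for this sketch only.

```python
# Bits per 20 ms frame, as listed in the preceding paragraph.
BITS_PER_FRAME = {
    "lsp": 38,
    "pitch_delay": 30,
    "pitch_gain": 16,
    "algebraic_code": 140,
    "codebook_gain": 20,
}
FRAME_MS = 20


def total_bits(allocation):
    """Sum the bits of all parameter groups in one frame."""
    return sum(allocation.values())


def bit_rate_bps(allocation, frame_ms):
    """Bits per second implied by the allocation and the frame duration."""
    return total_bits(allocation) * 1000 / frame_ms


frame_bits = total_bits(BITS_PER_FRAME)         # 244 bits per frame
rate = bit_rate_bps(BITS_PER_FRAME, FRAME_MS)   # 12200.0 bit/s, i.e. 12.2 kbit/s
```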
1.3 Error Concealment (EC) in GSM-EFR Coding
The European Telecommunications Standards Institute (ETSI) proposes error concealment for use in GSM-EFR in “Digital Cellular Telecommunications System: Substitution and Muting of Lost Frames for Enhanced Full Rate (EFR) Speech Traffic Channels (GSM 06.61),” version 5.1.2, April 1997, which is incorporated herein by reference in its entirety. The referenced standard proposes an exemplary state machine having seven states, 0 through 6. A Bad Frame Indication (BFI) flag indicates whether the current speech frame contains an error (BFI=0 for no errors, and BFI=1 for errors). A Previous Bad Frame Indication (PrevBFI) flag indicates whether the previous speech frame contained errors (PrevBFI=0 for no errors, and PrevBFI=1 for errors). State 0 corresponds to a state in which both the current and past frames contain no errors (i.e., BFI=0, PrevBFI=0). The machine advances to state 1 when an error is detected in the current frame. (The error can be detected using an 8-bit cyclic redundancy check on the frame.) The state machine successively advances to higher states (up to the maximum state of 6) upon the detection of further errors in subsequent frames. When a good (i.e., error-free) frame is detected, the state machine reverts back to state 0, unless the state machine is currently in state 6, in which case it reverts back to state 5.
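The state transitions described above can be summarized in a minimal Python sketch. The names `next_state` and `run` are hypothetical, and the sketch models only the transitions stated in this paragraph, not the concealment actions taken in each state.

```python
MAX_STATE = 6


def next_state(state, bfi):
    """Advance the concealment state for one frame given its BFI flag."""
    if bfi:
        # Bad frame: move one state higher, saturating at state 6.
        return min(state + 1, MAX_STATE)
    # Good frame: return to state 0, except from state 6, which goes to 5.
    return 5 if state == MAX_STATE else 0


def run(bfi_flags, state=0):
    """Return the state after each frame for a sequence of BFI flags."""
    states = []
    for bfi in bfi_flags:
        state = next_state(state, bfi)
        states.append(state)
    return states


# Seven bad frames followed by two good frames.
trace = run([1, 1, 1, 1, 1, 1, 1, 0, 0])
# trace == [1, 2, 3, 4, 5, 6, 6, 5, 0]
```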
The decoder performs different error concealment operations depending on the state and the values of flags BFI and PrevBFI. The condition BFI=0 and PrevBFI=0 (within state 0) pertains to the receipt of two consecutive error-free frames. In this condition, the decoder processes speech parameters in the typical manner set forth in the GSM 06.60 standard. The decoder then saves the current frame of speech parameters.
The condition BFI=0 and PrevBFI=1 (within states 0 or 5) pertains to the receipt of an error-free frame after receiving a “bad” frame. In this condition, the decoder limits the LTP gain and fixed codebook gain to the values used for the last received good subframe. In other words, if the value of the current LTP gain ($g_p$) is equal to or less than the last good LTP gain received, then the current LTP gain is used. However, if the value of the current LTP gain is larger than the last good LTP gain received, then the value of the last LTP gain is used in place of the current LTP gain. The value for the gain of the fixed codebook is adjusted in a similar manner.
The condition BFI=1 (within any of states 1 to 6, with PrevBFI equal to either 0 or 1) indicates that an error has been detected in the current frame. In this condition, the current LTP gain is replaced by the following gain:
$$g_p = \begin{cases} \alpha_{\mathrm{state}}(n) \cdot g_p(-1), & g_p(-1) \le \mathrm{median} \\ \alpha_{\mathrm{state}}(n) \cdot \mathrm{median}, & g_p(-1) > \mathrm{median} \end{cases} \qquad \text{(Eq. 8)}$$
where $g_p$ designates the gain of the LTP filter, $\alpha_{\mathrm{state}}(n)$ designates an attenuation coefficient which has a successively greater attenuating effect with increase in state n (e.g., $\alpha_{\mathrm{state}}(1)=0.98$, whereas $\alpha_{\mathrm{state}}(6)=0.20$), “median” designates the median of the $g_p$ values for the last five subframes, and $g_p(-1)$ designates the gain of the previous subframe. The value for the gain of the fixed codebook is adjusted in a similar manner.
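Eq. 8 can be illustrated with the following Python sketch. The function name and the sample gain history are hypothetical. Only the two attenuation values quoted above (0.98 for state 1 and 0.20 for state 6) are used, and the attenuation coefficient is passed in as a parameter rather than tabulated, since the intermediate values are not given here.

```python
from statistics import median


def concealed_ltp_gain(prev_gain, recent_gains, alpha):
    """Eq. 8: attenuate the previous gain, capped at the recent median."""
    med = median(recent_gains)  # median over the last five subframe gains
    limited = prev_gain if prev_gain <= med else med
    return alpha * limited


# Hypothetical gain history of the last five subframes (most recent last).
history = [0.5, 0.9, 0.7, 0.6, 0.8]

gain_state_1 = concealed_ltp_gain(history[-1], history, alpha=0.98)
gain_state_6 = concealed_ltp_gain(history[-1], history, alpha=0.20)
```

Because the previous gain (0.8) exceeds the median (0.7) in this example, the median is the value that is attenuated in both cases.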
In the above-described state (i.e., when BFI=1), the decoder also updates the codebook gain in memory by using the average value of the last four values in memory. Furthermore, the decoder shifts the past LSFs toward their mean, i.e.:
$$\mathrm{LSF\_q1}(i) = \mathrm{LSF\_q2}(i) = \beta \cdot \mathrm{past\_LSF\_q}(i) + (1 - \beta) \cdot \mathrm{mean\_LSF}(i), \qquad \text{(Eq. 9)}$$
where LSF_q1(i) and LSF_q2(i) are two vectors from the current frame, $\beta$ is a constant (e.g., 0.95), past_LSF_q(i) is the value of LSF_q2 from the previous frame, and mean_LSF(i) is the average LSF value. Still further, the decoder replaces the LTP-lag values by the past lag value from the 4th subframe. And finally, the fixed codebook excitation pulses received by the decoder are used as such from the erroneous frame.
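A minimal Python sketch of the LSF shift of Eq. 9 follows. The function name `conceal_lsf` and the sample vectors are hypothetical; the default of 0.95 for the constant is the example value given above.

```python
def conceal_lsf(past_lsf_q, mean_lsf, beta=0.95):
    """Eq. 9: move the previous frame's LSFs toward the mean LSF vector.

    Returns the pair (LSF_q1, LSF_q2), which Eq. 9 sets to the same values.
    """
    shifted = [beta * p + (1.0 - beta) * m for p, m in zip(past_lsf_q, mean_lsf)]
    return list(shifted), list(shifted)


# Illustrative two-element vectors (real LSF vectors are longer).
past = [0.2, 1.0]
mean = [0.4, 0.8]
lsf_q1, lsf_q2 = conceal_lsf(past, mean, beta=0.5)
```

Applied over several consecutive bad frames, each call would take the previous output as its `past_lsf_q`, so the spectrum drifts progressively toward the mean.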
1.4 Vocoders
FIG. 4 shows another type of speech decoder, the LPC-based vocoder 400. In this decoder, the LPC residual is created from noise vector 404 (for unvoiced sounds) or a static pulse form 406 (for voiced speech). A gain module 406 scales the residual to a desired level. The output of the gain module is supplied to an LPC filter block including LPC filter 408, having an exemplary function defined by:
$$A(z) = \sum_{i=1}^{n} a_i z^{-i}, \qquad \text{(Eq. 10)}$$
where $a_i$ designates the coefficients of the filter, which can be computed by minimizing the mean square of the prediction error. One known vocoder is designated as “LPC-10.” This vocoder was developed for the U.S. military to provide low bit-rate communication. The LPC-10 vocoder uses 22.5 ms frames of 54 bits each, corresponding to a bit rate of 2.4 kbit/s.
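The stated LPC-10 bit rate follows directly from the frame size and frame duration, as the following worked calculation shows:

```latex
\frac{54\ \text{bits}}{22.5\ \text{ms}}
  = \frac{54\ \text{bits}}{0.0225\ \text{s}}
  = 2400\ \text{bit/s}
  = 2.4\ \text{kbit/s}
```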
In operation, the LPC-10 encoder (not shown) makes a voicing decision to use either the pulse train or the noise signal. In the LPC-10, this can be performed by forming a low-pass filtered version of the sampled input signal. The decision is based on the energy of the signal, maximum-to-minimum ratio of the signal, and the number of zero crossings of the signal. Voicing decisions are made for each half of the current frame, and the final voicing decision is based on these two half-frame decisions and the decisions from the next two frames.
The pitch is determined from a low-pass and inverse-filtered signal. The pitch gain is determined from the root mean square value (RMS) of the signal. Relevant parameters characterizing the coding are quantized, sent to the decoder, and used to produce a synthesized signal in the decoder. More particularly, this coding technique provides coding with ten coefficients.
The vocoder 400 uses a simpler synthesis model than the GSM-EFR technique and accordingly uses fewer bits than the GSM-EFR technique to represent the speech, which, however, results in inferior quality. The low bit-rate makes vocoders suitable as redundant encoders for speech (to be described below). Vocoders work well in modeling voiced and unvoiced speech, but do not accurately handle plosives (representing complete closure and subsequent release of a vocal tract obstruction) and non-speech information (e.g., music).
Further details on conventional speech coding can be gleaned from the book Digital Speech: Coding for Low Bit Rate Communication Systems, A. M. Kondoz, 1994, John Wiley and Sons, which is incorporated herein by reference in its entirety.
Once the speech is coded, a communication system can transfer it in a variety of formats. Packet-based networks transfer the audio data in a series of discrete packets.
Packet-based traffic can be subject to high packet loss ratios, jitter and reordering. Forward error correction (FEC) is one technique for addressing the problem of lost packets. Generally, FEC involves transmitting redundant information along with the coded speech. The decoder attempts to use the redundant information to reconstruct lost packets. Media-independent FEC techniques add redundant information based on the bits within the audio stream (independent of higher-level knowledge of the characteristics of the speech stream). On the other hand, media-dependent FEC techniques add redundant information based on the characteristics of the speech stream.
U.S. Pat. No. 5,870,412 to Schuster et al. describes one media-independent technique. This method appends a single forward error correction code to each of a series of payload packets. The error correction code is defined by taking the XOR sum of a preceding specified number of payload packets. A receiver can reconstruct a lost payload from the redundant error correction codes carried by succeeding packets, and can also correct for the loss of multiple packets in a row. This technique has the disadvantage of using a variable delay. Further, the XOR result must be of the same size as the largest payload used in the calculation.
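The general idea of XOR-based, media-independent redundancy can be sketched in Python as follows. This is a simplified illustration and not the method of the cited patent: the window size, the packet layout as `(payload, parity)` tuples, the zero-padding of short payloads, and the assumption that the receiver knows the lost payload's length are all choices made for this sketch.

```python
def xor_bytes(blocks):
    """XOR byte strings together, zero-padding to the longest block."""
    size = max((len(b) for b in blocks), default=0)
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def add_parity(payloads, window):
    """Attach to each payload the XOR of up to `window` preceding payloads."""
    return [(p, xor_bytes(payloads[max(0, k - window):k]))
            for k, p in enumerate(payloads)]


def recover(lost, received, window, length):
    """Rebuild payload `lost` from a later packet's parity, if possible.

    `received` maps packet index -> (payload, parity) for packets that arrived.
    """
    for k in range(lost + 1, lost + window + 1):
        if k not in received:
            continue
        others = [j for j in range(max(0, k - window), k) if j != lost]
        if all(j in received for j in others):
            parity = received[k][1]
            known = [received[j][0] for j in others]
            return xor_bytes([parity] + known)[:length]
    return None  # no usable parity arrived


payloads = [b"aa", b"bbbb", b"c", b"dd", b"e"]
packets = add_parity(payloads, window=2)
received = {k: pkt for k, pkt in enumerate(packets) if k != 1}  # packet 1 lost
restored = recover(1, received, window=2, length=4)
```

The zero-padding in `xor_bytes` reflects the point made above that the XOR result must be as large as the largest payload used in the calculation.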
FIG. 5 shows an overview of a media-based FEC technique. The encoder module 502 includes a primary encoder 508 and a redundant encoder 510. A packetizer 516 receives the output of the primary encoder 508 and the redundant encoder 510, and, in turn, sends its output over transmission medium 506. A decoder module 504 includes primary decoder 512 and redundant decoder 514. The output of the primary decoder 512 and redundant decoder 514 is controlled by control logic 518.
In operation, the primary encoder 508 generates primary-encoded data using a primary synthesis model. The redundant encoder 510 generates redundant-encoded data using a redundant synthesis model. The redundant synthesis model typically provides a more heavily-compressed version of the speech than the primary synthesis model (e.g., having a consequent lower bandwidth and lower quality). For instance, one known approach uses PCM-encoded data as primary-encoded speech, and LPC-encoded data as redundant-encoded speech (note, for instance, V. Hardman et al., “Reliable Audio for Use Over the Internet,” Proc. INET'95, 1995). The LPC-encoded data has a much lower bit rate than the PCM-encoded data.
FIG. 6 shows how redundant data (represented by shaded blocks) may be appended to primary data (represented by non-shaded blocks). For instance, with reference to the topmost row of packets, the first packet contains primary data for frame n. Redundant data for the previous frame, i.e., frame n−1, is appended to this primary data. In this manner, the redundant data within a packet always refers to previously transmitted primary data. The technique provides a single level of redundancy, but additional levels may be provided (by transmitting additional copies of the redundant data).
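The staggered pairing of primary and redundant data can be sketched as below for a single level of redundancy. The dictionary-based packet layout and the field names `seq`, `primary` and `redundant` are hypothetical and do not correspond to any particular payload format.

```python
def packetize(primary_frames, redundant_frames):
    """Pair primary data for frame n with redundant data for frame n-1."""
    packets = []
    for n, primary in enumerate(primary_frames):
        # The first packet has no earlier frame to protect.
        redundant = redundant_frames[n - 1] if n > 0 else None
        packets.append({"seq": n, "primary": primary, "redundant": redundant})
    return packets


# Placeholder byte strings standing in for encoder outputs.
primary = [b"P0", b"P1", b"P2"]
redundant = [b"r0", b"r1", b"r2"]
stream = packetize(primary, redundant)
```

Note that in this arrangement the redundant data for the final frame is never sent unless a further packet follows it.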
Specific formats have been proposed for appending the redundant data to the primary data payload. For instance, Perkins et al. propose a specific format for appending LPC-encoded redundant data to primary payload data within the Real-time Transport Protocol (RTP) (e.g., note C. Perkins et al., “RTP Payload for Redundant Audio Data,” RFC 2198, September 1997). The packet header includes information pertaining to the primary data and information pertaining to the redundant data. For instance, the header includes a field for providing the timestamp of the primary encoding, which indicates the time of primary-encoding of the data. The header also includes an offset timestamp, which indicates the difference in time between the primary encoding and redundant encoding represented in the packet.
With reference to both FIGS. 5 and 6, the decoder module 504 receives the packets containing both primary and redundant data. The decoder module 504 includes logic (not shown) for separating the primary data from the redundant data. The primary decoder 512 decodes the primary data, while the redundant decoder 514 decodes the redundant data. More specifically, the decoder module 504 decodes primary-data frame n when the next packet containing the redundant data for frame n arrives. This delay is added on playback and is represented graphically in FIG. 6 by the legend “Extra delay.” In the prior art technique, the control logic 518 instructs the decoder module 504 to use the synthesized speech generated by the primary decoder 512 when a packet is received containing primary-encoded data. On the other hand, the control logic 518 instructs the decoder module 504 to use synthesized speech generated by the redundant decoder 514 when the packet containing primary data is “lost.” In such a case, the control logic 518 simply serves to fill in gaps in the received stream of primary-encoded frames with redundant-encoded frames. For example, in the above-referenced technique described in Hardman et al., the decoder will decode the LPC-encoded data in place of the PCM-encoded data upon detection of packet loss in the PCM-encoded stream.
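The gap-filling behavior of the prior-art control logic can be sketched in Python as follows. The packet layout (a dictionary with `primary` and `redundant` fields, where packet n+1 carries the redundant copy of frame n) and the function name `select_frames` are assumptions made for this illustration.

```python
def select_frames(received, num_frames):
    """Choose, per frame, primary data if present, else the redundant copy.

    `received` maps packet sequence number -> packet dictionary.
    Returns a list of (source, data) tuples, one per frame.
    """
    selected = []
    for n in range(num_frames):
        if n in received:
            selected.append(("primary", received[n]["primary"]))
        elif n + 1 in received and received[n + 1]["redundant"] is not None:
            # Frame n is lost; its redundant copy rides in packet n+1.
            selected.append(("redundant", received[n + 1]["redundant"]))
        else:
            selected.append(("lost", None))
    return selected


packets = {
    0: {"primary": b"P0", "redundant": None},
    2: {"primary": b"P2", "redundant": b"r1"},  # packet 1 was lost
}
choice = select_frames(packets, num_frames=4)
```

The sketch only switches between the two decoders' inputs frame by frame; it performs no update of one model's state from the other, which is the limitation discussed in the paragraphs that follow.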
The use of conventional FEC to improve the quality of packet-based audio transmission is not fully satisfactory. For instance, speech synthesis models use the parameters of past operational states to generate accurate speech synthesis in present operational states. In this sense, the models are “history-dependent.” For example, an algebraic code-excited linear prediction (ACELP) speech model uses previously produced syntheses to update its adaptive codebook. The LPC filter, error concealment histories, and various quantization-predictors also use previous states to accurately generate speech in current states. Thus, even if a decoder can reconstruct missing frames using redundant data, the “memory” of the primary synthesis model is deficient due to the loss of primary data. This can create “lingering” problems in the quality of speech synthesis. For example, a poorly updated adaptive codebook can cause distorted waveforms for more than ten frames. Conventional FEC techniques do nothing to address these types of lingering problems.
Furthermore, FEC-based speech coding techniques may suffer from a host of other problems not heretofore addressed by FEC techniques. For instance, in analysis-by-synthesis techniques using linear predictors, phase discontinuities may be very audible. In techniques using an adaptive codebook, a phase error placed in the feedback loop may remain for numerous frames. Further, in speech encoders using LP coefficients that are predicted when encoded, a loss of the LPC parameters lowers the precision of the predictor. This will introduce errors into the most important parameter in an LPC speech coding technique.
It is accordingly a general objective of the present invention to improve the quality of speech produced using the FEC technique.
This and other objectives are achieved by the present invention through an improved FEC technique for coding speech data. In the technique, an encoder module primary-encodes an input speech signal using a primary synthesis model to produce primary-encoded data, and redundant-encodes the input speech signal using a redundant synthesis model to produce redundant-encoded data. A packetizer combines the primary-encoded data and the redundant-encoded data into a series of packets and transmits the packets over a packet-based network, such as an Internet Protocol (IP) network. A decoding module primary-decodes the packets using the primary synthesis model, and redundant-decodes the packets using the redundant synthesis model. The technique provides interaction between the primary synthesis model and the redundant synthesis model during and after decoding to improve the quality of the synthesized output speech signal. Such “interaction,” for instance, may take the form of updating states in one model using the other model.
Further, the present technique takes advantage of the FEC-staggered coupling of primary and redundant frames (i.e., the coupling of primary data for frame n with redundant data for frame n−1) to provide look-ahead processing at the encoder module and the decoder module. The look-ahead processing supplements the available information regarding the speech signal, and thus improves the quality of the output synthesized speech.
The interactive cooperation of both models to code speech signals greatly expands the use of redundant coding heretofore contemplated by conventional systems.