In digital cellular systems, a bit stream is transmitted through a communication channel connecting a mobile station to a base station over the air interface. The bit stream is organized into frames, including speech frames. Whether or not an error occurs during transmission depends on prevailing channel conditions. A speech frame that is detected to contain errors is simply called a bad frame. According to the prior art, in case of a bad frame, speech parameters derived from past correct parameters (of non-erroneous speech frames) are substituted for the speech parameters of the bad frame. The aim of bad frame handling by such a substitution is to conceal the corrupted speech parameters of the erroneous speech frame without causing a noticeable degradation of the speech quality.
Modern speech codecs operate by processing a speech signal in short segments, the above-mentioned frames. A typical frame length of a speech codec is 20 ms, which corresponds to 160 speech samples, assuming an 8 kHz sampling frequency. In so-called wideband codecs, frame length can again be 20 ms, but can correspond to 320 speech samples, assuming a 16 kHz sampling frequency. A frame may be further divided into a number of subframes.
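The frame sizes quoted above follow directly from the frame duration and the sampling frequency. The short sketch below merely reproduces that arithmetic; the function name samples_per_frame is a hypothetical helper introduced for illustration only.

```python
def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of speech samples in one frame of the given duration."""
    # Samples = duration in seconds * sampling frequency in Hz.
    return frame_ms * sample_rate_hz // 1000


# Narrowband codec: 20 ms frames at 8 kHz.
narrowband = samples_per_frame(20, 8000)

# Wideband codec: 20 ms frames at 16 kHz.
wideband = samples_per_frame(20, 16000)
```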
For every frame, an encoder determines a parametric representation of the input signal. The parameters are quantized and then transmitted through a communication channel in digital form. A decoder produces a synthesized speech signal based on the received parameters (see FIG. 1).
A typical set of extracted coding parameters includes spectral parameters (so called linear predictive coding parameters, or LPC parameters) used in short-term prediction, parameters used for long-term prediction of the signal (so called long-term prediction parameters or LTP parameters), various gain parameters, and finally, excitation parameters.
Linear predictive coding is a widely used and successful method for coding speech for transmission over a communication channel; it represents the frequency shaping attributes of the vocal tract. LPC parameterization characterizes the shape of the spectrum of a short segment of speech. The LPC parameters can be represented as either LSFs (Line Spectral Frequencies) or, equivalently, as ISPs (Immittance Spectral Pairs). ISPs are obtained by decomposing the inverse filter transfer function A(z) into a set of two transfer functions, one having even symmetry and the other having odd symmetry. The ISPs, also called Immittance Spectral Frequencies (ISFs), are the roots of these polynomials on the unit circle in the z-plane. Line Spectral Pairs (also called Line Spectral Frequencies) can be defined in the same way as Immittance Spectral Pairs; the difference between these representations lies in the conversion algorithm that transforms the LP filter coefficients into the other LPC parameter representation (LSP or ISP).
Sometimes the condition of the communication channel through which the encoded speech parameters are transmitted is poor, causing errors in the bit stream, i.e. causing frame errors (and so causing bad frames). There are two kinds of frame errors: lost frames and corrupted frames. In a corrupted frame, only some of the parameters describing a particular speech segment (typically of 20 ms duration) are corrupted. In a lost frame type of frame error, a frame is either totally corrupted or is not received at all.
In a packet-based transmission system for communicating speech (a system in which a frame is usually conveyed as a single packet), such as is sometimes provided by an ordinary Internet connection, it is possible that a data packet (or frame) will never reach the intended receiver, or that it will arrive so late that it cannot be used because of the real-time nature of speech. Such a frame is called a lost frame. A corrupted frame, by contrast, is a frame that does arrive at the receiver (usually within a single packet) but that contains some parameters that are in error, as indicated for example by a cyclic redundancy check (CRC). This is usually the situation in a circuit-switched connection, such as a connection in a global system for mobile communication (GSM) network, where the bit error rate (BER) in a corrupted frame is typically below 5%.
Thus, it can be seen that the optimal corrective response to an incidence of a bad frame is different for the two cases of bad frames (the corrupted frame and the lost frame). There are different responses because in case of corrupted frames, there is unreliable information about the parameters, and in case of lost frames, no information is available.
According to the prior art, when an error is detected in a received speech frame, a substitution and muting procedure is begun; the speech parameters of the bad frame are replaced by attenuated or modified values from the previous good frame, although some of the least important parameters from the erroneous frame are used, e.g. the code excited linear prediction parameters (CELPs), or more simply the excitation parameters.
In some methods according to the prior art, a buffer is used (in the receiver) called the parameter history, where the last speech parameters received without error are stored. When a frame is received without error, the parameter history is updated and the speech parameters conveyed by the frame are used for decoding. When a bad frame is detected, via a CRC check or some other error detection method, a bad frame indicator (BFI) is set to true and parameter concealment (substitution for and muting of the corresponding bad frames) is then begun; the prior-art methods for parameter concealment use parameter history for concealing corrupted frames. As mentioned above, when a received frame is classified as a bad frame (BFI set to true), some speech parameters may be used from the bad frame; for example, in the example solution for corrupted frame substitution of a GSM AMR (adaptive multi-rate) speech codec given in ETSI (European Telecommunications Standards Institute) specification 06.91, the excitation vector from the channel is always used. When a speech frame is lost (including the situation where a frame arrives too late to be used, such as for example in some IP-based transmission systems), obviously no parameters are available from the lost frame to be used.
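The parameter-history mechanism described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any particular specification: the class ParameterHistory, the function handle_frame, the decode_params callback, and the use of a 32-bit CRC from zlib are all hypothetical choices made for the example. The sketch also omits the attenuation and muting that a real decoder would apply to the substituted values, and it does not reuse any parameters from the bad frame.

```python
import zlib


class ParameterHistory:
    """Holds the most recent speech parameters received without error."""

    def __init__(self, initial_params):
        self.last_good = list(initial_params)

    def update(self, params):
        self.last_good = list(params)


def handle_frame(history, payload, received_crc, decode_params):
    """Return (params, bfi) for one frame.

    payload is None for a lost frame. decode_params is a hypothetical
    callback that turns a payload into a list of speech parameters.
    """
    # The bad frame indicator is set for a lost frame or a failed CRC check.
    bfi = payload is None or zlib.crc32(payload) != received_crc
    if not bfi:
        # Good frame: use its parameters and update the history.
        params = decode_params(payload)
        history.update(params)
        return params, False
    # Bad frame: substitute the last parameters received without error.
    return list(history.last_good), True
```

In this sketch a lost frame and a corrupted frame are handled identically, which is the limitation the surrounding text points out: a corrupted frame still carries some information that a more refined method could exploit.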
In some prior-art systems, the last good spectral parameters received are substituted for the spectral parameters of a bad frame, after being slightly shifted towards a constant predetermined mean. According to the GSM 06.91 ETSI specification, the concealment is done in LSF format, and is given by the following algorithm:
For i=0 to N−1:
    LSF_q1(i) = α*past_LSF_q(i) + (1−α)*mean_LSF(i);  (eq. 1.0)
    LSF_q2(i) = LSF_q1(i);
where α=0.95 and N is the order of the linear predictive (LP) filter being used. The quantity LSF_q1 is the quantized LSF vector of the second subframe, and the quantity LSF_q2 is the quantized LSF vector of the fourth subframe. The LSF vectors of the first and third subframes are interpolated from these two vectors. (The LSF vector for the first subframe in frame n is interpolated from the LSF vector of the fourth subframe in frame n−1, i.e. the previous frame.) The quantity past_LSF_q is the quantity LSF_q2 from the previous frame. The quantity mean_LSF is a vector whose components are predetermined constants; the components do not depend on a decoded speech sequence. The quantity mean_LSF with constant components generates a constant speech spectrum.
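As a rough illustration of eq. 1.0, the following sketch computes the substituted LSF vectors for one bad frame. The function name conceal_lsf and the use of plain Python lists are assumptions made for the example; the subframe interpolation mentioned above is not shown, and the numeric vectors in the usage lines are arbitrary illustrative values, not constants from any specification.

```python
ALPHA = 0.95  # weighting factor alpha from eq. 1.0


def conceal_lsf(past_lsf_q, mean_lsf, alpha=ALPHA):
    """Apply eq. 1.0 to obtain the substituted LSF vectors of a bad frame.

    past_lsf_q: LSF_q2 of the previous frame (length N).
    mean_lsf:   predetermined constant mean vector (length N).
    Returns (lsf_q1, lsf_q2), the vectors for the second and fourth subframes.
    """
    if len(past_lsf_q) != len(mean_lsf):
        raise ValueError("past_lsf_q and mean_lsf must have the same length")
    # Shift each past coefficient slightly towards the constant mean.
    lsf_q1 = [alpha * p + (1.0 - alpha) * m
              for p, m in zip(past_lsf_q, mean_lsf)]
    # The fourth-subframe vector is a copy of the second-subframe vector.
    lsf_q2 = list(lsf_q1)
    return lsf_q1, lsf_q2


# Arbitrary illustrative values (not taken from any specification).
past = [100.0, 200.0]
mean = [200.0, 400.0]
lsf_q1, lsf_q2 = conceal_lsf(past, mean)
```

Because lsf_q2 becomes past_LSF_q for the next frame, repeated application over a run of consecutive bad frames moves the spectrum geometrically towards mean_LSF, which is the behavior the text describes as a shift towards constant quantities.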
Such prior-art systems always shift the spectrum coefficients towards constant quantities, here indicated as mean_LSF(i). The constant quantities are constructed by averaging over a long time period and over several successive talkers. Such systems therefore offer only a compromise solution, not a solution that is optimal for any particular speaker or situation; the compromise trades off leaving annoying artifacts in the synthesized speech against making the speech sound more natural (i.e. the quality of the synthesized speech).
What is needed is an improved spectral parameter substitution in case of a corrupted speech frame, possibly a substitution based on both an analysis of the speech parameter history and the erroneous frame. Suitable substitution for erroneous speech frames has a significant effect on the quality of the synthesized speech produced from the bit stream.