The problem of speech coding (compressing speech into a small number of bits) has a large number of applications, and as a result has received considerable attention in the literature. One class of speech coders (vocoders) which have been extensively studied and used in practice is based on an underlying model of speech. Examples from this class of coders include linear prediction vocoders, homomorphic vocoders, and channel vocoders. In these vocoders, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. Speech is analyzed by first segmenting speech using a window such as a Hamming window. Then, for each segment of speech, the excitation parameters and system parameters are estimated and quantized. The excitation parameters consist of the voiced/unvoiced decision and the pitch period. The system parameters consist of the spectral envelope or the impulse response of the system. In order to reconstruct speech, the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.
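The excitation-plus-filter model described above can be illustrated with a minimal sketch. The function names, the three-tap impulse response, and the use of Gaussian noise for the unvoiced case are assumptions made only for illustration; they are not taken from any particular vocoder.

```python
import random

def synthesize_excitation(num_samples, voiced, pitch_period, seed=0):
    """Periodic impulse train for voiced speech, random noise for unvoiced."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # assumption: Gaussian noise stands in for "random noise"
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def filter_excitation(excitation, impulse_response):
    """Convolve the excitation with the system impulse response."""
    output = []
    for n in range(len(excitation)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * excitation[n - k]
        output.append(acc)
    return output

# Hypothetical system parameters: a short decaying impulse response.
voiced_frame = filter_excitation(synthesize_excitation(16, True, 8), [1.0, 0.5, 0.25])
unvoiced_frame = filter_excitation(synthesize_excitation(16, False, 8), [1.0, 0.5, 0.25])
```

In the voiced case the impulse response simply repeats once per pitch period, which is the behavior the model attributes to voiced sounds.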
Even though vocoders based on this underlying speech model have produced intelligible speech, they have not been successful in producing high quality speech. As a consequence, they have not been widely used for high quality speech coding. The poor quality of the reconstructed speech is in part due to the inaccurate estimation of the model parameters and in part due to limitations in the speech model.
Another speech model, referred to as the Multi-Band Excitation (MBE) speech model, was developed by Griffin and Lim in 1984. Speech coders based on this speech model were developed by Griffin and Lim in 1986, and they were shown to be capable of producing high quality speech at rates above 8000 bps (bits per second). Subsequent work by Hardwick and Lim produced a 4800 bps MBE speech coder, which used more sophisticated quantization techniques to achieve similar quality at 4800 bps that earlier MBE speech coders had achieved at 8000 bps.
The 4800 bps MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated MBE speech model parameters. As shown schematically with respect to FIGS. 1A, 1B and 1C, a discrete speech signal, denoted by s (FIG. 1B), is obtained by sampling an analog (such as electromagnetic) speech signal AS (FIG. 1A). This is typically done at an 8 kHz sampling rate, although other sampling rates can easily be accommodated through a straightforward change in the various system parameters. The system divides the discrete speech signal s into small overlapping segments by multiplying s with a window (such as a Hamming window or a Kaiser window) to obtain a windowed signal segment s.sub.w (n) (FIG. 1B) (where n is the segment index). Each speech signal segment is then transformed from the time domain to the frequency domain to generate segment frames F.sub.w (n) (FIG. 1C). Each frame is analyzed to obtain a set of MBE speech model parameters that characterize that frame. The MBE speech model parameters consist of a fundamental frequency, or equivalently, a pitch period, a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases. These model parameters are then quantized using a fixed number of bits (for instance, digital electromagnetic signals) for each frame. The resulting bits can then be used to reconstruct the speech signal (e.g., an electromagnetic signal), by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters. A block diagram of the steps taken to code the spectral amplitudes by a typical MBE speech coder such as disclosed in U.S. Ser. No. 624,878, now U.S. Pat. No. 5,226,084, is shown in FIG. 2.
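The segmentation and transform steps can be sketched as follows. This is only an illustration under stated assumptions: the window length of 16, the hop of 8 samples, the test sinusoid, and the plain discrete Fourier transform are hypothetical choices, not parameters of the coder described here.

```python
import cmath
import math

def hamming_window(length):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def windowed_segments(signal, window, hop):
    """Overlapping segments of the signal, each multiplied by the window."""
    size = len(window)
    return [[signal[start + n] * window[n] for n in range(size)]
            for start in range(0, len(signal) - size + 1, hop)]

def dft_magnitudes(segment):
    """Magnitude of the discrete Fourier transform of one windowed segment."""
    size = len(segment)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / size)
                    for n, x in enumerate(segment)))
            for k in range(size)]

# Hypothetical input: a sinusoid with a period of 8 samples.
signal = [math.sin(2 * math.pi * n / 8) for n in range(32)]
segments = windowed_segments(signal, hamming_window(16), hop=8)
frames = [dft_magnitudes(segment) for segment in segments]
```

Each entry of `frames` plays the role of one frequency-domain frame from which model parameters would then be estimated.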
The invention described herein applies to many different speech coding methods, which include but are not limited to linear predictive speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders and improved multiband excitation (IMBE) speech coders. For the purpose of describing this invention in detail, a 7.2 kbps IMBE speech coder is used. This coder uses the robust speech model, referred to above as the Multi-Band Excitation (MBE) speech model. Another similar speech coder has been standardized as part of the INMARSAT-M (International Maritime Satellite Organization) satellite communication system.
Efficient methods for quantizing the MBE model parameters have been developed. These methods are capable of quantizing the model parameters at virtually any bit rate above 2 kbps. The representative 7.2 kbps IMBE speech coder uses a 50 Hz frame rate. Therefore, 144 bits are available per frame. Of these 144 bits, 57 bits are reserved for forward error correction and synchronization. The remaining 87 bits per frame are used to quantize the MBE model parameters, which consist of a fundamental frequency .omega., a set of K voiced/unvoiced decisions and a set of L spectral amplitudes M.sub.l. The values of K and L vary depending on the fundamental frequency of each frame. The 87 available bits are divided among the model parameters as shown in Table 1.
TABLE 1
______________________________________
Bit allocation
Parameter                   Number of Bits
______________________________________
Fundamental Frequency       8
Voiced/Unvoiced Decision    K
Spectral Amplitudes         79 - K
______________________________________
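The arithmetic behind Table 1 can be checked with a short sketch. The constants come from the text above; the function name and the example value K = 12 are hypothetical.

```python
BIT_RATE_BPS = 7200
FRAME_RATE_HZ = 50
FEC_AND_SYNC_BITS = 57
FUNDAMENTAL_BITS = 8

def bit_allocation(k):
    """Per-frame bit budget for a frame with k voiced/unvoiced decisions."""
    bits_per_frame = BIT_RATE_BPS // FRAME_RATE_HZ      # 144
    model_bits = bits_per_frame - FEC_AND_SYNC_BITS     # 87
    return {
        "bits_per_frame": bits_per_frame,
        "model_bits": model_bits,
        "fundamental_frequency": FUNDAMENTAL_BITS,
        "voiced_unvoiced": k,
        "spectral_amplitudes": model_bits - FUNDAMENTAL_BITS - k,  # 79 - k
    }

allocation = bit_allocation(12)  # hypothetical frame with K = 12
```

The spectral amplitudes absorb whatever bits the voiced/unvoiced decisions do not use, so the three entries always sum to 87.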
Although this spectral amplitude quantization method was designed for use in an MBE speech coder, the quantization techniques are equally useful in a number of different speech coding methods, such as the Sinusoidal Transform Coder and the Harmonic Coder.
As used herein, parameters designated with the hat accent (^) are the parameters as determined by the encoder, before they have been quantized or transmitted to the decoder. Parameters designated with the tilde accent (.about.) are the corresponding parameters that have been reconstructed from the bits to be transmitted, either by the decoder or by the encoder as it anticipatorily mimics the decoder, as explained below. Typically, the path from coder to decoder entails quantization of the hat parameter, followed by coding and transmission, followed by decoding and reconstruction. The two parameters can differ due to quantization and also due to bit errors introduced in the coding and transmission process. As explained below, in some instances, the coder uses the .about. parameters to anticipate action that the decoder will take. In such instances, the parameters used by the coder have been quantized and reconstructed, but will not have been subject to possible bit errors.
For a particular speech segment, the fundamental frequency .omega. is quantized by first converting it to its equivalent pitch period. Estimation of the fundamental frequency is described in detail in U.S. Ser. No. 624,878 and the PCT application.
The value of .omega. is typically restricted to a range. In general, a two step estimation method is used, with an initial pitch period estimate (the pitch period being related to the fundamental frequency by a specific function ##EQU1##) being restricted to a set of specified pitch periods, for instance corresponding to ##EQU2## The parameter P is uniformly quantized using 8 bits and a step size of 0.5. This corresponds to a pitch period accuracy of one half sample. The pitch period is then refined to obtain the final estimate, which has one-quarter-sample accuracy. The pitch period is quantized by finding the value: ##EQU3## The quantity b.sub.0 can be represented with eight bits using the following unsigned binary representation:
TABLE 2
______________________________________
Eight Bit Binary Representation
value       bits
______________________________________
0           0000 0000
1           0000 0001
2           0000 0010
.           .
.           .
.           .
255         1111 1111
______________________________________
This binary representation is used throughout the encoding and decoding of the IMBE model parameters.
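The unsigned representation of Table 2 can be reproduced with a few lines. The function name is hypothetical, and the space between the two four-bit groups simply follows the layout of the table.

```python
def eight_bit_representation(value):
    """Unsigned eight bit binary string, grouped as in Table 2."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in eight unsigned bits")
    bits = format(value, "08b")
    return bits[:4] + " " + bits[4:]

table_rows = [(value, eight_bit_representation(value)) for value in (0, 1, 2, 255)]
```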
For a particular segment, L denotes the number of spectral amplitudes in the frequency domain transform of that segment. The value of L is derived from the fundamental frequency for that frame, .omega., according to the relationship, ##EQU4## where 0.ltoreq..beta..ltoreq.1.0 determines the speech bandwidth relative to half the sampling rate. The function .left brkt-bot.x.right brkt-bot., referred to in Equation (2) (a "floor" function), is equal to the largest integer less than or equal to x. The L spectral amplitudes are denoted by M.sub.l for 1.ltoreq.l.ltoreq.L, where M.sub.1 is the lowest frequency spectral amplitude and M.sub.L is the highest frequency spectral amplitude.
The fundamental frequency is generated in the decoder by decoding and reconstructing the received value, to arrive at b.sub.0, from which .omega. can be generated according to the following: ##EQU5##
The spectral amplitudes for the current windowed speech segment, identified as s.sub.w (0) (with the parenthetical numeral 0 indicating the current segment, -1 indicating the preceding segment, +1 indicating the following segment, etc.), are quantized by first calculating a set of predicted spectral amplitudes based on the spectral amplitudes of the previous speech segment s.sub.w (-1). The predicted results are compared to the actual spectral amplitudes, and the difference for each spectral amplitude, termed a prediction residual, is calculated. The prediction residuals are passed to and used by the decoder.
The general method is shown schematically with reference to FIG. 2 and FIG. 3. (The process is recursive from one segment to the next, and also in some respect, between the coder and the decoder. Therefore, the explanation of the process is necessarily a bit circular, and starts midstream.) The vector M(0) is a vector of L unquantized spectral amplitudes, which define the spectral envelope of the current sampled window s.sub.w (0). For instance, as shown in FIG. 1C, M(0) is a vector of twenty-one spectral amplitudes, for the harmonic frequencies that define the shape of the spectral envelope for frame F.sub.w (0). L in this case is twenty-one. M.sub.l (0) represents the lth element in the vector, 1.ltoreq.l.ltoreq.L.
In general, the method includes coding steps 202 (FIG. 2), which take place in a coder, and decoding steps 302 (FIG. 3), which take place in a separate decoder. The coding steps include the steps discussed above, not shown in FIG. 2: i.e., sampling the analog signal AS; applying a window to the sampled signal s, to establish a segment s.sub.w (n) of sampled speech; transforming the sampled segment s.sub.w (n) from the time domain into a frame F.sub.w (n) in the frequency domain; and identifying the MBE speech model parameters that define that segment, i.e.: fundamental frequency .omega.; spectral amplitudes M(0) (also known as samples of the spectral envelope); and voiced/unvoiced decisions. These parameters are used as inputs to conduct the additional coding steps shown in FIG. 2. (The fundamental frequency in the cited literature is often specified with the subscript 0 as .omega..sub.0, to distinguish the fundamental from the harmonic frequencies. However, in the following, the subscript is not used, for typographical clarity. No confusion is caused, because the harmonic frequencies are not referred to with the variable .omega..)
The method uses post transmission prediction in the decoding steps conducted by the decoder as well as differential coding. A packet of data is sent from the coder to the decoder representing model parameters of each spectral segment. However, for some of the parameters, such as the spectral amplitudes, the coder does not transmit codes representing the actual full value of the parameter (except for the first frame). This is because the decoder makes a rough prediction of what the parameter will be for the current frame, based on what the decoder determined the parameter to be for the previous frame (based in turn on a combination of what the decoder previously received, and what it had determined for the frame preceding the preceding segment, and so on). Thus, the coder only codes and sends the difference between what the decoder will predict and the actual values. These differences are referred to as "prediction residuals." This vector of differences will, in general, require fewer bits for coding than would the coding of the absolute parameters.
The values that the decoder generates as output from summation 316 are the logarithm base 2 of what are referred to as the quantized spectral amplitudes, designated by the tilde-accented vector M. This is distinguished from the hat-accented vector of unquantized values M, which is the input to log.sub.2 block 204 in the coder. (The prediction steps are grouped in the dashed box 340.) To compute the vector of quantized spectral log amplitudes log.sub.2 M(0) for the current frame, the decoder stores the vector log.sub.2 M(-1) for the previous segment by taking a frame delay 312. At 314, the decoder computes the predicted spectral log amplitudes according to a method discussed below. It uses as inputs the vector log.sub.2 M(-1) and the reconstructed fundamental frequencies for the previous segment .omega.(-1) and for the current segment .omega.(0), which have been received and decoded by the decoder before decoding of the spectral log amplitudes.
These predicted values for the spectral log amplitudes are added at 316, with the decoded differential prediction residuals, that have been transmitted by the coder. The steps of reconstruction 318, reverse DCT transform 320 and reformation into six blocks 322 are explained below. It is only necessary to know that they decode a received vector b that the coder has generated, coding for the differences between, on the one hand, the actual values for log.sub.2 M(0) that must ultimately be recreated by the decoder and, on the other hand, the values that the coder has calculated will be predicted by the decoder in step 314. Because the coder cannot communicate with the decoder, to anticipate the prediction to be made by the decoder, the coder must also make the prediction, as closely as possible to the manner in which the decoder will make the prediction. The prediction in the decoder is based on the values log.sub.2 M(-1) for the previous segment generated by the decoder. Therefore, the coder must also generate these values, as if it were the decoder, as discussed below, so that it anticipatorily mirrors the steps that will be taken by the decoder.
Thus, if the coder accurately anticipates the prediction that the decoder will make with respect to the spectral log amplitudes log.sub.2 M(0), the values b to be transmitted by the encoder will reflect the difference between the prediction and the actual values log.sub.2 M(0). In the decoder, at 316, upon addition, the result is the tilde-accented log.sub.2 M(0), a quantized version of the actual (hat-accented) values log.sub.2 M(0).
The coder, during the simulation of the decoder steps at 240, conducts steps that correspond to the steps that will be performed by the decoder, in order for the coder to anticipatorily mirror how the decoder will predict the values for log.sub.2 M(0) based on the previous computed values log.sub.2 M(-1). In other words, the coder conducts steps 240 that mimic the steps conducted by the decoder. The coder has previously produced the actual values M(0). The logarithm base two of this vector is taken at 204. At 216, the coder subtracts from this logarithm vector a vector of the predicted spectral log amplitudes, calculated at step 214. The coder uses the same steps for computing the predicted values as will the decoder, and uses the same inputs as will the decoder: the reconstructed fundamental frequencies .omega.(0) and .omega.(-1), and log.sub.2 M(-1). It will be recalled that log.sub.2 M(-1) is the value that the decoder has computed for the previous frame (after the decoder has performed its rough prediction and then adjusted the prediction with addition of the prediction residual values transmitted by the coder).
Thus, the coder generates log.sub.2 M(-1) by performing the exact steps that the decoder performed to generate log.sub.2 M(-1). With respect to the previous segment, the coder had sent to the decoder a vector b.sub.l (-1) where 2.ltoreq.l.ltoreq.L+3. (The generation of the vector b.sub.l is discussed below.) Thus, to recreate the steps that the decoder will perform, at 218, the coder reconstructs the values of the vector b.sub.l (-1) into DCT coefficients as the decoder will do. An inverse DCT transform (or inverse of whatever suitable transform is used in the forward transformation part of the coder at step 206) is performed at 220, and reformation into blocks is conducted at 222. At this point, the coder will have produced the same vector as the decoder produces at the output of reformation step 322. At 226, this is added to the predicted spectral log amplitudes for the previous frame F.sub.w (-2), to arrive at the output from decoder log.sub.2 M(-1). The result of the summation in the coder at 226, log.sub.2 M(-1), is stored by implementing a frame delay 212, after which it is used as discussed above to simulate the decoder's prediction of log.sub.2 M(0).
The vector b.sub.l is generated in the coder as follows. At 216, the coder subtracts the vector that the coder calculates the decoder will predict, from the actual values of log.sub.2 M(0) to produce a vector T. At 210, this vector is divided into blocks, for instance six, and at 206 a transform, such as a DCT, is performed. Other sorts of transforms, such as Discrete Fourier, may also be used. The output of the DCT transform is organized in two groups: a set of D.C. values, associated into a vector referred to as the Prediction Residual Block Average (PRBA); and the remaining, higher order coefficients, both of which are quantized at 208 and are designated as the vector b.sub.l.
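The block division and transform steps can be sketched as follows. The preferred block division rule and the exact transform scaling are given in the referenced patent and are not reproduced here, so this sketch assumes roughly equal contiguous blocks and a DCT scaled so that its D.C. coefficient equals the block mean. All names and the sample residual values are hypothetical.

```python
import math

def split_into_blocks(residuals, num_blocks=6):
    """Assumption: contiguous blocks of nearly equal length, longer blocks last."""
    if len(residuals) < num_blocks:
        raise ValueError("need at least one residual per block")
    base, extra = divmod(len(residuals), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i >= num_blocks - extra else 0)
        blocks.append(residuals[start:start + size])
        start += size
    return blocks

def dct(block):
    """DCT-II scaled by 1/len(block), so coefficient 0 is the block average."""
    size = len(block)
    return [sum(x * math.cos(math.pi * k * (j + 0.5) / size)
                for j, x in enumerate(block)) / size
            for k in range(size)]

residuals = [0.1 * n for n in range(21)]          # hypothetical vector T with L = 21
coefficients = [dct(block) for block in split_into_blocks(residuals)]
prba = [c[0] for c in coefficients]               # D.C. value of each block
higher_order = [c[1:] for c in coefficients]      # remaining coefficients
```

With this scaling the `prba` list holds one block average per block, which matches the name Prediction Residual Block Average.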
These values are sent to the decoder, and are also used in the steps 240 to simulate the decoder mentioned above, to simulate how the decoder will predict the vector for the current segment.
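The way the coder mirrors the decoder can be illustrated with a deliberately simplified sketch. It assumes a single scalar parameter, a uniform quantizer with a hypothetical step size, and a prediction that is simply the previous reconstructed value; the transform, block, and interpolation steps of the actual method are omitted.

```python
def quantize(value, step):
    """Uniform scalar quantizer: returns an integer index."""
    return round(value / step)

def reconstruct(index, step):
    return index * step

class Decoder:
    def __init__(self, step):
        self.step = step
        self.previous = 0.0  # reconstructed value for the previous frame

    def decode(self, index):
        predicted = self.previous  # assumption: prediction is the previous value
        self.previous = predicted + reconstruct(index, self.step)
        return self.previous

class Coder:
    def __init__(self, step):
        self.step = step
        self.mirror = Decoder(step)  # duplicate of the decoder's prediction path

    def encode(self, actual):
        predicted = self.mirror.previous
        index = quantize(actual - predicted, self.step)
        self.mirror.decode(index)  # keep the mirror in step with the real decoder
        return index

coder, decoder = Coder(step=0.5), Decoder(step=0.5)
actual_values = [3.2, 3.9, 2.1]
decoded_values = [decoder.decode(coder.encode(v)) for v in actual_values]
```

Because the residual is computed against the mirrored reconstruction rather than against the previous unquantized value, quantization error does not accumulate from frame to frame when no bit errors occur.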
Special considerations are taken with respect to the first segment, since the decoder will not have at its disposal a preceding segment to use in its predictions.
The foregoing method, of coding and decoding using predicted values and transmitted prediction residuals, is discussed fully in the PCT patent application and U.S. Ser. No. 624,878, now U.S. Pat. No. 5,226,084.
This quantization method provides very good fidelity using a small number of bits and it maintains this fidelity as L varies over its range. The computational requirements of this approach are well within the limits required for real-time implementation using a single DSP such as the DSP32C available from AT & T. This quantization method separates the spectral amplitudes into a few components, such as the mean of the PRBA vector, that are sensitive to bit errors and a large number of other components that are not very sensitive to bit errors. Forward error correction can then be used in an efficient manner by providing a high degree of protection for the few sensitive components and a lesser degree of protection for the remaining components.
Turning now to a rudimentary known method by which the decoder predicts at 314 the values for the spectral amplitudes of the current segment, based on the spectral amplitudes of the previous segment, as has been mentioned, L(0) denotes the number of spectral amplitudes in the current speech segment and L(-1) denotes the number of spectral amplitudes in the previous speech segment. A rudimentary method for generating the prediction residuals, T.sub.l for 1.ltoreq.l.ltoreq.L(0) is given by, ##EQU6## where M.sub.l (0) denotes the spectral amplitudes of the current speech segment and M.sub.l (-1) denotes the quantized spectral amplitudes of the previous speech segment. The constant .gamma., a decay factor, is typically equal to 0.7; however, any value in the range 0.ltoreq..gamma..ltoreq.1 can be used. The effect and purpose of the constant .gamma. are explained below.
For instance, as shown in FIG. 1C, L(0) is 21 and L(-1) is 7. The fundamental frequency .omega.(0) of the current frame F.sub.w (0) is .alpha. and the fundamental frequency .omega.(-1) of the previous frame F.sub.w (-1) is 3.alpha.. It has been determined that it is often the case that the shape of the spectral envelope curve h for adjacent segments of speech is rather similar. As can be seen from FIG. 1C, the shape of the spectral envelope in the previous frame and the current frame (i.e., the shape of curves h(0) and h(-1)) is rather similar. The fundamental frequency of each segment can differ significantly while the envelope shape remains similar.
From inspection of Equation (4), it can be seen that this method works well if .omega.(0) and .omega.(-1) are relatively close to each other; however, if they differ significantly, the prediction can be quite inaccurate. Each harmonic amplitude can be identified by an index number representing its position along the frequency axis. For instance, for the example set forth above, according to the rudimentary method, the value for the first of the harmonic amplitudes in the current frame would be predicted to be equal to the value of the first harmonic amplitude in the previous frame. Similarly, the value of the fourth harmonic amplitude would be predicted to be equal to the value of the fourth harmonic amplitude in the previous frame. This is so despite the fact that the fourth harmonic amplitude in the current frame is closer in value to an interpolation between the amplitudes of the first and second harmonics of the previous frame than to the value of the fourth harmonic. Further, the eighth through twenty-first harmonic amplitudes of the current frame would all have the value of the last, L(-1)th or seventh, harmonic amplitude of the previous frame.
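The rudimentary prediction can be sketched from the description above. The equation itself is not reproduced in this text, so the form used here, the residual being the current log amplitude minus .gamma. times the previous log amplitude with the same index, with indices beyond L(-1) reusing the last amplitude of the previous frame, is an assumption inferred from the surrounding prose. The names and sample values are hypothetical.

```python
import math

def rudimentary_residuals(current, previous_quantized, gamma=0.7):
    """Residual per harmonic using same-index prediction (no interpolation)."""
    last = len(previous_quantized)  # L(-1)
    residuals = []
    for l in range(1, len(current) + 1):
        source = min(l, last)  # indices past L(-1) reuse the last previous amplitude
        predicted = gamma * math.log2(previous_quantized[source - 1])
        residuals.append(math.log2(current[l - 1]) - predicted)
    return residuals

# Hypothetical amplitudes: L(0) = 3 and L(-1) = 2.
example_residuals = rudimentary_residuals([8.0, 8.0, 8.0], [4.0, 2.0])
```

Nothing in this calculation depends on either fundamental frequency, which is the weakness the text goes on to discuss.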
This rudimentary method does not account for any change in the fundamental frequency .omega. between the previous segment and current frame. In order to account for the change in the fundamental frequency, the PCT application, and U.S. Ser. No. 624,878 (now U.S. Pat. No. 5,226,084) disclose a method that first interpolates a spectral amplitude of the previous segment that may fall between harmonics. For instance, the frequency 1/3 of the way between the second and the third harmonics of the previous frame is interpolated. This is typically done using linear interpolation, however various other forms of interpolation could also be used. Then the interpolated spectral amplitudes of the previous frame are resampled at the frequency points corresponding to the harmonic in question of the current frame. This combination of interpolation and resampling produces a set of predicted spectral amplitudes, which have been corrected for any inter-frame change in the fundamental frequency.
It is helpful to define a value relative to the lth index of the current frame: ##EQU7## Thus, k.sub.l represents a relative index number. If the ratio of the current to the previous fundamental frequencies is 1/3, as in the example, k.sub.l is equal to 1/3.multidot.l, for each index number l.
If linear interpolation is used to compute the predicted spectral log amplitudes, then a predicted spectral log amplitude for the lth harmonic of the current frame can be expressed as: ##EQU8## where .gamma. is as above.
Thus, the predicted value is interpolated between two actual values of the previous frame. For instance, the predicted value for the seventh harmonic amplitude (l=7) is equal to 2/3 (the term in bracket a) the log of the second amplitude of the prior frame (term in bracket b) plus 1/3 (term in bracket c) the log of the third amplitude of the prior frame (term in bracket d). Thus, the predicted value is a sort of weighted average between the two harmonic amplitudes of the previous frame closest in frequency to the harmonic amplitude in question of the current frame.
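The interpolated prediction can be sketched as follows. The sketch assumes that the relative index is the ratio of the current to the previous fundamental frequency multiplied by the harmonic index, and that the two interpolation weights are one minus the fractional part and the fractional part of that relative index, which reproduces the 2/3 and 1/3 weights of the seventh-harmonic example. Clamping out-of-range indices to the first and last harmonics of the previous frame is an assumption of this sketch, as are all names and sample values.

```python
import math

def predicted_log_amplitudes(previous_quantized, omega_current, omega_previous,
                             num_current, gamma=0.7):
    """Linear interpolation of the previous frame's log amplitudes."""
    last = len(previous_quantized)  # L(-1)
    logs = [math.log2(m) for m in previous_quantized]

    def log_at(index):
        # Assumption: clamp harmonic indices to the range 1..L(-1).
        return logs[min(max(index, 1), last) - 1]

    predictions = []
    for l in range(1, num_current + 1):
        k = omega_current / omega_previous * l  # relative index
        lower = math.floor(k)
        frac = k - lower
        predictions.append(
            gamma * ((1 - frac) * log_at(lower) + frac * log_at(lower + 1)))
    return predictions

# Hypothetical previous frame with log amplitudes 1, 2, ..., 7 and a current
# fundamental that is one third of the previous one.
previous = [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
predictions = predicted_log_amplitudes(previous, 1.0, 3.0, num_current=21, gamma=1.0)
```

With the decay factor set to 1 for readability, the seventh prediction is 2/3 of log amplitude 2 plus 1/3 of log amplitude 3.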
Thus, this value is the value that the decoder will predict for the log amplitude of the harmonic frequencies that define the spectral envelope for the current frame. The coder also generates this prediction value in anticipation of the decoder's prediction, and then calculates a prediction residual vector, T.sub.l, essentially equal to the difference between the actual value the coder has generated and the predicted value that the coder has calculated that the decoder will generate: EQU T.sub.l =log.sub.2 M.sub.l (0) - (predicted spectral log amplitude for the lth harmonic) (7)
It is disclosed in U.S. Ser. No. 624,878 that .gamma., which is incorporated into the predicted spectral log amplitudes, can be adaptively changed from frame to frame in order to improve performance.
If the current and previous fundamental frequencies are the same, the improved method results are identical to the rudimentary method. In other cases the improved method produces a prediction residual with lower variance than the former method. This allows the prediction residuals to be quantized with less distortion for a given number of bits.
Turning now to an explanation of the purpose for the factor .gamma., as has been mentioned, the coder does not transmit absolute values from the coder to the decoder. Rather, the coder transmits a differential value, calculated to be the difference between the current value, and a prediction of the current value made on the basis of previous values. The differential value that is received by the decoder can be erroneous, either due to computation errors or bit transmission errors. If so, the error will be incorporated into the current reconstructed frame, and will further be perpetuated into subsequent frames, since the decoder makes a prediction for the next frame based on the previous frame. Thus, the erroneous prediction will be used as a basis for the reconstruction of the next segment.
The encoder does include a mirror, or duplicate of the portion of the decoder that makes the prediction. However, the inputs to the duplicate are not values that may have been corrupted during transmission, since, such errors arise unexpectedly in transmission and cannot be duplicated. Therefore, differences can arise between the predictions made by the decoder, and the mirroring predictions made in the encoder. These differences detract from the quality of the coding scheme.
Thus, the factor .gamma. causes any such error to "decay" away after a number of future segments, so that any errors are not perpetuated indefinitely. This is shown schematically in FIG. 4. Sub panels A and B of FIG. 4 show the effect of a transmitted error with no factor .gamma. (which is the same as .gamma. equal to 1). The amplitude of a single spectral harmonic is shown for the current frame x(0), and the five preceding frames x(-1), x(-2), etc. The vertical axis represents amplitude and the horizontal axis represents time. The values sent .delta.(n) are indicated below the amplitude which is recreated from the differential value being added to the previous value. (This example does not exactly follow the method under discussion, since it does not include any prediction, or interpolation, for simplicity. It is merely designed to show how an error is perpetuated in a differential coding scheme, and how the factor .gamma. can be used to reduce the error over time.) The original values are represented as points and the reconstructed values are represented as boxes. For instance, .delta.(-4) equals 10, the difference x(-4) minus x(-5). Similarly, .delta.(-2) equals -20, the difference x(-2) minus x(-3). The reconstructions are according to the formula: EQU x(n)=x(n-1)+.delta.(n). (8)
Panel A shows the situation if the correct .delta. values are sent. The reconstructed values equal the original values.
Panel B shows the situation if an incorrect value is transmitted, for instance .delta.(-3) equals +40 rather than +10. The reconstructed value for x(-3) equals 50, rather than 20, and all of the subsequent values, which are based on x(-3), are offset by 30 from the correct original. The error perpetuates in time.
Panel C shows the situation if a factor .gamma. is used. The differential that will be sent is no longer the simple difference, but rather: EQU .delta.(n)=x(n)-.gamma..multidot.(x(n-1)). (9)
Consequently, the reconstructions are according to the following formula: EQU x(n)=.gamma..multidot.x(n-1)+.delta.(n). (10)
Thus, .delta.(-3) equals +12.5, etc. If no error corrupts the values sent, the reconstructed values (boxes) are the same as the original, as shown in panel C. However, if an error, such as a bit error corrupts the differential values sent, such as sending .delta.(-3) equals +40 rather than +12.5, the effect of the error is minimized, and decays with time. The errant value is reconstructed as 47.5 rather than the 50 that would be the case with no decay factor. The next value, which should be zero, is reconstructed as 20.63, rather than as 30 in the case where no .gamma. decay factor is used. The next value, also properly equal to zero, is reconstructed as 15.47, which, although incorrect, is closer to being correct than the 30 that would again be calculated without the decay factor. The next calculated value is even closer to being correct, and so on.
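The numbers in this example can be reproduced with a short sketch of Equations (8) through (10). The decay factor of 0.75 is not stated in the text; it is inferred here from the value 12.5 quoted for the differential at frame -3. The original value of zero for the current frame x(0) is likewise an assumption, and the function names are hypothetical.

```python
def encode(values, gamma):
    """Differentials per Equation (9): delta(n) = x(n) - gamma * x(n-1)."""
    return [values[n] - gamma * values[n - 1] for n in range(1, len(values))]

def decode(first, deltas, gamma):
    """Reconstruction per Equation (10): x(n) = gamma * x(n-1) + delta(n)."""
    out = [first]
    for delta in deltas:
        out.append(gamma * out[-1] + delta)
    return out

originals = [0.0, 10.0, 20.0, 0.0, 0.0, 0.0]  # x(-5) ... x(0); x(0) is assumed

# With decay (gamma inferred as 0.75): corrupt delta(-3) from 12.5 to 40.
deltas_decay = encode(originals, 0.75)
corrupted_decay = list(deltas_decay)
corrupted_decay[1] = 40.0
with_decay = decode(0.0, corrupted_decay, 0.75)

# Without decay (gamma = 1): corrupt delta(-3) from 10 to 40.
deltas_plain = encode(originals, 1.0)
corrupted_plain = list(deltas_plain)
corrupted_plain[1] = 40.0
without_decay = decode(0.0, corrupted_plain, 1.0)
```

Here `with_decay` comes out as 0, 10, 47.5, 20.625, 15.46875, 11.6015625, while `without_decay` stays offset by 30 from frame -3 onward (0, 10, 50, 30, 30, 30).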
The decay factor can be any number between zero and one. If a smaller factor, such as 0.5 is used, the error will decay away faster. However, less of a coding advantage will be gained from the differential coding, because the differential is necessarily increased. The reason for using differential coding is to obtain an advantage when the frame-to-frame difference, as compared to the absolute value, is small. In such a case, there is a significant coding advantage for differential coding. Decreasing the value of the decay factor increases the differences between the predicted and the actual values, which means more bits must be used to achieve the same quantization accuracy.
Returning to a discussion of the coding process, the prediction residuals T.sub.1 are divided into blocks. A preferred method for dividing the residuals into blocks and then generating DCT coefficients is disclosed fully in U.S. Ser. No. 624,878 (now U.S. Pat. No. 5,226,084) and the PCT application.
Once each DCT coefficient has been quantized using the number of bits specified by a bit allocation rule, the binary representation can be transmitted, stored, etc., depending on the application. The spectral log amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining with the quantized spectral log amplitudes of the previous segment using the inverse of Equation (7).
Since bit errors exist in many speech coder applications, a robust speech coder must be able to correct, detect and/or tolerate bit errors. One technique which has been found to be very successful is to use error correction codes in the binary representation of the model parameters.
Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The estimate of the error rate can then be used to adaptively process the model parameters to reduce the effect of any remaining bit errors.
According to a representative error correction and protection method, the quantized speech model parameter bits are divided into three or more different groups according to their sensitivity to bit errors, and then different error correction or detection codes are used for each group. Typically the group of data bits which is determined to be most sensitive to bit errors is protected using very effective error correction codes. Less effective error correction or detection codes, which require fewer additional bits, are used to protect the less sensitive data bits. This method allows the amount of error correction or detection given to each group to be matched to its sensitivity to bit errors. The degradation caused by bit errors is relatively low, as is the number of bits required for forward error correction.
The particular choice of error correction or detection codes which is used depends upon the bit error statistics of the transmission or storage medium and the desired bit rate. The most sensitive group of bits is typically protected with an effective error correction code such as a Hamming code, a BCH code, a Golay code or a Reed-Solomon code.
Less sensitive groups of data bits may use these codes or an error detection code. Finally, the least sensitive groups may use error correction or detection codes, or they may not use any form of error correction or detection. The error correction and detection codes used herein are well suited to a 6.4 kbps IMBE speech coder for satellite communications.
In the representative speech coder (and also in the coder that was standardized for the INMARSAT-M satellite communication system), the bits per frame which are reserved for forward error correction are divided among [23,12] Golay codes, which can correct up to 3 errors, [15,11] Hamming codes, which can correct single errors, and parity bits. The six most significant bits from the fundamental frequency .omega. and the three most significant bits from the mean of the PRBA vector are first combined with three parity check bits and then encoded in a [23,12] Golay code. Thus, all of the six most significant bits of .omega. are protected against bit errors. A second Golay code is used to encode the three most significant bits from the PRBA vector and the nine most sensitive bits from the higher order DCT coefficients. All of the remaining bits except the seven least sensitive bits are then encoded into five [15,11] Hamming codes. The seven least sensitive bits are not protected with error correction codes.
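The bit budget implied by this allocation can be tallied as a short worked example. The variable names are hypothetical, and the frame duration at the end is only an inference: it assumes that these codewords and unprotected bits make up the entire frame of a 6.4 kbps coder.

```python
# Each entry: (codeword length n, data bits k, number of codewords).
golay = (23, 12, 2)
hamming = (15, 11, 5)
unprotected_bits = 7
parity_check_bits = 3  # carried inside the first Golay codeword

def tally(codes, unprotected):
    """Return (transmitted bits, data bits) for the given allocation."""
    transmitted = sum(n * count for n, k, count in codes) + unprotected
    data = sum(k * count for n, k, count in codes) + unprotected
    return transmitted, data

transmitted, data = tally([golay, hamming], unprotected_bits)
model_parameter_bits = data - parity_check_bits    # parity bits carry no speech data
overhead_bits = transmitted - model_parameter_bits

# ASSUMPTION: the whole frame consists of these bits at 6400 bits per second.
frame_ms = transmitted * 1000 / 6400

print(transmitted, model_parameter_bits, overhead_bits, frame_ms)
```

Under these assumptions the frame carries 128 bits, of which 83 are model parameter bits and 45 are error correction and detection overhead, which corresponds to a 20 ms frame at 6.4 kbps.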
At the decoder, the received bits are passed through Golay and Hamming decoders, which attempt to remove any bit errors from the data bits. The three parity check bits are then checked. If no uncorrectable bit errors are detected, the received bits are used to reconstruct the MBE model parameters for the current frame. Otherwise, if an uncorrectable bit error is detected, the received bits for the current frame are ignored and the model parameters from the previous frame are repeated for the current frame. Techniques for addressing these bit error problems are discussed fully in U.S. Ser. No. 624,878 (now U.S. Pat. No. 5,226,084) and the PCT application.
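The frame repeat fallback can be sketched as follows. The function and parameter names are hypothetical, and the outcome of the parity check is represented by a boolean flag, since the way the three parity check bits are computed is not stated in this passage.

```python
def select_frame_parameters(decoded_params, uncorrectable_error, previous_params):
    """Use the decoded parameters unless the frame failed its parity check."""
    if uncorrectable_error:
        # Ignore the received bits and repeat the previous frame.
        return previous_params
    return decoded_params

def decode_stream(frames, initial_params):
    """frames: iterable of (decoded_params, uncorrectable_error) pairs."""
    previous = initial_params
    output = []
    for decoded_params, uncorrectable_error in frames:
        previous = select_frame_parameters(
            decoded_params, uncorrectable_error, previous)
        output.append(previous)
    return output
```

Because the selected parameters become the "previous" parameters for the next frame, a run of consecutive bad frames keeps repeating the last good frame in this sketch.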
The methods described in U.S. Ser. No. 624,878 (now U.S. Pat. No. 5,226,084) and the PCT application provide good results. However, improvements are desirable in efficiency and in resistance to bit errors (robustness). For instance, both the coder and the decoder use the value .omega. as an input for making their predictions of the spectral log amplitudes. However, according to an efficient error correction method, not all of the bits of .omega. are protected. As mentioned above, only the six most significant bits are protected. Errors in the unprotected bits can result in significant errors in the predicted spectral log amplitudes generated by the decoder, particularly for higher harmonics. These errors do not arise in the encoder. Thus, a difference arises between the predictions that the coder makes and the predictions that the decoder makes. This causes a degradation of the reconstructed signal. It is desirable to avoid this signal degradation.
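One way to see why the higher harmonics suffer most is to note that the l-th harmonic lies at l times the fundamental frequency, so a small error in .omega. is multiplied by the harmonic number. The following sketch illustrates only this scaling; the numerical values are hypothetical and are not taken from any coder, and the sketch does not model the prediction itself.

```python
def harmonic_frequencies(omega, num_harmonics):
    """Frequencies of harmonics 1..num_harmonics of the fundamental omega."""
    return [l * omega for l in range(1, num_harmonics + 1)]

def harmonic_shifts(omega_encoder, omega_decoder, num_harmonics):
    """Frequency mismatch of each harmonic between encoder and decoder."""
    enc = harmonic_frequencies(omega_encoder, num_harmonics)
    dec = harmonic_frequencies(omega_decoder, num_harmonics)
    return [d - e for e, d in zip(enc, dec)]

# Hypothetical values: a small error in an unprotected low-order bit of omega.
omega_encoder = 0.100                    # radians per sample, as seen by the encoder
omega_decoder = omega_encoder + 0.001    # as received by the decoder

shifts = harmonic_shifts(omega_encoder, omega_decoder, 30)
print(shifts[0], shifts[-1])   # mismatch at the 1st and at the 30th harmonic
```

With these values the mismatch at the 30th harmonic is thirty times the mismatch at the first, so the decoder may associate a high harmonic with a noticeably different part of the previous frame's spectrum than the encoder did.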
It has also been determined that use of a constant value for the decay factor .gamma. has drawbacks. In frames having relatively few harmonic amplitudes, i.e., where L is rather small, it is not so important to save bits by using a highly differential form of parameter coding. This is because enough bits are available to specify each parameter more closely to its actual value: a fixed number of bits is available to specify a relatively small number of harmonic amplitudes, as compared to the maximum number of harmonic amplitudes that must on some occasions be specified by the same fixed number of bits. Although it has been stated in the PCT application that the decay factor .gamma. can be adaptively changed from segment to segment in order to improve performance, no method has been proposed by which performance can actually be improved in this way.
Another drawback to methods such as those described above is that the spectral envelopes of timewise adjacent windowed frames typically bear many similarities and some differences. The method discussed in U.S. Ser. No. 624,878 (now U.S. Pat. No. 5,226,084) takes advantage of the similarities to enable the decoder to predict the spectral envelope for the current frame. However, the differences between adjacent frames can reduce the beneficial effects, particularly due to the differential form of the known method. The principal similarity between adjacent frames is the shape of the curve h that connects each successive spectral amplitude M.sub.1 (n), M.sub.1+1 (n), M.sub.1+2 (n), etc. Thus, from one frame to the next, the shape of this curve is relatively similar. However, what is often different from one frame to the next is the average value of the harmonic amplitudes. In other words, the curves, although of similar shapes, are displaced different distances from the origin.
Because the known method uses the previous frame to predict the current frame (which is essentially a differential sort of prediction), the predictions for the current frame will be based on the general location of the curve, or the distance from the origin. Since the current frame does not necessarily share the general location of the curve with its predecessor, the difference between the prediction and the actual value for the spectral amplitudes of the current frame can be quite large. Further, because the system is basically a differential coding system, as explained above, differential errors take a relatively long time to decay away. Since it is an object of the prediction method to minimize the prediction residuals, this effect is undesirable.
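The slow decay of differential errors can be illustrated with a simple model. The sketch assumes a first-order predictor in which each reconstructed value equals the received residual plus .gamma. times the previous reconstructed value, so that an error introduced in one frame is scaled by .gamma. in each later frame. The value of .gamma. used below is hypothetical, as are the function names.

```python
import math

def propagate_error(initial_error, gamma, num_frames):
    """Error remaining in each of the following frames, with no new errors."""
    errors = [initial_error]
    for _ in range(num_frames):
        # ASSUMPTION: the error is carried forward only through the
        # gamma-weighted prediction from the previous frame.
        errors.append(gamma * errors[-1])
    return errors

def frames_to_decay(gamma, fraction):
    """Smallest n for which gamma**n is at or below the given fraction."""
    return math.ceil(math.log(fraction) / math.log(gamma))

gamma = 0.7  # hypothetical decay factor
print(propagate_error(1.0, gamma, 3))
print(frames_to_decay(gamma, 0.1))
```

With this hypothetical .gamma., an error needs seven frames to fall below one tenth of its initial size; a value of .gamma. closer to one would lengthen the decay further, while a smaller value would shorten it at the cost of a weaker prediction.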