This invention relates to methods for quantizing speech and for preserving the quality of speech in the presence of bit errors.
Relevant publications include: J. L. Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pp. 378-386 (discusses phase vocoder, a frequency-based speech analysis-synthesis system); Quatieri et al., "Speech Transformations Based on a Sinusoidal Representation", IEEE TASSP, Vol. ASSP-34, No. 6, Dec. 1986, pp. 1449-1464 (discusses analysis-synthesis technique based on a sinusoidal representation); Griffin, "Multiband Excitation Vocoder", Ph.D. Thesis, M.I.T., 1987 (discusses an 8000 bps Multi-Band Excitation speech coder); Griffin et al., "A High Quality 9.6 kbps Speech Coding System", Proc. ICASSP 86, pp. 125-128, Tokyo, Japan, Apr. 13-20, 1986 (discusses a 9600 bps Multi-Band Excitation speech coder); Griffin et al., "A New Model-Based Speech Analysis/Synthesis System", Proc. ICASSP 85, pp. 513-516, Tampa, Fla., Mar. 26-29, 1985 (discusses Multi-Band Excitation speech model); Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder", S.M. Thesis, M.I.T., May 1988 (discusses a 4800 bps Multi-Band Excitation speech coder); McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech", Proc. ICASSP 85, pp. 945-948, Tampa, Fla., Mar. 26-29, 1985 (discusses speech coding based on a sinusoidal representation); Campbell et al., "The New 4800 bps Voice Coding Standard", Mil Speech Tech Conference, Nov. 1989 (discusses error correction in low rate speech coders); Campbell et al., "CELP Coding for Land Mobile Radio Applications", Proc. ICASSP 90, pp. 465-468, Albuquerque, N.M., Apr. 3-6, 1990 (discusses error correction in low rate speech coders); Levesque et al., Error-Control Techniques for Digital Communication, Wiley, 1985, pp. 157-170 (discusses error correction in general); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984 (discusses quantization in general); Makhoul et al., "Vector Quantization in Speech Coding", Proc. IEEE, 1985, pp. 1551-1558 (discusses vector quantization in general).
The contents of these publications are incorporated herein by reference.
The problem of speech coding (compressing speech into a small number of bits) has a large number of applications, and as a result has received considerable attention in the literature. One class of speech coders (vocoders) which has been extensively studied and used in practice is based on an underlying model of speech. Examples from this class of vocoders include linear prediction vocoders, homomorphic vocoders, and channel vocoders. In these vocoders, speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. For this class of vocoders, speech is analyzed by first segmenting speech using a window such as a Hamming window. Then, for each segment of speech, the excitation parameters and system parameters are estimated and quantized. The excitation parameters consist of the voiced/unvoiced decision and the pitch period. The system parameters consist of the spectral envelope or the impulse response of the system. In order to reconstruct speech, the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.
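The synthesis step of this classical model can be illustrated with a minimal sketch. The function names, the three-tap impulse response, and the pitch period of 4 samples are illustrative assumptions and do not come from any particular vocoder.

```python
import random

def make_excitation(n_samples, voiced, pitch_period, seed=0):
    """Periodic impulse train for voiced segments, white noise for unvoiced ones."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def filter_fir(excitation, impulse_response):
    """Filter the excitation with the system impulse response (direct convolution)."""
    out = []
    for n in range(len(excitation)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * excitation[n - k]
        out.append(acc)
    return out

# Hypothetical system parameters: a short decaying impulse response.
h = [1.0, 0.5, 0.25]
voiced_frame = filter_fir(make_excitation(12, True, 4), h)
unvoiced_frame = filter_fir(make_excitation(12, False, 4), h)
```

In the voiced case the output is simply the impulse response repeated once per pitch period, which is the behavior the model describes.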
Even though vocoders based on this underlying speech model have been quite successful in producing intelligible speech, they have not been successful in producing high-quality speech. As a consequence, they have not been widely used for high-quality speech coding. The poor quality of the reconstructed speech is in part due to the inaccurate estimation of the model parameters and in part due to limitations in the speech model.
A new speech model, referred to as the Multi-Band Excitation (MBE) speech model, was developed by Griffin and Lim in 1984. Speech coders based on this new speech model were developed by Griffin and Lim in 1986, and they were shown to be capable of producing high quality speech at rates above 8000 bps (bits per second). Subsequent work by Hardwick and Lim produced a 4800 bps MBE speech coder which was also capable of producing high quality speech. This 4800 bps speech coder used more sophisticated quantization techniques to achieve, at 4800 bps, quality similar to that which earlier MBE speech coders had achieved at 8000 bps.
The 4800 bps MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated MBE speech model parameters. A discrete speech signal, denoted by $s(n)$, is obtained by sampling an analog speech signal. This is typically done at an 8 kHz sampling rate, although other sampling rates can easily be accommodated through a straightforward change in the various system parameters. The system divides the discrete speech signal into small overlapping segments by multiplying $s(n)$ with a window $w(n)$ (such as a Hamming window or a Kaiser window) to obtain a windowed signal $s_w(n)$. Each speech segment is then analyzed to obtain a set of MBE speech model parameters which characterize that segment. The MBE speech model parameters consist of a fundamental frequency, which is equivalent to the pitch period, a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases. These model parameters are then quantized using a fixed number of bits for each segment. The resulting bits can then be used to reconstruct the speech signal, by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters. A block diagram of a typical MBE speech coder is shown in FIG. 1.
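The segmentation step can be sketched as follows. The 256-sample Hamming window, the 160-sample hop (20 ms at 8 kHz), and the 200 Hz test tone are assumptions chosen for illustration; the text above does not specify them.

```python
import math

def hamming(length):
    """Hamming window w(n) for 0 <= n < length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_segments(s, window, hop):
    """Return overlapping windowed segments s_w(n) = s(start + n) * w(n)."""
    segments = []
    for start in range(0, len(s) - len(window) + 1, hop):
        segments.append([s[start + n] * window[n] for n in range(len(window))])
    return segments

fs = 8000                                   # sampling rate in Hz
s = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(800)]
w = hamming(256)                            # assumed window length
segments = windowed_segments(s, w, hop=160) # assumed hop; segments overlap by 96 samples
```

Because the hop is shorter than the window, consecutive segments share samples, which is what "overlapping segments" means here.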
The 4800 bps MBE speech coder required the use of a sophisticated technique to quantize the spectral amplitudes. For each speech segment, the number of bits which could be used to quantize the spectral amplitudes varied between 50 and 125 bits. In addition, the number of spectral amplitudes for each segment varied between 9 and 60. A quantization method was devised which could efficiently represent all of the spectral amplitudes with the number of bits available for each segment. Although this spectral amplitude quantization method was designed for use in an MBE speech coder, the quantization techniques are equally useful in a number of different speech coding methods, such as the Sinusoidal Transform Coder and the Harmonic Coder. For a particular speech segment, $L$ denotes the number of spectral amplitudes in that segment. The value of $L$ is derived from the fundamental frequency, $\omega_0$, according to the relationship,

$$L = \left\lfloor \frac{\beta \pi}{\omega_0} \right\rfloor \qquad (1)$$

where $0 \le \beta \le 1.0$ determines the bandwidth relative to half the sampling rate. The function $\lfloor x \rfloor$, referred to in Equation (1), is equal to the largest integer less than or equal to $x$. The $L$ spectral amplitudes are denoted by $M_l$ for $1 \le l \le L$, where $M_1$ is the lowest frequency spectral amplitude and $M_L$ is the highest frequency spectral amplitude.
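As a small worked example, the number of spectral amplitudes can be computed from the pitch. The sketch assumes that Equation (1) has the form L = floor(beta * pi / omega_0), with omega_0 in radians per sample, which is consistent with the description of beta and of the floor function above. The function name and the 150 Hz example pitch are hypothetical.

```python
import math

def num_spectral_amplitudes(omega0, beta=1.0):
    """Number of harmonics below beta * pi, assuming L = floor(beta * pi / omega0)."""
    if not 0.0 < omega0 <= math.pi:
        raise ValueError("omega0 must lie in (0, pi] radians per sample")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return math.floor(beta * math.pi / omega0)

fs = 8000.0
pitch_hz = 150.0                           # example fundamental frequency
omega0 = 2.0 * math.pi * pitch_hz / fs     # radians per sample
L_full = num_spectral_amplitudes(omega0)               # full band up to fs / 2
L_reduced = num_spectral_amplitudes(omega0, beta=0.95) # slightly reduced bandwidth
```

A lower pitch gives a smaller omega_0 and therefore more harmonics, which is why the number of amplitudes changes from segment to segment.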
The spectral amplitudes for the current speech segment are quantized by first calculating a set of prediction residuals which indicate the amount the spectral amplitudes have changed between the current speech segment and the previous speech segment. If $L^0$ denotes the number of spectral amplitudes in the current speech segment and $L^{-1}$ denotes the number of spectral amplitudes in the previous speech segment, then the prediction residuals, $T_l$ for $1 \le l \le L^0$, are given by, ##EQU2## where $M_l^0$ denotes the spectral amplitudes of the current speech segment and $M_l^{-1}$ denotes the quantized spectral amplitudes of the previous speech segment. The constant $\gamma$ is typically equal to 0.7; however, any value in the range $0 \le \gamma \le 1$ can be used.
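The idea of the prediction residual can be sketched as below. Equation (2) is not reproduced in this text, so the sketch assumes the simplest form consistent with the description: each residual is the current amplitude minus gamma times the previous quantized amplitude at the same index. Treating a missing previous amplitude (an index above the previous segment's count) as zero is a further assumption, as are the function names.

```python
def prediction_residuals(current, previous_quantized, gamma=0.7):
    """Assumed form: T_l = M_l(current) - gamma * M_l(previous, quantized)."""
    residuals = []
    for l, m_cur in enumerate(current):
        # Assumption: no prediction is available beyond the previous segment's length.
        m_prev = previous_quantized[l] if l < len(previous_quantized) else 0.0
        residuals.append(m_cur - gamma * m_prev)
    return residuals

def reconstruct_amplitudes(residuals, previous_quantized, gamma=0.7):
    """Inverse of the assumed prediction: add the scaled previous amplitudes back."""
    amplitudes = []
    for l, t in enumerate(residuals):
        m_prev = previous_quantized[l] if l < len(previous_quantized) else 0.0
        amplitudes.append(t + gamma * m_prev)
    return amplitudes

current = [1.0, 2.0, 3.0]        # three amplitudes in the current segment
previous = [1.0, 1.0]            # two quantized amplitudes in the previous segment
T = prediction_residuals(current, previous)
restored = reconstruct_amplitudes(T, previous)
```

Using the quantized previous amplitudes, rather than the unquantized ones, means the decoder can form exactly the same prediction as the encoder.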
The prediction residuals are divided into blocks of $K$ elements, where the value of $K$ is typically in the range $4 \le K \le 12$. If $L$ is not evenly divisible by $K$, then the highest frequency block will contain fewer than $K$ elements. This is shown in FIG. 2 for $L = 34$ and $K = 8$.
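The block division for the case L = 34 and K = 8 can be reproduced with a few lines. The function name is hypothetical and the residual values are placeholders.

```python
def split_into_blocks(residuals, k):
    """Split residuals into consecutive blocks of k elements; the last may be shorter."""
    return [residuals[i:i + k] for i in range(0, len(residuals), k)]

residuals = [float(l) for l in range(1, 35)]   # placeholder values for L = 34
blocks = split_into_blocks(residuals, 8)
block_lengths = [len(b) for b in blocks]       # four full blocks and one of length 2
```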
Each of the prediction residual blocks is then transformed using a Discrete Cosine Transform (DCT) defined by, ##EQU3## The length of the transform for each block, J, is equal to the number of elements in the block. Therefore, all but the highest frequency block are transformed with a DCT of length K, while the length of the DCT for the highest frequency block is less than or equal to K. Since the DCT is an invertible transform, the L DCT coefficients completely specify the spectral amplitude prediction residuals for the current segment.
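A per-block transform can be sketched as follows. The DCT definition itself is not reproduced in this text, so the sketch assumes a DCT-II with 1/J scaling, c(j) = (1/J) * sum over i of x(i) * cos(pi * (2i + 1) * j / (2J)); the scaling convention and the function names are assumptions.

```python
import math

def dct_block(x):
    """Assumed DCT-II with 1/J scaling; J is the number of elements in the block."""
    J = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * j / (2 * J))
                for i in range(J)) / J
            for j in range(J)]

def transform_blocks(blocks):
    """Apply a DCT of matching length to every block, including a short last block."""
    return [dct_block(block) for block in blocks]

blocks = [[1.0, 1.0, 1.0, 1.0], [2.0, 4.0]]
coefficients = transform_blocks(blocks)
```

With this scaling the first coefficient of each block is the block mean, and a constant block has all other coefficients equal to zero. The total number of coefficients equals the total number of residuals.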
The total number of bits available for quantizing the spectral amplitudes is divided among the DCT coefficients according to a bit allocation rule. This rule attempts to give more bits to the perceptually more important low-frequency blocks than to the perceptually less important high-frequency blocks. In addition, the bit allocation rule divides the bits within a block among the DCT coefficients according to their relative long-term variances. This approach matches the bit allocation with the perceptual characteristics of speech and with the quantization properties of the DCT.
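The specific bit allocation rule is not given above. As an illustration of variance-driven allocation only, the sketch below uses a generic greedy scheme from the quantization literature, which is not the coder's actual rule: each bit goes to the coefficient whose remaining quantization error variance is largest, assuming each added bit reduces that variance by a factor of four. The function name, the example variances, and the 16-bit cap are assumptions.

```python
def allocate_bits(variances, total_bits, max_bits=16):
    """Greedy allocation: give each bit to the coefficient with the largest
    remaining error variance, modeled as variance / 4**bits."""
    bits = [0] * len(variances)
    for _ in range(total_bits):
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        best = max(candidates, key=lambda i: variances[i] / 4.0 ** bits[i])
        bits[best] += 1
    return bits

# Hypothetical long-term variances for three DCT coefficients.
allocation = allocate_bits([16.0, 4.0, 1.0], total_bits=6)
```

A perceptual preference for low-frequency blocks could be modeled by scaling the variances of those blocks before calling the function; that weighting is likewise an assumption.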
Each DCT coefficient is quantized using the number of bits specified by the bit allocation rule. Typically, uniform quantization is used; however, non-uniform or vector quantization can also be used. The step size for each quantizer is determined from the long-term variance of the DCT coefficients and from the number of bits used to quantize each coefficient. Table 1 shows the typical variation in the step size as a function of the number of bits, for a long-term variance equal to $\sigma^2$.
TABLE 1
Step Size of Uniform Quantizers
______________________________________
Number of Bits      Step Size
______________________________________
 1                  $1.2\sigma$
 2                  $0.85\sigma$
 3                  $0.65\sigma$
 4                  $0.42\sigma$
 5                  $0.28\sigma$
 6                  $0.14\sigma$
 7                  $0.07\sigma$
 8                  $0.035\sigma$
 9                  $0.0175\sigma$
10                  $0.00875\sigma$
11                  $0.00438\sigma$
12                  $0.00219\sigma$
13                  $0.00110\sigma$
14                  $0.000550\sigma$
15                  $0.000275\sigma$
16                  $0.000138\sigma$
______________________________________
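A uniform quantizer driven by the step sizes of Table 1 might look like the following sketch. The mid-rise layout (reconstruction levels at odd multiples of half a step, symmetric about zero) and the clamping of out-of-range values are assumptions, as are the function names; only the first eight table entries are included for brevity.

```python
import math

# Step size as a multiple of sigma, taken from Table 1 (bits 1 through 8).
STEP_SIZE = {1: 1.2, 2: 0.85, 3: 0.65, 4: 0.42,
             5: 0.28, 6: 0.14, 7: 0.07, 8: 0.035}

def quantize(value, bits, sigma):
    """Return an integer index in [0, 2**bits - 1] (assumed mid-rise quantizer)."""
    step = STEP_SIZE[bits] * sigma
    levels = 2 ** bits
    index = math.floor(value / step) + levels // 2
    return min(max(index, 0), levels - 1)      # clamp values outside the range

def dequantize(index, bits, sigma):
    """Map an index back to the center of its quantization interval."""
    step = STEP_SIZE[bits] * sigma
    levels = 2 ** bits
    return (index - levels // 2 + 0.5) * step

index = quantize(0.3, bits=3, sigma=1.0)
value_hat = dequantize(index, bits=3, sigma=1.0)
```

For values inside the quantizer range, the reconstruction error is at most half a step; values outside the range are clamped to the outermost level.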
Once each DCT coefficient has been quantized using the number of bits specified by the bit allocation rule, the binary representation can be transmitted, stored, etc., depending on the application. The spectral amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining with the quantized spectral amplitudes of the previous segment using the inverse of Equation (2). The inverse DCT is given by,
where the length, $J$, for each block is chosen to be the number of elements in that block, and $\alpha(j)$ is given by, ##EQU4##
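The invertibility of the block transform can be checked numerically. The inverse DCT and the definition of alpha(j) are not reproduced in this text, so the sketch assumes a forward DCT-II with 1/J scaling and a matching inverse in which alpha(0) = 1 and alpha(j) = 2 for j > 0. These conventions are assumptions chosen to be mutually consistent, and the function names are hypothetical.

```python
import math

def dct(x):
    """Assumed forward transform: DCT-II with 1/J scaling."""
    J = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * j / (2 * J))
                for i in range(J)) / J
            for j in range(J)]

def alpha(j):
    """Assumed weighting that pairs with the 1/J forward scaling."""
    return 1.0 if j == 0 else 2.0

def inverse_dct(c):
    """Assumed inverse: x(i) = sum over j of alpha(j) * c(j) * cos(pi*(2i+1)*j/(2J))."""
    J = len(c)
    return [sum(alpha(j) * c[j] * math.cos(math.pi * (2 * i + 1) * j / (2 * J))
                for j in range(J))
            for i in range(J)]

block = [0.3, -1.2, 0.8, 0.05, 2.0]     # arbitrary residual block of length J = 5
restored = inverse_dct(dct(block))
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Without quantization the round trip reproduces the block up to floating-point rounding, so any reconstruction error in the coder comes from the quantizers rather than from the transform.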
One potential problem with the 4800 bps MBE speech coder is that the perceived quality of the reconstructed speech may be significantly reduced if bit errors are added to the binary representation of the MBE model parameters. Since bit errors exist in many speech coder applications, a robust speech coder must be able to correct, detect and/or tolerate bit errors. One technique which has been found to be very successful is to use error correction codes in the binary representation of the model parameters. Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The estimate of the error rate can then be used to adaptively process the model parameters to reduce the effect of any remaining bit errors. Typically, the error rate is estimated by counting the number of errors corrected (or detected) by the error correction codes in the current segment, and then using this information to update the current estimate of the error rate. For example, if each segment contains a (23,12) Golay code which can correct three errors out of the 23 bits, and $\epsilon_T$ denotes the number of errors (0-3) which were corrected in the current segment, then the current estimate of the error rate, $\epsilon_R$, is updated according to: ##EQU5## where $\beta$ is a constant in the range $0 \le \beta \le 1$ which controls the adaptability of $\epsilon_R$.
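The running error-rate estimate can be sketched as a first-order recursive average. The update equation is not reproduced in this text, so the form used here, new estimate = beta * old estimate + (1 - beta) * errors / 23, is an assumption consistent with the description of beta as controlling adaptability. The function name and the example value of beta are also assumptions.

```python
def update_error_rate(eps_r, eps_t, beta=0.95, code_length=23):
    """Assumed update: blend the old estimate with this segment's error fraction.

    eps_r: current error-rate estimate
    eps_t: number of errors corrected in the current segment (0-3 for the Golay code)
    beta:  close to 1 adapts slowly, close to 0 adapts quickly
    """
    return beta * eps_r + (1.0 - beta) * eps_t / code_length

# One corrected error in every segment drives the estimate toward 1/23.
rate = 0.0
for _ in range(200):
    rate = update_error_rate(rate, eps_t=1, beta=0.9)
```

Under this assumed form, an error-free channel leaves a zero estimate at zero, and a steady error count makes the estimate converge to the per-bit error fraction of the code word.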