This invention relates to methods for preserving the quality of speech or other acoustic signals when transmitted over a noisy channel.
Relevant publications include: J. L. Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pp. 378-386, (discusses phase vocoder, a frequency-based speech analysis-synthesis system); Quatieri, et al., "Speech Transformations Based on a Sinusoidal Representation", IEEE TASSP, Vol. ASSP-34, No. 6, Dec. 1986, pp. 1449-1464, (discusses analysis-synthesis technique based on a sinusoidal representation); Griffin, "Multiband Excitation Vocoder", Ph.D. Thesis, M.I.T., 1987, (discusses an 8000 bps Multi-Band Excitation speech coder); Griffin, et al., "A High Quality 9.6 kbps Speech Coding System", Proc. ICASSP 86, pp. 125-128, Tokyo, Japan, Apr. 13-20, 1986, (discusses a 9600 bps Multi-Band Excitation speech coder); Griffin, et al., "A New Model-Based Speech Analysis/Synthesis System", Proc. ICASSP 85, pp. 513-516, Tampa, Fla., Mar. 26-29, 1985, (discusses Multi-Band Excitation speech model); Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder", S. M. Thesis, M.I.T., May 1988, (discusses a 4800 bps Multi-Band Excitation speech coder); McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech", Proc. ICASSP 85, pp. 945-948, Tampa, Fla., Mar. 26-29, 1985, (discusses the sinusoidal transform speech coder); Campbell et al., "The New 4800 bps Voice Coding Standard", Mil Speech Tech Conference, Nov. 1989, (discusses error correction in a U.S. Government speech coder); Campbell et al., "CELP Coding for Land Mobile Radio Applications", Proc. ICASSP 90, pp. 465-468, Albuquerque, N.M., Apr. 3-6, 1990, (discusses error correction in a U.S. Government speech coder); Levesque et al., Error-Control Techniques for Digital Communication, Wiley, 1985, (discusses error correction in general); Lin et al., Error Control Coding, Prentice-Hall, 1983, (discusses error correction in general); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984, (discusses speech coding in general); Digital Voice Systems, Inc., "INMARSAT-M Voice Coder", Version 1.9, Nov. 18, 1992, (discusses 6.4 kbps IMBE™ speech coder for INMARSAT-M standard); Digital Voice Systems, Inc., "APCO/NASTD/Fed Project 25 Vocoder Description", Version 1.0, Dec. 1, 1992, (discusses 7.2 kbps IMBE™ speech coder for APCO/NASTD/Fed Project 25 standard) (attached hereto as Appendix A). The contents of these publications (including Appendix A) are incorporated herein by reference.
The problem of reliably transmitting digital data over noisy communication channels arises in a large number of applications, and as a result has received considerable attention in the literature. Traditional digital communication systems have relied upon error correction and detection methods to transmit digital data reliably over noisy channels. Sophisticated error coding techniques have been developed to systematically correct and detect bit errors which are introduced by the channel. Examples of commonly used error control codes (ECCs) include Golay codes, Hamming codes, BCH codes, CRC codes, convolutional codes and Reed-Solomon codes. These codes all function by converting a set of information bits into a larger number of bits which are then transmitted across the channel. The increase in the number of bits can be viewed as a form of redundancy which enables the receiver to correct and/or detect up to a certain number of bit errors. In traditional ECC methods the number of bit errors which can be corrected or detected is a function of the amount of redundancy which is added to the data. This results in a tradeoff between reliability (the number of bit errors which can be corrected) and usable data rate (the amount of information which can be transmitted after leaving room for redundancy). The digital communication designer typically performs a sophisticated system analysis to determine the best compromise between these two competing requirements.
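The way redundancy buys error correction can be seen in a small example. The following is a minimal sketch of a Hamming(7,4) code, chosen here only because it is the shortest of the code families listed above; the function names and the bit layout are assumptions made for illustration and are not taken from any of the cited coders. Four information bits become seven channel bits, and the three added bits allow the receiver to correct any single bit error.

```python
# Illustrative Hamming(7,4) code: 4 information bits -> 7 channel bits.
# Assumed bit layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.

def hamming74_encode(data_bits):
    """Encode a list of 4 data bits into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one bit error and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of a single error, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


sent = hamming74_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1  # the channel flips one bit
recovered = hamming74_decode(received)
```

In this sketch the usable data rate is 4/7 of the channel rate. Two or more bit errors in one codeword exceed what the code can correct, which is the residual-error situation the traditional approach must then address with further redundancy.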
The reliable transmission of speech or other acoustic signals over a communication channel is a related problem which is made more complicated by the need to first convert the analog acoustic signal into a digital representation. This is often done by digitizing the analog signal with an A-to-D converter. In the case of speech, where an 8 bit A-to-D converter may sample the signal at a rate of 8 kHz, the digital representation would require 64 kbps. If additional, redundant information must be added prior to transmission across the channel, then the required channel data rate would be significantly greater than 64 kbps. For example, if the channel requires 50% redundancy for reliable transmission, then the required channel data rate would be 64+32=96 kbps. Unfortunately this data rate is beyond what is practical in many digital communication systems. Consequently some method for reducing the size of the digital representation is needed. This task, commonly referred to as "compression", is performed by a signal coder. In the case of speech or other acoustic signals, a system of this type is often referred to as a speech coder, voice coder, or simply a vocoder.
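The rate arithmetic above can be restated as a short calculation. This is only a sketch; the function name and the convention of expressing redundancy as a fraction of the source rate are assumptions introduced here.

```python
def channel_rate_bps(bits_per_sample, sample_rate_hz, redundancy_fraction):
    """Return (source rate, channel rate) in bits per second.

    redundancy_fraction is the added redundancy relative to the source
    rate, e.g. 0.5 for the 50% redundancy example in the text.
    """
    source_rate = bits_per_sample * sample_rate_hz
    channel_rate = source_rate * (1 + redundancy_fraction)
    return source_rate, channel_rate


# 8 bit samples at 8 kHz with 50% redundancy.
source_rate, channel_rate = channel_rate_bps(8, 8000, 0.5)
print(source_rate, channel_rate)  # 64000 96000.0
```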
A modern speech coder performs a sophisticated analysis on the input signal, which can be viewed as either an analog signal or the output of an A-to-D converter. The result of this analysis is a compressed digital representation which may be as low as 100 bps. The actual compressed rate which is achieved is generally a function of the desired fidelity (i.e. speech quality) and the type of speech coder which is employed. Different types of speech coders have been designed to operate at high rates (16-64 kbps), mid-rates (2-16 kbps) and low rates (0-2 kbps). Recently, mid-rate speech coders have been the subject of renewed interest due to the increase in mobile communication services (cellular, satellite telephony, land mobile radio, in-flight phones, etc.). These applications typically require high quality speech at mid-rates. In addition, these applications are all subject to significant channel degradations, including high bit error rates (BER) of 1-10% and multipath fading. (Note that the problem of bit errors is present to some extent in all digital communication and storage applications. The mobile communication example is presented due to the severity of the problem in the mobile environment.)
As discussed above, there are numerous speech coding methods which have been employed in the past. One class of speech coders which has been extensively studied and used in practice is based on an underlying model of speech. Examples from this class of vocoders include linear prediction vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders, improved multi-band excitation speech coders and channel vocoders. In these vocoders, speech is characterized on a short-time basis through a set of model parameters. The model parameters typically consist of some combination of voiced/unvoiced decisions, voiced/unvoiced probability measure, pitch period, fundamental frequency, gain, spectral envelope parameters and residual or error parameters. For this class of speech coders, speech is analyzed by first segmenting speech using a window such as a Hamming window. Then, for each segment of speech, the model parameters are estimated and quantized.
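The segmentation step can be sketched as follows. The frame length of 160 samples (20 ms at 8 kHz), the hop of 80 samples and the function names are assumptions chosen for illustration; the text does not specify them, and an actual coder may use a different window and frame spacing.

```python
import math


def hamming_window(length):
    """Standard Hamming window coefficients for a frame of the given length."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def windowed_frames(signal, frame_len, hop):
    """Split signal into overlapping frames and apply a Hamming window to each."""
    window = hamming_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


# Hypothetical input: 50 ms of a 200 Hz tone sampled at 8 kHz.
speech = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
frames = windowed_frames(speech, frame_len=160, hop=80)
```

Each entry of frames is one short-time segment from which a model-based coder would go on to estimate and quantize its parameters.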
In noisy digital communication systems, the traditional approach is to protect the quantized model parameters with some form of ECC. The redundant information associated with the ECC is used by the receiver to correct and/or detect bit errors introduced by the channel. The receiver then reconstructs the model parameters and synthesizes a digital speech signal which is suitable for playback through a D-to-A converter and a speaker. The inclusion of error control capability allows the receiver to reduce the distortion and other artifacts which would be introduced into the synthesized speech due to the presence of bit errors in the received data. Unfortunately, with any error control code, there is some probability that too many errors will be introduced for the receiver to correct. In this case the remaining bit errors will affect the reconstruction of the model parameters and possibly introduce significant degradations into the synthesized speech. This problem can be lessened either by including additional error control codes, or by including additional error detection capability which can detect errors which cannot be corrected. These traditional approaches require additional redundancy and hence further increase the channel data rate which is required to transmit a fixed amount of information. This requirement is a disadvantage, since in most applications it is desirable to minimize the total number of bits which are transmitted (or stored).
The invention described herein applies to many different digital communication systems, some of which contain speech coders. Examples of speech coders which may be contained in such a communication system include but are not limited to linear predictive speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders and improved multi-band excitation (IMBE™) speech coders. For the purpose of describing the details of this invention, we have focused on a digital communication system containing the IMBE™ speech coder. This particular speech coder has been standardized at 6.4 kbps for use over the INMARSAT-M (International Maritime Satellite Organization) and OPTUS Mobilesat satellite communication systems, and has been selected at 7.2 kbps for use in the APCO/NASTD/Fed Project 25 North American land mobile radio standard.
The IMBE™ coder uses a robust speech model which is referred to as the Multi-Band Excitation (MBE) speech model. The MBE speech model was developed by Griffin and Lim in 1984. This model uses a more flexible representation of the speech signal than traditional speech models. As a consequence it is able to produce more natural-sounding speech, and it is more robust to the presence of acoustic background noise. These properties have caused the MBE speech model to be used extensively for high quality mid-rate speech coding.
Let $s(n)$ denote a discrete speech signal obtained by sampling an analog speech signal. In order to focus attention on a short segment of speech over which the model parameters are assumed to be constant, the signal $s(n)$ is multiplied by a window $w(n)$ to obtain a windowed speech segment or frame, $s_w(n)$. The speech segment $s_w(n)$ is modelled as the response of a linear filter $h_w(n)$ to some excitation signal $e_w(n)$. Therefore, $S_w(\omega)$, the Fourier Transform of $s_w(n)$, can be expressed as

$$S_w(\omega) = H_w(\omega)\,E_w(\omega) \tag{1}$$

where $H_w(\omega)$ and $E_w(\omega)$ are the Fourier Transforms of $h_w(n)$ and $e_w(n)$, respectively. The spectrum $H_w(\omega)$ is often referred to as the spectral envelope of the speech segment.
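The relation in equation (1) can be checked numerically on a toy frame. The following is a minimal sketch that uses a naive discrete Fourier transform and circular convolution as a finite-length stand-in for the filter response; the 16-sample frame, the impulse-train excitation with a period of 4 samples, and the three-tap filter are all hypothetical values chosen for illustration.

```python
import cmath


def dft(x):
    """Naive N-point discrete Fourier transform."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def circular_convolve(h, e):
    """Response of filter h to excitation e with periodic wrap-around."""
    n_pts = len(e)
    return [
        sum(h[m] * e[(n - m) % n_pts] for m in range(n_pts))
        for n in range(n_pts)
    ]


# Hypothetical excitation: impulse train with a pitch period of 4 samples.
e_w = [1.0 if n % 4 == 0 else 0.0 for n in range(16)]
# Hypothetical short filter, zero-padded to the frame length.
h_w = [1.0, 0.5, 0.25] + [0.0] * 13

s_w = circular_convolve(h_w, e_w)
S_w, H_w, E_w = dft(s_w), dft(h_w), dft(e_w)

# Largest deviation from S_w = H_w * E_w over all frequency bins.
max_error = max(abs(S_w[k] - H_w[k] * E_w[k]) for k in range(16))
```

In this toy case the excitation spectrum is non-zero only at multiples of the fundamental, so the segment spectrum is the spectral envelope sampled at the harmonics, and max_error is at the level of floating-point rounding.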
In traditional speech models speech is divided into two classes depending upon whether the signal is mostly periodic (voiced) or mostly noise-like (unvoiced). For voiced speech the excitation signal is a periodic impulse sequence, where the distance between impulses is the pitch period. For unvoiced speech the excitation signal is a white noise sequence.
Each speech segment is therefore classified as either entirely voiced or entirely unvoiced. In contrast, the MBE speech model divides the excitation spectrum into a number of non-overlapping frequency bands and makes a voiced or unvoiced (V/UV) decision for each frequency band. This approach allows the excitation signal for a particular speech segment to be a mixture of periodic (voiced) energy and aperiodic (unvoiced) energy.
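The per-band decision can be pictured with a small sketch that labels each harmonic of the fundamental as periodic or noise-like according to the V/UV decision of the band containing it. The grouping of three harmonics per band, the function name and the example values are assumptions made for illustration; the text above does not specify how band edges are chosen.

```python
def excitation_layout(f0_hz, bandwidth_hz, band_voiced, harmonics_per_band=3):
    """Label each harmonic of f0 up to bandwidth_hz as periodic or noise-like.

    band_voiced[b] is the V/UV decision for band b. Harmonics beyond the
    last listed band reuse the last decision (an assumption of this sketch).
    """
    num_harmonics = int(bandwidth_hz // f0_hz)
    layout = []
    for harmonic in range(1, num_harmonics + 1):
        band = min((harmonic - 1) // harmonics_per_band, len(band_voiced) - 1)
        kind = "periodic" if band_voiced[band] else "noise"
        layout.append((harmonic, harmonic * f0_hz, kind))
    return layout


# Hypothetical frame: 200 Hz fundamental, four bands, third band unvoiced.
layout = excitation_layout(200.0, 2400.0, [True, True, False, True])
```

With these example values harmonics 7 to 9 (1400 to 1800 Hz) are treated as noise-like while the rest are periodic, a mixture that a single voiced/unvoiced decision for the whole segment cannot represent.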
This added flexibility in the modelling of the excitation signal allows the MBE speech model to produce high quality speech and to be robust to the presence of background noise.
Speech coders based on the MBE speech model estimate a set of model parameters for each segment of speech. The MBE model parameters consist of a fundamental frequency, a set of V/UV decisions which characterize the excitation signal, and a set of spectral amplitudes which characterize the spectral envelope. Once the MBE model parameters have been estimated for each segment, they are quantized, protected with ECC and transmitted to the decoder. The decoder then performs error control decoding to correct and/or detect bit errors. The resulting bits are then used to reconstruct the MBE model parameters, which are in turn used to synthesize a speech signal suitable for playback through a D-to-A converter and a conventional speaker.