While speech is analog in nature, it is often necessary to transmit it over a digital communications channel or to store it on a digital medium. In this case, the speech signal must be sampled and encoded by one of a variety of techniques. Each encoding technique has an associated decoder that synthesizes or reconstructs the speech from the transmitted or stored values. The combination of an encoder and decoder is often referred to as a codec or coder.
There are many well-known techniques in the art of speech coding. These fall broadly into two categories: waveform coding and parametric coding. Waveform coders attempt to quantize and encode the speech signal itself. These techniques are used in most modern public telephone networks and produce high-quality speech at relatively low complexity. However, waveform coders are not particularly efficient, meaning that a relatively large amount of information must be transmitted or stored to achieve a desired quality in the reconstructed speech. This may not be acceptable in some applications where transmission bandwidth or storage capacity is limited.
In general, parametric coders are able to produce a desired speech quality at lower information (or "bit") rates than waveform coders. Each type of parametric coder assumes a particular model for the speech signal, with the model consisting of a number of parameters. In most cases, the parametric model is highly optimized to human speech. The parametric coder receives samples of the speech signal, fits the samples to the model, then quantizes and encodes the values for the model parameters. Transmitting parameter values rather than waveform values enables the efficient operation of parametric coders. However, the optimization of the model for voice can create problems when signals other than or in addition to voice are present. For instance, many parametric coders produce annoying audible artifacts when presented with background noise from a car environment.
Since these artifacts in the reconstructed speech may be unacceptable to a listener, measures must be taken to eliminate or at least mitigate the background noise. One approach is to use a noise suppressor device as a preprocessor to the speech encoder. The noise suppressor receives samples of the noisy speech signal from a microphone or other device, processes the samples, then outputs the speech samples with reduced levels of the background noise. The output samples are in the time domain, and thus can be input to the speech encoder or sent directly to a digital-to-analog converter (DAC) device to synthesize audible speech.
One common method for noise suppression is spectral subtraction, in which models of the background noise and of the composite (or speech-plus-noise) signal are used to construct a linear noise suppression filter. These models typically are maintained in the frequency domain as power spectral densities (PSDs). The noise and composite models are updated when speech is absent and present, respectively, as indicated by a voice activity detector (VAD). The noise suppression input samples are transformed to the frequency domain, the noise suppression filter is applied, and the samples are transformed back to the time domain before being output to the speech encoder or DAC.
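The per-frame processing described above can be sketched as follows. This is a minimal illustration of the spectral subtraction idea, not any particular implementation; the smoothing constant, the gain floor, and the simple Wiener-style gain rule are illustrative assumptions, and the VAD itself is assumed to be supplied externally.

```python
import numpy as np

def spectral_subtract(frame, noise_psd, window, floor=0.1):
    """Suppress noise in one analysis frame via spectral subtraction.

    frame     -- time-domain samples of the noisy speech (one frame)
    noise_psd -- current background-noise PSD estimate (one value per FFT bin)
    window    -- analysis window (e.g. Hann), same length as frame
    floor     -- illustrative gain floor to limit "musical noise" artifacts
    """
    spec = np.fft.rfft(frame * window)           # transform to frequency domain
    psd = np.abs(spec) ** 2                      # composite (speech+noise) PSD
    # Suppression gain built from the two PSD models; phase is left untouched.
    gain = np.maximum(1.0 - noise_psd / np.maximum(psd, 1e-12), floor)
    return np.fft.irfft(gain * spec, n=len(frame))  # back to the time domain

def update_noise_psd(noise_psd, frame, window, speech_present, alpha=0.98):
    """Update the noise model only when the VAD indicates speech is absent."""
    if speech_present:
        return noise_psd
    psd = np.abs(np.fft.rfft(frame * window)) ** 2
    return alpha * noise_psd + (1.0 - alpha) * psd  # exponential smoothing
```

Note that the output phase equals the input phase; only the magnitudes are attenuated, which is the property exploited later in this description.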
Parametric voice encoders can be further divided into time-domain and frequency-domain types. Most time-domain parametric encoders are based on a model containing linear prediction coefficients (LPCs). A representative frequency-domain type is the Multi-Band Excitation (MBE) encoder, which includes the well-known IMBE™ and AMBE™ methods. MBE-class encoders utilize a frequency-domain model that includes parameters such as the fundamental frequency (or pitch), a set of spectral magnitudes evaluated at the fundamental and its harmonics, and a set of Boolean values classifying the energy as voiced or unvoiced in each frequency band. Typically, there is a one-to-one correspondence between the respective spectral magnitudes and voiced/unvoiced decisions. MBE-class encoders compute values for the parameters by analysis of a group or frame of samples of the speech signal. The parameter values are then quantized and encoded for transmission or storage.
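The MBE-style parameter set described above can be illustrated with a toy analysis routine. This is not the IMBE or AMBE algorithm: the fundamental frequency is assumed to be given rather than estimated, and the per-harmonic voicing rule (magnitude above the mean bin magnitude) is a placeholder, whereas real MBE coders use a per-band synthetic-spectrum error measure.

```python
import numpy as np

def mbe_analyze(frame, f0, fs, window):
    """Toy MBE-style analysis of one frame (illustrative only).

    frame  -- time-domain samples of one analysis frame
    f0     -- assumed fundamental frequency in Hz (pitch estimation omitted)
    fs     -- sampling rate in Hz
    window -- analysis window, same length as frame

    Returns (magnitudes, voiced): spectral magnitudes sampled at f0 and its
    harmonics, and a one-to-one set of voiced/unvoiced Boolean decisions.
    """
    spec = np.abs(np.fft.rfft(frame * window))
    bin_hz = fs / len(frame)                 # FFT frequency resolution
    n_harm = int((fs / 2) // f0)             # harmonics below Nyquist
    mags, voiced = [], []
    for k in range(1, n_harm + 1):
        b = int(round(k * f0 / bin_hz))      # nearest FFT bin to k*f0
        mags.append(spec[b])
        voiced.append(spec[b] > spec.mean()) # placeholder voicing decision
    return np.array(mags), np.array(voiced)
```

A real encoder would then quantize and encode `f0`, the magnitudes, and the voicing decisions for transmission; note that no phase information enters the model.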
Close inspection reveals clear similarities between spectral subtraction techniques and frequency-domain voice encoders such as the MBE class described above. Both utilize frequency-domain models; in fact, these models may be very similar depending on the frequencies at which they are evaluated and the model format. Also, both functions disregard the phase of the input signal. The phases of the spectral subtraction input and output are identical, while the frequency-domain decoder may impose arbitrary phase since this information is not carried in the transmitted model parameters. Finally, both may utilize a VAD, since it may be advantageous to operate the encoder in discontinuous transmission (DTX) mode. The object of the present invention is to exploit these similarities by incorporating spectral subtraction noise suppression in a frequency-domain speech encoder. Such a technique or device has significantly lower complexity than implementing the noise suppressor as a speech encoder preprocessor.
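The complexity saving can be sketched by combining the two operations around a single forward transform. In the sketch below, the suppression gain is applied directly to the magnitude spectrum and the suppressed magnitudes are then sampled at the harmonics of the pitch, so the inverse transform and the encoder's second forward transform are both eliminated. The gain rule, gain floor, and function names are illustrative assumptions, not the claimed method itself.

```python
import numpy as np

def encode_with_suppression(frame, noise_psd, f0, fs, window, floor=0.1):
    """Sketch: noise suppression folded into a frequency-domain encoder.

    One FFT serves both functions. Because the encoder discards phase,
    no inverse transform back to the time domain is needed.
    """
    spec = np.fft.rfft(frame * window)       # single shared FFT
    psd = np.abs(spec) ** 2
    # Spectral-subtraction gain from the noise and composite PSD models.
    gain = np.maximum(1.0 - noise_psd / np.maximum(psd, 1e-12), floor)
    clean_mag = gain * np.abs(spec)          # suppressed magnitude spectrum
    # Sample the suppressed magnitudes at the harmonics of f0 for encoding.
    bin_hz = fs / len(frame)
    n_harm = int((fs / 2) // f0)
    return np.array([clean_mag[int(round(k * f0 / bin_hz))]
                     for k in range(1, n_harm + 1)])
```

Compared with the preprocessor arrangement, this saves an inverse FFT in the suppressor and a forward FFT in the encoder per frame, which is the source of the lower complexity noted above.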