There are many environments where noisy conditions interfere with speech, such as the inside of a car, a street, or a busy office. The severity of background noise varies from the gentle hum of a fan inside a computer to a cacophonous babble in a crowded cafe. This background noise not only directly interferes with a listener's ability to understand a speaker's speech, but can also cause further unwanted distortions if the speech is encoded or otherwise processed. Speech enhancement is an effort to process the noisy speech for the benefit of the intended listener, whether that listener is a human, a speech recognition module, or some other system. For a human listener, it is desirable to increase the perceptual quality and intelligibility of the perceived speech, so that the listener understands the communication with minimal effort and fatigue.
It is usually the case that for a given speech enhancement scheme, a tradeoff must be made between the amount of noise removed and the distortion introduced as a side effect. If too much noise is removed, the resulting distortion can result in listeners preferring the original noise scenario to the enhanced speech. Preferences are based on more than just the energy of the noise and distortion: unnatural sounding distortions become annoying to humans when just audible, while a certain elevated level of "natural sounding" background noise is well tolerated. Residual background noise also serves to perceptually mask slight distortions, making its removal even more troublesome.
Speech enhancement can be broadly defined as the removal of additive noise from a corrupted speech signal in an attempt to increase the intelligibility or quality of speech. In most speech enhancement techniques, the noise and speech are generally assumed to be uncorrelated. Single channel speech enhancement is the simplest scenario, where only one version of the noisy speech is available, which is typically the result of recording someone speaking in a noisy environment with a single microphone.
FIG. 1 illustrates a speech enhancement setup for N noise sources for a single-channel system. For the single channel case illustrated in FIG. 1, exact reconstruction of the clean speech signal is usually impossible in practice, so speech enhancement algorithms must strike a balance between the amount of noise they attempt to remove and the degree of distortion that is introduced as a side effect. Since a noise component at the microphone cannot in general be attributed to a specific noise source, the sum of the responses at the microphone from all of the noise sources is denoted as a single additive noise term.
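The single additive noise term described above can be written out as follows. The symbols are illustrative assumptions and do not appear in the text or figures: y(t) is the signal observed at the microphone, s(t) is the clean speech, n_i(t) is the response at the microphone due to the i-th noise source, and n(t) is the combined noise term.

```latex
y(t) = s(t) + \sum_{i=1}^{N} n_i(t) = s(t) + n(t),
\qquad n(t) \triangleq \sum_{i=1}^{N} n_i(t)
```

Under this notation, the enhancement problem is to estimate s(t) from y(t) alone, without access to the individual n_i(t).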
Speech enhancement has a number of potential applications. In some cases, a human listener observes the output of the speech enhancement directly, while in others speech enhancement is merely the first stage in a communications channel and might be used as a preprocessor for a speech coder or speech recognition module. Such a variety of different application scenarios places very different demands on the performance of the speech enhancement module, so any speech enhancement scheme ought to be developed with the intended application in mind. Additionally, many well-known speech enhancement processes perform very differently with different speakers and noise conditions, making robustness in design a primary concern. Implementation issues such as delay and computational complexity are also considered.
Speech can be modeled as the output of an acoustic filter (i.e., the vocal tract) where the frequency response of the filter carries the message. Humans constantly change properties of the vocal tract to convey messages by changing the frequency response of the vocal tract.
The input signal to the vocal tract is a mixture of harmonically related sinusoids and noise. "Pitch" is the fundamental frequency of the sinusoids. "Formants" correspond to the resonant frequency(ies) of the vocal tract.
A speech coder works in the digital domain and is typically deployed after an analog-to-digital (A/D) converter, whose digitized speech output it processes. The speech coder breaks the speech into constituent parts on an interval-by-interval basis. Intervals are chosen based on the amount of compression or complexity of the digitized speech. The intervals are commonly referred to as frames or sub-frames. The constituent parts include: (a) gain components to indicate the loudness of the speech; (b) spectrum components to indicate the frequency response of the vocal tract, where the spectrum components are typically represented by linear prediction coefficients ("LPCs") and/or cepstral coefficients; and (c) excitation signal components, which include a sinusoidal or periodic part, from which pitch is captured, and a noise-like part.
To make the gain components, gain is measured for an interval to normalize speech into a typical range. This is important to be able to run a fixed point processor on the speech.
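A minimal sketch of per-interval gain measurement and normalization, written in Python for illustration. The function names, the use of root-mean-square level as the gain measure, and the `target_rms` value are assumptions made for this example; the text does not specify how gain is measured.

```python
import math

def frame_gain(frame):
    """Root-mean-square level of one frame of samples (assumed gain measure)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def normalize_frame(frame, target_rms=0.25, floor=1e-9):
    """Scale a frame to a target RMS level and return (scaled_frame, gain).

    The returned gain is the measured level of the original frame, which is
    the value a coder could store so that the scaling can later be undone.
    """
    gain = frame_gain(frame)
    if gain < floor:
        # Effectively silent frame: leave it unscaled to avoid dividing by ~0.
        return list(frame), 0.0
    scale = target_rms / gain
    return [x * scale for x in frame], gain

# Example: a very quiet 200 Hz tone, one 20 ms frame at 8 kHz (160 samples).
frame = [0.001 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
scaled, gain = normalize_frame(frame)
```

After normalization every non-silent frame sits at the same nominal level, which is what keeps the samples within the usable range of a fixed point representation.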
In the time domain, linear prediction coefficients (LPCs) are the weights of a linear sum of previous data used to predict the next datum. Cepstral coefficients can be determined from the LPCs, and vice versa. Cepstral coefficients can also be determined using a fast Fourier transform (FFT).
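The following Python sketch illustrates one common way to obtain LPCs and to convert them to cepstral coefficients. It assumes the autocorrelation method with the Levinson-Durbin recursion and an all-pole model 1 / (1 - sum_k a[k] z^-k); the text does not say which method is used, and all function names here are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor weights a[1..order] such that x_hat[n] = sum_k a[k] * x[n-k].

    Returns (weights, residual_energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model defined by a."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# Example: a decaying sequence whose next sample is half the previous one.
x = [0.5 ** n for n in range(40)]
r = autocorrelation(x, 2)
a, err = levinson_durbin(r, 2)
ceps = lpc_to_cepstrum(a, 3)
```

For this input the first predictor weight comes out very close to 0.5 and the second very close to 0, matching the way the sequence was built.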
The bandwidth of a telephone channel is limited to 3.5 kHz. Upper (higher-frequency) formants can be lost in coding.
Noise affects speech coding, and the spectrum analysis can be adversely affected. The speech spectrum is flattened out by noise, and formants can be lost in coding. Calculation of the LPC and the cepstral coefficients can be affected.
The excitation signal (or "residual signal") components are determined after, or separately from, the gain components and the spectrum components by breaking the speech into a periodic part (the fundamental frequency) and a noise part. The processor looks back one (pitch) period (1/F) of the fundamental frequency (F) of the vocal tract to take the pitch, and makes the noise part from white noise. A sinusoidal or periodic part and a noise-like part are thus obtained.
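As a rough illustration of the two excitation parts, the Python sketch below builds a periodic part by looking back one pitch period (1/F) into past excitation and a noise-like part from white noise. This is one simplified reading of the description, not the method of any particular coder; the function names, the gains, and the seeded random generator are assumptions introduced for the example.

```python
import random

def pitch_period_samples(f0_hz, sample_rate_hz):
    """Pitch period 1/F expressed in samples, rounded to the nearest integer."""
    return max(1, round(sample_rate_hz / f0_hz))

def build_excitation(history, period, n_samples,
                     pitch_gain=0.8, noise_gain=0.1, seed=0):
    """Return (periodic_part, noise_part, excitation) for n_samples samples.

    periodic_part[n] is a scaled copy of the excitation one pitch period
    earlier; noise_part[n] is scaled white (Gaussian) noise.
    """
    if len(history) < period:
        raise ValueError("history must cover at least one pitch period")
    rng = random.Random(seed)   # seeded only to make the example repeatable
    buf = list(history)
    periodic, noise = [], []
    for _ in range(n_samples):
        p = pitch_gain * buf[-period]          # look back one pitch period
        w = noise_gain * rng.gauss(0.0, 1.0)   # white-noise contribution
        periodic.append(p)
        noise.append(w)
        buf.append(p + w)
    return periodic, noise, buf[len(history):]

# Example: F = 100 Hz at an 8 kHz sampling rate gives an 80-sample period.
period = pitch_period_samples(100, 8000)
history = [1.0] + [0.0] * (period - 1)         # one pulse per pitch period
periodic, noise, excitation = build_excitation(
    history, period, 160, noise_gain=0.0)
```

With the noise gain set to zero, the pulse in the history reappears once per pitch period, scaled by the pitch gain each time.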
Speech enhancement is needed because the more the speech coder is based on a speech production model, the less able it is to render faithful reproductions of non-speech sounds that are passed through the speech coder. Noise does not fit traditional speech production models, so non-speech sounds that pass through such a coder sound peculiar and annoying. The noise itself may also be considered annoying by many people. Speech enhancement has never been shown to improve intelligibility but has often been shown to improve the quality of uncoded speech.
According to previous practice, speech enhancement was performed prior to speech coding, in a speech enhancement system separated from a speech coder/decoder, as shown in FIG. 2. With reference to FIG. 2, the speech enhancement module 6 is separated from the speech coder/decoder 8. The speech enhancement module 6 receives input speech. The speech enhancement module 6 enhances (e.g., removes noise from) the input speech and produces enhanced speech.
The speech coder/decoder 8 receives the already enhanced speech from the speech enhancement module 6. The speech coder/decoder 8 generates output speech based on the already-enhanced speech. The speech enhancement module 6 is not integral with the speech coder/decoder 8.
Previous attempts at speech enhancement and coding first cleaned up the speech as a whole, and then coded it, setting the amount of enhancement via "tuning".