There are many environments where noisy conditions interfere with speech, such as the inside of a car, a street, or a busy office. The severity of background noise varies from the gentle hum of a fan inside a computer to a cacophonous babble in a crowded cafe. This background noise not only directly interferes with a listener's ability to understand a speaker's speech, but can cause further unwanted distortions if the speech is encoded or otherwise processed. Speech enhancement is an effort to process the noisy speech for the benefit of the intended listener, be it a human, speech recognition module, or anything else. For a human listener, it is desirable to increase the perceptual quality and intelligibility of the perceived speech, so that the listener understands the communication with minimal effort and fatigue.
It is usually the case that for a given speech enhancement scheme, a tradeoff must be made between the amount of noise removed and the distortion introduced as a side effect. If too much noise is removed, the resulting distortion can result in listeners preferring the original noise scenario to the enhanced speech. Preferences are based on more than just the energy of the noise and distortion: unnatural sounding distortions become annoying to humans when just audible, while a certain elevated level of "natural sounding" background noise is well tolerated. Residual background noise also serves to perceptually mask slight distortions, making its removal even more troublesome.
Speech enhancement can be broadly defined as the removal of additive noise from a corrupted speech signal in an attempt to increase the intelligibility or quality of speech. In most speech enhancement techniques, the noise and speech are generally assumed to be uncorrelated. Single channel speech enhancement is the simplest scenario, where only one version of the noisy speech is available, which is typically the result of recording someone speaking in a noisy environment with a single microphone.
FIG. 1 illustrates a speech enhancement setup for N noise sources for a single-channel system. For the single channel case illustrated in FIG. 1, exact reconstruction of the clean speech signal is usually impossible in practice. So speech enhancement algorithms must strike a balance between the amount of noise they attempt to remove and the degree of distortion that is introduced as a side effect. Since any noise component at the microphone cannot in general be distinguished as coming from a specific noise source, the sum of the responses at the microphone from each noise source is denoted as a single additive noise term.
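The single additive noise term described above can be sketched as follows. This is a minimal illustration, not part of the disclosed system: the function names, the 8 kHz sampling rate, the 200 Hz test tone, and the two synthetic noise sources ("fan" and "babble") are all assumptions made for the example, and at least one noise source is assumed to be present.

```python
import math
import random

def mix_single_channel(clean, noise_sources):
    """Return (noisy, noise_sum) where noisy[n] = clean[n] + sum_i noise_i[n].

    The microphone cannot tell the noise sources apart, so their
    responses are collapsed into one additive noise term.
    """
    noise_sum = [sum(samples) for samples in zip(*noise_sources)]
    noisy = [s + d for s, d in zip(clean, noise_sum)]
    return noisy, noise_sum

def snr_db(clean, noise):
    """Signal-to-noise ratio in decibels, from total energies."""
    signal_energy = sum(s * s for s in clean)
    noise_energy = sum(d * d for d in noise)
    return 10.0 * math.log10(signal_energy / noise_energy)

# Hypothetical test data: a 200 Hz tone stands in for clean speech.
rng = random.Random(0)
num_samples = 800
clean = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(num_samples)]
fan = [0.05 * rng.gauss(0, 1) for _ in range(num_samples)]
babble = [0.10 * rng.gauss(0, 1) for _ in range(num_samples)]

noisy, noise_sum = mix_single_channel(clean, [fan, babble])
input_snr = snr_db(clean, noise_sum)
```

Only `noisy` would be available to a single-channel enhancement scheme; `clean` and `noise_sum` are kept here solely so that the input signal-to-noise ratio can be computed for the example.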
Speech enhancement has a number of potential applications. In some cases, a human listener observes the output of the speech enhancement directly, while in others speech enhancement is merely the first stage in a communications channel and might be used as a preprocessor for a speech coder or speech recognition module. Such a variety of different application scenarios places very different demands on the performance of the speech enhancement module, so any speech enhancement scheme ought to be developed with the intended application in mind. Additionally, many well-known speech enhancement processes perform very differently with different speakers and noise conditions, making robustness in design a primary concern. Implementation issues such as delay and computational complexity are also considered.
Speech can be modeled as the output of an acoustic filter (i.e., the vocal tract) where the frequency response of the filter carries the message. Humans constantly change properties of the vocal tract to convey messages by changing the frequency response of the vocal tract.
The input signal to the vocal tract is a mixture of harmonically related sinusoids and noise. "Pitch" is the fundamental frequency of the sinusoids. "Formants" correspond to the resonant frequencies of the vocal tract.
A speech coder works in the digital domain and is typically deployed after an analog-to-digital (A/D) converter, so that its input is digitized speech. The speech coder breaks the speech into constituent parts on an interval-by-interval basis. Intervals are chosen based on the amount of compression or complexity of the digitized speech. The intervals are commonly referred to as frames or sub-frames. The constituent parts include: (a) gain components to indicate the loudness of the speech; (b) spectrum components to indicate the frequency response of the vocal tract, where the spectrum components are typically represented by linear prediction coefficients ("LPCs") and/or cepstral coefficients; and (c) excitation signal components, which include a sinusoidal or periodic part from which pitch is captured, and a noise-like part.
To make the gain components, gain is measured for an interval to normalize speech into a typical range. This normalization is important for running a fixed-point processor on the speech.
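The per-interval gain measurement can be illustrated with a short sketch. It assumes, for the example only, that gain is measured as the root-mean-square (RMS) level of each frame and that frames are scaled to a hypothetical target level of 0.25; the text above does not specify either choice, and the function names are illustrative.

```python
import math

def frame_gain(frame):
    """Measure the gain of one frame as its RMS level (an assumed definition)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def normalize_frames(signal, frame_len, target_rms=0.25, floor=1e-9):
    """Split signal into non-overlapping frames and scale each to target_rms.

    Returns (gains, frames): the measured gain of every frame and the
    normalized frames. Trailing samples that do not fill a frame are dropped.
    The floor avoids division by zero on silent frames.
    """
    gains, frames = [], []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        gain = frame_gain(frame)
        scale = target_rms / max(gain, floor)
        gains.append(gain)
        frames.append([x * scale for x in frame])
    return gains, frames

# Hypothetical input: a quiet frame followed by a loud frame (160 samples each).
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
signal = [0.01 * x for x in tone] + [0.5 * x for x in tone]

gains, frames = normalize_frames(signal, frame_len=160)
```

After normalization both frames occupy the same numeric range, while `gains` retains the loudness information that was removed.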
In the time domain, linear prediction coefficients (LPCs) are the weights of a linear sum of previous samples that is used to predict the next sample. Cepstral coefficients can be determined from the LPCs, and vice versa. Cepstral coefficients can also be determined using a fast Fourier transform (FFT).
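As an illustration of these relationships, the sketch below derives LPCs from autocorrelation values with the Levinson-Durbin recursion, predicts the next sample as a weighted sum of previous samples, and converts the LPCs to cepstral coefficients with the standard all-pole recursion. The choice of Levinson-Durbin, the function names, and the sign convention (prediction = sum of a[k] times x[n-k]) are assumptions for the example; the text above does not prescribe a particular method.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sample list x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for LPCs a[1..order] from autocorrelation r (assumes r[0] > 0).

    Returns (lpc, err) where lpc[k-1] weights the sample k steps back
    and err is the remaining prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err

def predict_next(history, lpc):
    """Predict the next sample as a weighted sum of the most recent samples."""
    return sum(c * history[-k] for k, c in enumerate(lpc, start=1))

def lpc_to_cepstrum(lpc, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1 / (1 - sum a_k z^-k)."""
    p = len(lpc)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = lpc[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * lpc[n - k - 1]
        c[n] = acc
    return c[1:]

# Hypothetical autocorrelation of a first-order process with r[k] = 0.9 ** k.
r = [1.0, 0.9, 0.81]
lpc, err = levinson_durbin(r, order=2)
ceps = lpc_to_cepstrum(lpc, n_ceps=3)
next_sample = predict_next([1.0, 2.0], lpc)
```

For this first-order example the recursion recovers a single non-zero weight of about 0.9, so the predicted next sample is about 0.9 times the most recent one.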
The bandwidth of a telephone channel is limited to 3.5 kHz. Upper (higher-frequency) formants can be lost in coding.
Noise affects speech coding, and the spectrum analysis can be adversely affected. The speech spectrum is flattened out by noise, and formants can be lost in coding. Calculation of the LPC and the cepstral coefficients can be affected.
The excitation signal (or "residual signal") components are determined after or separate from the gain components and the spectrum components by breaking the speech into a periodic part (the fundamental frequency) and a noise part. The processor looks back one (pitch) period (1/F) of the fundamental frequency (F) of the vocal tract to take the pitch, and makes the noise part from white noise. A sinusoidal or periodic part and a noise-like part are thus obtained.
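The look-back of one pitch period can be sketched as follows. This is an illustrative simplification under stated assumptions: the pitch lag is estimated by normalized autocorrelation, the periodic part is predicted from the samples one pitch period earlier, and the noise-like part is simply whatever remains, rather than being built from white noise as described above. The function names, the 8 kHz sampling rate, the lag search range of 20 to 80 samples, and the synthetic test signal are all hypothetical.

```python
import math
import random

def estimate_pitch_lag(x, min_lag, max_lag):
    """Return the lag (in samples) with the highest normalized autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        cur = x[lag:]
        past = x[:len(x) - lag]
        num = sum(c * p for c, p in zip(cur, past))
        den = math.sqrt(sum(c * c for c in cur) * sum(p * p for p in past))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def split_excitation(x, lag):
    """Split x[lag:] into a periodic part and a noise-like remainder.

    The periodic part is a scaled copy of the signal one pitch period
    earlier; the scale is the least-squares prediction gain.
    """
    cur = x[lag:]
    past = x[:len(x) - lag]
    past_energy = sum(p * p for p in past)
    cross = sum(c * p for c, p in zip(cur, past))
    gain = cross / past_energy if past_energy > 0.0 else 0.0
    periodic = [gain * p for p in past]
    noise_like = [c - q for c, q in zip(cur, periodic)]
    return periodic, noise_like

# Hypothetical excitation: 160 Hz pitch at 8 kHz (period of 50 samples),
# one extra harmonic, plus a small amount of white noise.
rng = random.Random(1)
x = [math.sin(2 * math.pi * n / 50)
     + 0.5 * math.sin(4 * math.pi * n / 50)
     + 0.01 * rng.gauss(0, 1)
     for n in range(400)]

lag = estimate_pitch_lag(x, min_lag=20, max_lag=80)
periodic, noise_like = split_excitation(x, lag)
```

The search range is deliberately kept below twice the true period so that the estimator does not lock onto a multiple of the pitch period, which is a common failure of simple autocorrelation methods.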
Speech enhancement is needed because the more the speech coder is based on a speech production model, the less able it is to render faithful reproductions of non-speech sounds that are passed through the speech coder. Noise does not fit traditional speech production models, so non-speech sounds that pass through such a coder sound peculiar and annoying. The noise itself may also be considered annoying by many people. Speech enhancement has never been shown to improve intelligibility but has often been shown to improve the quality of uncoded speech.
According to previous practice, speech enhancement was performed prior to speech coding, in a speech enhancement system separated from a speech coder/decoder, as shown in FIG. 2. With reference to FIG. 2, the speech enhancement module 6 is separated from the speech coder/decoder 8. The speech enhancement module 6 receives input speech. The speech enhancement module 6 enhances (e.g., removes noise from) the input speech and produces enhanced speech.
The speech coder/decoder 8 receives the already enhanced speech from the speech enhancement module 6. The speech coder/decoder 8 generates output speech based on the already-enhanced speech. The speech enhancement module 6 is not integral with the speech coder/decoder 8.
Previous attempts at speech enhancement and coding first cleaned up the speech as a whole, and then coded it, setting the amount of enhancement via "tuning".
According to an exemplary embodiment of the invention, a system for enhancing and coding speech performs the steps of receiving digitized speech and enhancing the digitized speech to extract component parts of the digitized speech. The digitized speech is enhanced differently for each of the component parts extracted.
According to an aspect of the invention, an apparatus for enhancing and coding speech includes a speech coder that receives digitized speech. A spectrum signal processor within the speech coder determines spectrum components of the digitized speech. An excitation signal processor within the speech coder determines excitation signal components of the digitized speech. A first speech enhancement system within the speech coder processes the spectrum components. A second speech enhancement system within the speech coder processes the excitation signal components.
Other features and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the features of the invention.