The invention relates to electronic devices, and, more particularly, to speech analysis and synthesis devices and systems.
Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 KHz; but the band of 100 Hz to 5 KHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog electrical voltage signal stream (e.g., microphone) for transmission and reconversion to an acoustic signal stream (e.g., loudspeaker) for reception.
The advantages of digital electrical signal transmission led to a conversion from analog to digital telephone transmission beginning in the 1960s. Typically, digital telephone signals arise from sampling analog signals at 8 KHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electrical signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 Kbps (kilobits per second). Many communications applications, such as digital cellular telehone, cannot handle such a high transmission rate, and this has inspired various speech compression methods.
The storage of speech information in analog format (e.g., on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 KHz would require about 5 MB (megabytes) of storage. This demands speech compression analogous to digital transmission compression.
One approach to speech compression models the physiological generation of speech and thereby reduces the necessary information transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train for voiced sounds or white noise for unvoiced sounds followed by amplification or gain to adjust the loudness. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision plus adjusting the filter coefficients and the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976).
More particularly, the linear prediction method partitions a stream of speech samples s(n) into “frames” of, for example, 180 successive samples (22.5 msec intervals for a 8 KHz sampling rate); and the samples in a frame then provide the data for computing the filter coefficients for use in coding and synthesis of the sound associated with the frame. Each frame generates coded bits for the linear prediction filter coefficients (LPC), the pitch, the voiced/unvoiced decision, and the gain. This approach of encoding only the model parameters represents far fewer bits than encoding the entire frame of speech samples directly, so the transmission rate may be only 2.4 Kbps rather than the 64 Kbps of PCM. In practice, the LPC coefficients must be quantized for transmission, and the sensitivity of the filter behavior to the quantization error has led to quantization based on the Line Spectral Frequencies (LSF) representation.
To improve the sound quality, further information may be extracted from the speech, compressed and transmitted or stored along with the LPC coefficients, pitch, voicing, and gain. For example, the codebook excitation linear prediction (CELP) method first analyzes a speech frame to find the LPC filter coefficients, and then filters the frame with the LPC filter. Next, CELP determines a pitch period from the filtered frame and removes this periodicity with a comb filter to yield a noise-looking excitation signal. Lastly, CELP encodes the excitation signals using a codebook. Thus CELP transmits the LPC filter coefficients, pitch, gain, and the codebook index of the excitation signal.
The advent of digital cellular telephones has emphasized the role of noise suppression in speech processing, both coding and recognition. Customer expectation of high performance even in extreme car noise situations plus the demand to move to progressively lower data rate speech coding in order to accommodate the ever-increasing number of cellular telephone customers have contributed to the importance of noise suppression. While higher data rate speech coding methods tend to maintain robust performance even in high noise environments, that typically is not the case with lower data rate speech coding methods. The speech quality of low data rate methods tends to degrade drastically with high additive noise. Noise supression to prevent such speech quality losses is important, but it must be achieved without introducing any undesirable artifacts or speech distortions or any significant loss of speech intelligibility. These performance goals for noise suppression have existed for many years, and they have recently come to the forefront due to digital cellular telephone application.
FIG. 1a schematically illustrates an overall system 100 of modules for speech acquisition, noise suppression, analysis, transmission/storage, synthesis, and playback. A microphone converts sound waves into electrical signals, and sampling analog-to-digital converter 102 typically samples at 8 KHz to cover the speech spectrum up to 4 KHz. System 100 may partition the stream of samples into frames with smooth windowing to avoid discontinuities. Noise suppression 104 filters a frame to suppress noise, and analyzer 106 extracts LPC coefficients, pitch, voicing, and gain from the noise-suppressed frame for transmission and/or storage 108. The transmission may be any type used for digital information transmission, and the storage may likewise be any type used to store digital information. Of course, types of encoding analysis other than LPC could be used. Synthesizer 110 combines the LPC coefficients, pitch, voicing, and gain information to synthesize frames of sampled speech which digital-to-analog convertor (DAC) 112 converts to analog signals to drive a loudspeaker or other playback device to regenerate sound waves.
FIG. 1b shows an analogous system 150 for voice recognition with noise suppression. The recognition analyzer may simply compare input frames with frames from a database or may analyze the input frames and compare parameters with known sets of parameters. Matches found between input frames and stored information provides recognition output.
One approach to noise suppression in speech employs spectral subtraction and appears in Boll, Suppression of Acoustic Noise in Speech Using Spectral Subtraction, 27 IEEE Tr.ASSP 113 (1979), and Lim and Oppenheim, Enhancement and Bandwidth Compression of Noisy Speech, 67 Proc.IEEE 1586 (1979). Spectral subtraction proceeds roughly as follows. Presume a sampled speech signal s(j) with uncorrelated additive noise n(j) to yield an observed windowed noisy speech y(j)=s(j)+n(j). These are random processes over time. Noise is assumed to be a stationary process in that the process's autocorrelation depends only on the difference of the variables; that is, there is a function rN(.) such that:E{n(j)n(i)}=rN(i−j)where E is the expectation. The Fourier transform of the autocorrelation is called the power spectral density, PN(ω). If speech were also a stationary process with autocorrelation rS(j) and power spectral density PS(ω), then the power spectral densities would add due to the lack of correlation:PY(ω)=PS(ω)+PN(ω)Hence, an estimate for PSs(ω), and thus s(j), could be obtained from the observed noisy speech y(j) and the noise observed during intervals of (presumed) silence in the observed noisy speech. In particular, take PY(ω) as the squared magnitude of the Fourier transform of y(j) and PN(ω) as the squared magnitude of the Fourier transform of the observed noise.
Of course, speech is not a stationary process, so Lim and Oppenheim modified the approach as follows. Take s(j) not to represent a random process but rather to represent a windowed speech signal (that is, a speech signal which has been multiplied by a window function), n(j) a windowed noise signal, and y(j) the resultant windowed observed noisy speech signal. Then Fourier transforming and multiplying by complex conjugates yields:|Y(ω)|2=|S(ω)|2+|N(ω)|2+2Re{S(ω)N(ω)*}For ensemble averages the last term on the righthand side of the equation equals zero due to the lack of correlation of noise with the speech signal. This equation thus yields an estimate, S^(ω), for the speech signal Fourier transform as:|S^(ω)|2=|Y(ω)|2−E{|N(ω)|2}
This resembles the preceding equation for the addition of power spectral densities.
An autocorrelation approach for the windowed speech and noise signals simplifies the mathematics. In particular, the autocorrelation for the speech signal is given byrS(j)=ΣiS(i)S(i+j),with similar expressions for the autocorrelation for the noisy speech and the noise. Thus the noisy speech autocorrelation is:rY(j)=rS(j)+rN(j)+cSN(j)+SN(−j)where cSN(.) is the cross correlation of s(j) and n(j). But the speech and noise signals should be uncorrelated, so the cross correlations can be approximated as 0. Hence, rY(j)=rS(j)+rN(j). And the Fourier transforms of the autocorrelations are just the power spectral densities, soPY(ω)=PS(ω)+PN(ω)
Of course, PY(ω) equals |Y(ω)|2 with Y(ω) the Fourier transform of y(j) due to the autocorrelation being just a convolution with a time-reversed variable.
The power spectral density PN(ω) of the noise signal can be estimated by detection during noise-only periods, so the speech power spectral estimate becomes|S^(ω)|2=|Y(ω)|2−|N(ω)|2−PY(ω)−PN(ω)which is the spectral subtraction.
The spectral subtraction method can be interpreted as a time-varying linear filter H(ω) so that S^(ω)=H(ω)Y(ω) which the foregoing estimate then defines as:H(ω)2=[PY(ω)−PN(ω)]/PY(ω)The ultimate estimate for the frame of windowed speech, s^(j), then equals the inverse Fourier transform of S^(ω), and then combining the estimates from successive frames (“overlap add”) yields the estimated speech stream.
This spectral subtraction can attenuate noise substantially, but it has problems including the introduction of fluctuating tonal noises commonly referred to as musical noises.
The Lim and Oppenheim article also describes an alternative noise suppression approach using noncausal Wiener filtering which minimizes the mean-square error. That is, again S^(ω)=H(ω)Y(ω) but with H(ω) now given by:H(ω)=PS(ω)/[PS(ω)+PN(ω)]This Wiener filter generalizes to:H(ω)=[PS(ω)/[PS(ω)+αPN(ω)]]βwhere constants α and β are called the noise suppression factor and the filter power, respectively. Indeed, α=1 and β=½ leads to the spectral subtraction method in the following.
A noncausal Wiener filter cannot be directly applied to provide an estimate for s(j) because speech is not stationary and the power spectral density PS(ω) is not known. Thus approximate the noncausal Wiener filter by an adaptive generalized Wiener filter which uses the squared magnitude of the estimate S^(ω) in place of PS(ω):H(ω)=(|S^(ω)|2/[|S^(ω)|2+αE{|N(ω)|2}])β
Recalling S^(ω)=H(ω)Y(ω) and then solving for |S^(ω)| in the β=½ case yields:|S^(ω)|=[|Y(ω)|2−αE{|N(ω)|2}]1/2 which just replicates the spectral subtraction method when α=1.
However, this generalized Wiener filtering has problems including how to estimate S^, and estimators usually apply an iterative approach with perhaps a half dozen iterations which increases computational complexity.
Ephraim, A Minimum Mean Square Error Approach for Speech Enhancement, Conf.Proc. ICASSP 829 (1990), derived a Wiener filter by first analyzing noisy speech to find linear prediction coefficients (LPC) and then resynthesizing an estimate of the speech to use in the Wiener filter.
In contrast, O'Shaughnessy, Speech Enhancement Using Vector Quantization and a Formant Distance Measure, Conf.Proc. ICASSP 549 (1988), computed noisy speech formants and selected quantized speech codewords to represent the speech based on formant distance; the speech was resynthesized from the codewords. This has problems including degradation for high signal-to-noise signals because of the speech quality limitations of the LPC synthesis.
The Fourier transforms of the windowed sampled speech signals in systems 100 and 150 can be computed in either fixed point or floating point format. Fixed point is cheaper to implement in hardware but has less dynamic range for a comparable number of bits. Automatic gain control limits the dynamic range of the speech samples by adjusting magnitudes according to a moving average of the preceding sample magnitudes, but this also destroys the distinction between loud and quiet speech. Further, the acoustic energy may be concentrated in a narrow frequency band and the Fourier transform will have large dynamic range even for speech samples with relatively constant magnitude. To compensate for such overflow potential in fixed point format, a few bits may he reserved for large Fourier transform dynamic range; but this implies a loss of resolution for small magnitude samples and consequent degradation of quiet speech. This is especially true for systems which follow a Fourier transform with an inverse Fourier transform.