As is known, the first step for automatic speech recognition (ASR) is front-end processing, during which a set of parameters characterizing a speech segment is determined. Generally, the set of parameters should be discriminative, speaker-independent and environment-independent.
For the set to be discriminative, it should be sufficiently different for speech segments carrying different linguistic messages. A speaker-independent set should be similar for speech segments carrying the same linguistic message but uttered by different speakers, while an environment-independent set should be similar for speech segments which carry the same linguistic message but are produced in different environments (soft or loud, fast or slow, with or without emotion) and processed by different communication channels.
U.S. Pat. No. 4,433,210, Ostrowski et al., discloses an integrated circuit phoneme-based speech synthesizer. A vocal tract comprised of a fixed resonant filter and a plurality of tunable resonant filters is implemented utilizing a capacitive switching technique to achieve relatively low frequencies of speech without large valued componentry. The synthesizer also utilizes a digital transition circuit for transitioning values of the vocal tract from phoneme to phoneme. A glottal source circuit generates a glottal pulse signal capable of being spectrally shaped in any manner desired.
U.S. Pat. No. 4,542,524, Laine, discloses a model and filter circuit for modeling an acoustic sound channel, uses of the model and a speech synthesizer for applying the model. An electrical filter system is employed having a transfer function substantially consistent with an acoustic transfer function modeling the sound channel. The sound channel transfer function is approximated by mathematical decomposition into partial transfer functions, each having a simpler spectral structure and approximated by a realizable rational transfer function. Each rational transfer function has a corresponding electronic filter, the filters being cascaded.
U.S. Pat. No. 4,709,390, Atal et al., discloses a speech coder for linear predictive coding (LPC). A speech pattern is divided into successive time frames. Spectral parameter and multipulse excitation signals are generated for each frame and voiced excitation signal intervals of the speech pattern are identified, one of which is selected. The excitation and spectral parameter signals for the remaining voiced intervals are replaced by the multipulse excitation signal and the spectral parameter signals of the selected interval, thereby substantially reducing the number of bits corresponding to the succession of voiced intervals.
U.S. Pat. No. 4,797,926, Bronson et al., discloses a speech analyzer and synthesizer system. The analyzer is utilized for encoding and transmitting, for each speech frame, the frame energy, speech parameters defining the vocal tract (LPC coefficients), a fundamental frequency and offsets representing the difference between individual harmonic frequencies and integer multiples of the fundamental frequency for subsequent speech synthesis. The synthesizer, responsive to the transmitted information, calculates the phases and amplitudes of the fundamental frequency and the harmonics and uses the calculated information to generate replicated speech. The invention further utilizes either multipulse or noise excitation modeling for the unvoiced portion of the speech.
U.S. Pat. No. 4,805,218, Bamberg et al., discloses a method for speech analysis and speech recognition which calculates one or more difference parameters for each of a sequence of acoustic frames. The difference parameters can be slope parameters, which are derived by finding the difference between the energy of a given spectral parameter of a given frame and the energy, in a nearby frame, of a spectral parameter associated with a different frequency band, or energy difference parameters, which are calculated as a function of the difference between a given spectral parameter in one frame and a spectral parameter in a nearby frame representing the same frequency band.
U.S. Pat. No. 4,885,790, McAulay et al., discloses a speech analysis/synthesis technique wherein a speech waveform is characterized by the amplitudes, frequencies and phases of component sine waves. Selected frames of samples from the waveform are analyzed to extract a set of frequency components, which are tracked from one frame to the next. Values of the components from one frame to the next are interpolated to obtain a parametric representation of the waveform, allowing a synthetic waveform to be constructed by generating a series of sine waves corresponding to the parametric representation.
U.S. Pat. No. 4,897,878, Boll et al., discloses a method and apparatus for noise suppression for speech recognition systems employing the principle of least mean square estimation implemented with conditional expected values. A series of optimal estimators are computed and employed, with their variances, to implement a noise-immune metric, which enables the system to substitute a noisy distance with an expected value. The expected value is calculated according to combined speech and noise data occurring in the bandpass filter domain.
U.S. Pat. No. 4,908,865, Doddington et al., discloses a speaker-independent speech recognition method and system. A plurality of reference frames of reference feature vectors representing reference words are stored. Spectral feature vectors are generated by a linear predictive coder for each frame of the input speech signals, the vectors then being transformed to a plurality of filter bank representations. The representations are then transformed to an identity matrix of transformed input feature vectors and feature vectors of adjacent frames are concatenated to form the feature vector of a frame-pair. For each reference frame pair, a transformer and a comparator compute the likelihood that each input feature vector for a frame-pair was produced by each reference frame.
U.S. Pat. No. 4,932,061, Kroon et al., discloses a multi-pulse excitation linear predictive speech coder comprising an LPC analyzer, a multi-pulse excitation generator, means for forming an error signal representative of the difference between an original speech signal and a synthetic speech signal, a filter for weighting the error signal and means responsive thereto for generating pulse parameters controlling the excitation generator, thereby minimizing a predetermined measure of the weighted error signal.
U.S. Pat. No. 4,975,955, Taguchi, discloses a speech signal coding and/or decoding system comprising an LPC analyzer for deriving input speech parameters which are then attenuated and fed to an LSP analyzer for deriving LSP parameters. The LSP parameters are then supplied to a pattern matching device which selects from a reference pattern memory the reference pattern which most closely resembles the input pattern from the LSP analyzer.
U.S. Pat. No. 4,975,956, Liu et al., discloses a low-bit-rate speech coder using LPC data reduction processing. The coder employs vector quantization of LPC parameters, interpolation and trellis coding for improved speech coding at low bit rates utilizing an LPC analysis module, an LSP conversion module and a vector quantization and interpolation module. The coder automatically identifies a speaker's accent and selects the corresponding vocabulary of codewords in order to more intelligibly encode and decode the speaker's speech.
Additionally, a new front-end processing technique for speech analysis was discussed in Dr. Hynek Hermansky's article titled "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am. 87(4), April 1990, which is hereby expressly incorporated by reference in its entirety. In the PLP technique, an estimate of the auditory spectrum is derived utilizing three well-known concepts from the psychophysics of hearing: the critical-band spectral resolution, the equal-loudness curve and the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model, resulting in a computationally efficient analysis that yields a low-dimensional representation of speech, properties that are useful in speaker-independent automatic speech recognition. A flow chart detailing the PLP technique is shown in FIG. 1.
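By way of illustration only, the PLP steps named above can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the procedure of the cited article: the Bark warping and equal-loudness constants are given as recalled from that article and should be checked against it, the critical-band integration uses rectangular one-Bark bands rather than a masking curve, the equal-loudness weight is applied per frequency bin before summation, and all function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    # Bark warping, assumed form: 6 * asinh(omega / (1200 * pi)) = 6 * asinh(f / 600).
    return 6.0 * math.asinh(f_hz / 600.0)

def equal_loudness(f_hz):
    # Equal-loudness weight (assumed constants; verify against the cited article).
    w2 = (2.0 * math.pi * f_hz) ** 2
    return ((w2 + 56.8e6) * w2 * w2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def auditory_spectrum(power, sample_rate):
    # power: short-term power spectrum, bins evenly spaced from 0 Hz to Nyquist.
    nyquist = sample_rate / 2.0
    n_bands = int(math.floor(hz_to_bark(nyquist))) + 1
    bands = [0.0] * n_bands
    for k, p in enumerate(power):
        f = nyquist * k / (len(power) - 1)
        band = min(int(hz_to_bark(f)), n_bands - 1)
        # Critical-band resolution (simplified to rectangular bands)
        # combined with equal-loudness weighting.
        bands[band] += equal_loudness(f) * p
    # Intensity-loudness power law: cube-root-like amplitude compression.
    return [b ** 0.33 for b in bands]

def autocorrelation(spectrum, order):
    # Inverse DFT of the even-symmetric extension of the spectrum.
    m = len(spectrum)
    r = []
    for n in range(order + 1):
        total = spectrum[0] + ((-1) ** n) * spectrum[-1]
        total += 2.0 * sum(spectrum[k] * math.cos(math.pi * n * k / (m - 1))
                           for k in range(1, m - 1))
        r.append(total / (2.0 * (m - 1)))
    return r

def levinson_durbin(r, order):
    # Solves for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def plp_all_pole(power, sample_rate, order=5):
    # Auditory spectrum followed by an autoregressive all-pole fit.
    spec = auditory_spectrum(power, sample_rate)
    return levinson_durbin(autocorrelation(spec, order), order)
```

For example, `plp_all_pole(power, 8000.0, order=5)` would return six predictor coefficients and a residual error for one frame; in this sketch, a lower `order` yields a smoother model of the auditory spectrum.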
Most current ASR front-ends are based on robust and reliable estimation of instantaneous speech parameters. Typically, the front-ends are discriminative, but are not speaker- or environment-independent. While training of the ASR system (i.e., exposure to a large number of speakers and environmental conditions) can compensate for this shortcoming, such training is expensive and seldom exhaustive. The PLP front-end is relatively speaker-independent, as it allows for the effective suppression of the speaker-dependent information through the selection of the particular model order.
However, most speech parameter estimation techniques, including the PLP technique, are sensitive to environmental conditions since they utilize absolute spectral values that are vulnerable to deformation by steady-state non-speech factors, such as channel conditions and the like.
Non-linguistic factors, such as environmental noise and linear spectral modification, can wreak havoc with speech processing systems, and in particular, can greatly increase the errors in a speech recognition system. The application of a linear time-invariant filtering operation to a speech signal during recognizer testing can significantly impact performance, as can the addition of noise. While real-life conditions include many other effects that are difficult to control (such as non-linear and/or phoneme-specific distortions), the simple linear operations described above are sufficient to seriously impact performance. It has been noted that a simple change of microphones between training and testing sessions can increase errors by a large factor (e.g. from two to ten).
It is desirable to provide some robustness against errors caused by both convolutional effects and additive noise since, in the general case, both are present: any real speech input includes the effects of environmental echo response and microphone impulse response, as well as additive noise.
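The convolutional part of this degradation can be illustrated with a short Python sketch. It assumes a circular convolution over a single short frame so that the relation holds exactly, and the function names and example values are hypothetical. Under that assumption, a linear time-invariant channel h shifts the log power spectrum of any input by the same per-frequency offset, namely log|H[k]|^2, whereas additive noise does not reduce to such a fixed offset.

```python
import cmath
import math

def dft(x):
    # Direct discrete Fourier transform (adequate for short frames).
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(x, h):
    # Channel model: y[t] = sum_m h[m] * x[(t - m) mod n].
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]

def log_power_spectrum(x):
    # Assumes no DFT bin of x is exactly zero.
    return [math.log(abs(c) ** 2) for c in dft(x)]

def degrade(x, h, noise=None):
    # y = (x convolved with h) + noise; noise defaults to none.
    y = circular_convolve(x, h)
    if noise is not None:
        y = [a + b for a, b in zip(y, noise)]
    return y

def channel_offset(x, h):
    # Per-bin change in log power spectrum caused by the channel alone.
    clean = log_power_spectrum(x)
    filtered = log_power_spectrum(degrade(x, h))
    return [f - c for f, c in zip(filtered, clean)]
```

Calling `channel_offset` with two different frames and the same `h` gives the same offsets in this sketch, which is why a steady-state channel deforms absolute spectral values in a signal-independent way.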