Almost all current signal processing technology for speech recognition and speaker verification or identification is based on a variant of the frequency spectrogram, a representation of the energy in a signal as a function of frequency and time. Spectrographic processing was originally implemented in analog electronic hardware, including analog filter banks (e.g. the voice print); spectral analysis is now implemented primarily with the techniques of digital signal processing. The methods of spectral analysis include fast Fourier transformation (FFT), power spectral density (PSD) analysis, and the extraction of linear predictive coding (LPC) and cepstral coefficients. Other methods include processing by digital filter banks comprising filters designed by standard methods and filters whose design is purportedly based on some feature of the response of the auditory system. Spectrographic processing is usually applied with the aim of extracting important linguistic features from the speech signal, such as the frequencies and times of occurrence of the formants. These speech features are often obtained by comparing the spectrographic patterns to templates or rules. Other conventional signal processing techniques are also used to detect speech features. For example, autocorrelation functions are used to extract the pitch of a voiced utterance, and zero-crossing profiles are used to discriminate between voiced and unvoiced segments of speech (Rabiner, L. R. and Schafer, R. W. (1978): Digital Processing of Speech Signals. Englewood Cliffs (N.J.): Prentice-Hall).
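The two conventional techniques named at the end of the preceding paragraph can be illustrated with a minimal sketch. It is not taken from the cited reference: the function names, the 60 to 400 Hz search range, and the use of a single frame are assumptions made here for illustration only.

```python
import math

def autocorr_pitch_hz(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch as fs / lag, where lag maximizes the autocorrelation.

    The search range fmin..fmax is an illustrative assumption.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]          # remove any DC offset
    lag_lo = max(1, int(fs / fmax))        # shortest period considered
    lag_hi = min(int(fs / fmin), n - 1)    # longest period considered
    best_lag, best_r = lag_lo, float("-inf")
    for lag in range(lag_lo, lag_hi + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

if __name__ == "__main__":
    fs = 8000  # assumed sampling rate in Hz
    # Synthetic "voiced" frame: a 125 Hz fundamental plus its second harmonic.
    voiced = [math.sin(2 * math.pi * 125 * i / fs)
              + 0.5 * math.sin(2 * math.pi * 250 * i / fs) for i in range(320)]
    print(autocorr_pitch_hz(voiced, fs), zero_crossing_rate(voiced))
```

A periodic, voiced-like frame gives a low zero-crossing rate and a clear autocorrelation peak at its period, whereas a noise-like frame gives a high zero-crossing rate. Because the estimate rests on periodicity within the frame, it degrades when the pitch changes during the frame.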
In general, conventional methods of speech processing suffer from several well-known problems:
Susceptibility to noise. Because the profile of spectral energy that constitutes the spectrogram is sensitive to anything that changes the relative magnitude of in-band energies, spectrographic representations can be severely degraded in situations of practical interest, such as the presence of high background or line noise;
Sensitivity to spectral shaping or bandwidth reduction. The characteristics of the communication channel can affect the spectrum of the input signal, thereby altering the profile of spectral energy and distorting the spectrogram;
Non-selectivity for speech. Spectrographic techniques measure the frequency profile of signal energy irrespective of the source of that energy. They are not inherently selective for speech signals. Sources of signal energy such as line and environmental noise, or non-speech signals such as music or tones, create spectrographic patterns that can result in the misidentification of relevant speech parameters;
Difficulty in estimating formant information. Conventional speech processing methods often have difficulty in estimating the pitch and formant frequencies of a voiced utterance. Speech is a temporally and spectrally complex waveform. Voiced portions of speech comprise epochs of wide spectral bandwidth (corresponding to the glottal, or pitch, pulses) alternating with epochs characterized by a more discrete frequency spectrum (corresponding to the formant frequencies). For spectrographic schemes aimed at the extraction of formant frequencies, the energy in the glottal pulse represents a confounding element. Techniques well known to the art, such as cepstral analysis and pitch-synchronous spectral extraction, have been employed in an attempt to separate the pitch from formant information;
Difficulty in estimating pitch. Speech is non-stationary and non-periodic. In voiced segments of speech, pitch is rarely constant, and autocorrelation techniques for the extraction of pitch, which essentially measure periodicity, can be inaccurate;
Sensitivity to segmentation of input data. In spectrographic sound analysis methods, sound data are usually segmented or windowed into frames (generally 10 to 20 milliseconds long) for analysis. The onset and duration of the frame can affect the accurate localization of spectrographic features in the time and frequency domains. For small frame sizes, spectrographic methods can follow the dynamic character of the speech, but with reduced frequency resolution, whereas for larger frame sizes, the frequency resolution improves at the expense of the resolution of the dynamic time-domain characteristics. Accurate time and frequency localization of formants is difficult because the formant frequencies can vary between adjacent glottal pulses occurring less than 5 milliseconds apart.
Sound is analyzed using a model of the human cochlea which simulates the waveform propagation characteristics of the basilar membrane. Our preferred model is implemented as an array of filters, the frequency and phase response of each of these filters being chosen to substantially match waveform propagation characteristics at equally spaced hair-cell locations along the length of the basilar membrane of the cochlea.
The response sequences computed from the array of filters are then processed by an array of primary feature detectors which are designed to emulate the signal processing characteristics of cells in the brainstem and auditory cortex. The essential attribute of these detectors is that they detect local spatial and temporal patterns of the response of the filters. For example, in the neural-correlation implementation, the primary feature detectors detect patterns in the response of an array of filters that correspond to patterns of the discharge of groups of auditory-nerve fibers. In the phase-coherence implementation, the primary feature detectors detect patterns in spatial and temporal derivatives of the instantaneous phase. In the instantaneous-frequency implementation, the primary feature detectors detect patterns of the instantaneous frequency from a group of channels. The primary feature detectors include local impulse detectors, which detect impulsive features in the stimulus, and local synchrony detectors, which detect synchronous regions of the response. These detectors respond to local spatial and temporal patterns of neural firing or spatio-temporal derivatives of basilar-membrane motion. In the context of these detectors, the term "local" means that each primary feature detector in the array detects only patterns of the response of filter channels over a restricted range of channels and over a restricted interval of time.
The outputs of the primary feature detectors are processed by an array of secondary feature detectors which detect patterns in the response of the array of primary feature detectors. Secondary feature detectors include the local formant detector, which detects the times of occurrence and the frequencies of the formants in speech, and the global pulse detector, which detects the times of occurrence of the glottal pulses. Whereas the term "local" indicates that the response of each channel of the secondary detector depends only upon a restricted range of channels of the primary feature detector stage, the term "global" means that the response of each channel of the secondary detector depends upon a large number of channels of the primary feature detector stage.
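The distinction between "local" and "global" detection drawn above can be made concrete with a toy sketch. It is not the detector design of the invention: the binary channel-by-time activity map, the neighbourhood radii, the 0.6 threshold, and all function names are hypothetical choices made here only to show a restricted range of channels set against a large number of channels.

```python
def toy_response(n_channels=8, n_times=6):
    """Hypothetical binary activity map, indexed as resp[channel][time]."""
    resp = [[0] * n_times for _ in range(n_channels)]
    for ch in range(n_channels):      # broadband event: every channel at t = 2
        resp[ch][2] = 1
    for ch in (3, 4):                 # narrowband event: channels 3-4 at t = 4, 5
        for t in (4, 5):
            resp[ch][t] = 1
    return resp

def local_detector(resp, ch, t, ch_radius=1, t_radius=1):
    """Fraction of active cells in a small channel-by-time neighbourhood of (ch, t)."""
    n_ch, n_t = len(resp), len(resp[0])
    chans = range(max(0, ch - ch_radius), min(n_ch, ch + ch_radius + 1))
    times = range(max(0, t - t_radius), min(n_t, t + t_radius + 1))
    cells = [resp[c][k] for c in chans for k in times]
    return sum(cells) / len(cells)

def global_detector(resp, t, min_fraction=0.6):
    """True when at least min_fraction of all channels are active at time t."""
    active = sum(row[t] for row in resp)
    return active / len(resp) >= min_fraction

if __name__ == "__main__":
    resp = toy_response()
    print(global_detector(resp, 2), global_detector(resp, 4))
    print(local_detector(resp, 3, 4), local_detector(resp, 0, 4))
```

In this toy map the broadband event at one instant triggers the global detector, as a glottal pulse would be expected to, while the narrowband event is visible only to a local detector centred near the active channels.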
They are insensitive to additive noise;
They are insensitive to spectral shaping or bandwidth reduction of the input;
They are selective for the detection of speech features such as pitch and formant information;
They do not require critical data segmentation;
They simultaneously show high temporal and frequency resolution;
Results can be obtained in a computationally efficient manner.
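For comparison with the resolution property listed above, the frame-length trade-off of conventional windowed analysis can be put in numbers. The sketch below assumes a discrete Fourier transform taken over one frame, so that the spacing between frequency bins is the sampling rate divided by the number of samples in the frame; the 8 kHz sampling rate and the function name are assumptions for illustration.

```python
def bin_spacing_hz(fs, frame_ms):
    """Frequency spacing of DFT bins for a frame of frame_ms milliseconds."""
    n_samples = round(fs * frame_ms / 1000.0)  # samples in one frame
    return fs / n_samples

if __name__ == "__main__":
    fs = 8000  # assumed sampling rate in Hz
    for frame_ms in (5, 10, 20):
        print(frame_ms, "ms frame ->", bin_spacing_hz(fs, frame_ms), "Hz per bin")
```

Halving the frame length doubles the bin spacing, which is why a frame short enough to follow changes between glottal pulses less than 5 milliseconds apart cannot also resolve closely spaced frequencies.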
Several schemes have been disclosed in the prior art to process speech using methods specifically derived from an analysis of signal processing in the human auditory system. U.S. Pat. No. 4,536,844, issued to Richard F. Lyon on Aug. 20, 1985, discloses a method and apparatus for simulating auditory response information in which the input signals are analyzed by a filter bank comprising an array of high-order filters, each created from a cascade of linear, time-invariant, second-order digital filter sections followed by a stage of rectification and nonlinear dynamic range compression (automatic gain control). While this process purports to produce a representation similar to the human neural response, the resulting response does not, in fact, correspond to the measured experimental data from auditory-nerve fibers (Pickles, J. O. (1988): Introduction to the Physiology of Hearing. 2nd edition. London: Academic Press). Similar processing schemes are also described in the literature (Seneff, S. (1988): A joint synchrony/mean-rate model of auditory speech processing. Journal of Phonetics 16, 55-76; Kates, J. M. (1991): A time-domain digital cochlear model. IEEE Transactions on Signal Processing 39, 2573-2592). All of these approaches generate an essentially spectral representation of speech.
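The general structure described above (a cascade of second-order sections followed by rectification and compression) can be sketched as follows. This is not the design of the cited patent: the two-pole resonator, the pole radius of 0.95, and the function names are assumptions, and a static power law stands in for the dynamic automatic gain control.

```python
import math

def resonator(x, fc, fs, r=0.95):
    """Second-order all-pole section: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    a1 = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    a2 = -r * r                      # poles at radius r < 1, so the section is stable
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = v + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def channel_response(x, fc, fs, sections=2, exponent=0.3):
    """One channel: cascaded sections, half-wave rectification, compression."""
    y = x
    for _ in range(sections):        # cascading raises the filter order
        y = resonator(y, fc, fs)
    rectified = [max(v, 0.0) for v in y]
    # Static power-law compression, used here in place of a true AGC stage.
    return [v ** exponent for v in rectified]

if __name__ == "__main__":
    fs = 16000  # assumed sampling rate in Hz
    tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(800)]
    out = channel_response(tone, 1000, fs)
    print(sum(out) / len(out))
```

A channel tuned to 1000 Hz responds far more strongly to a 1000 Hz tone than to a 3000 Hz tone, so the set of channel outputs amounts to a compressed measure of energy per frequency band, consistent with the observation that such schemes produce an essentially spectral representation.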
U.S. Pat. No. 4,905,285, issued to Jont B. Allen et al. on Feb. 27, 1990, also discloses a method based on a model that purports to represent the frequency distribution of human neural response. In this method, the speech signal is analyzed by a bank of filters whose frequency response is derived from a mathematical model of the motion of the basilar membrane. The time waveform which constitutes the output of each spectral band is passed through a series of threshold detectors. The times between successive threshold crossings of the detectors are measured and accumulated into an interval histogram. Interval histograms for a plurality of spectral bands are then combined to produce an ensemble histogram. From this histogram, a profile of the dominant average frequency components of an input signal is generated by means of conventional signal processing techniques (inverse Fourier transformation and autocorrelation). U.S. Pat. No. 4,075,423, issued to M. J. Martin et al. on Feb. 21, 1978, discloses a similar scheme based on accumulating a histogram of frequency patterns of detected waveform peaks. Spectrographic processing schemes based on threshold crossings of detected waveform peaks are also well documented in the literature (Niederjohn, R. J. (1985): A zero-crossing consistency method for formant tracking of voiced speech in high noise levels. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, 2; Ghitza, O. (1985): A measure of in-synchrony regions in the auditory nerve firing patterns as a basis of speech vocoding. Proceedings International Conference, Acoustics Speech and Signal Processing).
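The interval-histogram idea described above can be sketched in a few lines. This is an illustration of the general technique only, not the method of either cited patent: the threshold values, the use of upward crossings, and the function names are assumptions, and the subsequent inverse Fourier transformation and autocorrelation steps are omitted in favour of simply reading off the most frequent interval.

```python
from collections import Counter

def upward_crossings(x, threshold):
    """Sample indices at which x rises through the threshold."""
    return [i for i in range(1, len(x)) if x[i - 1] < threshold <= x[i]]

def interval_histogram(x, thresholds):
    """Histogram of sample intervals between successive crossings, per threshold."""
    hist = Counter()
    for thr in thresholds:
        times = upward_crossings(x, thr)
        hist.update(b - a for a, b in zip(times, times[1:]))
    return hist

def ensemble_histogram(bands, thresholds=(0.25, 0.5)):
    """Sum the interval histograms of all spectral bands."""
    total = Counter()
    for band in bands:
        total.update(interval_histogram(band, thresholds))
    return total

def dominant_frequency_hz(hist, fs):
    """Frequency corresponding to the most frequently observed interval."""
    interval, _count = hist.most_common(1)[0]
    return fs / interval
```

For band outputs that oscillate with a common period, the ensemble histogram concentrates at that period, and its reciprocal gives the dominant frequency. Each crossing contributes only one interval, so the histogram fills slowly for bands with long periods.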
There are several significant disadvantages and problems with the neural threshold-crossing methods of the prior art that limit their applicability. Chief among these are the issues of temporal granularity and computational intractability. Threshold-crossings of the neural response model occur only at discrete intervals, which can be spaced milliseconds apart in model fibers with low center frequencies; hence, spectral estimates obtained from the histograms of threshold crossings will be temporally coarse or granular. Computing the complete response of the neural model fibers requires the solution of the nonlinear cochlear model equations for a plurality of parallel channels. The computational load of performing these calculations in real-time or near real-time can be prohibitive. Finally, the neural threshold-crossing methods are not speech specific and thus do not result in the identification of unique speech features.
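The temporal granularity noted above follows from simple arithmetic. The sketch below assumes that a model fiber's response oscillates at roughly its center frequency and crosses a given threshold in a given direction once per cycle; the function name is hypothetical.

```python
def crossing_spacing_ms(center_frequency_hz):
    """Approximate time between successive same-direction threshold crossings."""
    return 1000.0 / center_frequency_hz

if __name__ == "__main__":
    for cf in (100, 200, 1000):
        print(cf, "Hz fiber ->", crossing_spacing_ms(cf), "ms between crossings")
```

Under this assumption a fiber centered at 200 Hz yields one new interval about every 5 milliseconds, and a fiber centered at 100 Hz only about every 10 milliseconds, so a histogram built from such intervals cannot be updated on a finer time scale in those channels.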
The present invention provides a novel signal processing system, modeled on signal processing in the auditory system, that overcomes these and other problems of the prior art.