The present invention relates generally to the field of automatic speech recognition and more particularly to a speech signal feature extraction method and apparatus for use therein which is easily tunable and thereby provides improved performance, especially in a variety of adverse (i.e., noisy) environments.
In Automatic Speech Recognition (ASR) systems, certain characteristics or "features" of the input speech are compared to a corresponding set of features which have been stored in "models" based on an analysis of previously supplied "training" speech. Based on the results of such a comparison, the input speech is identified as being a sequence of a possible set of words, namely, the words of training speech from which the most closely matching model was derived. The process known as "feature extraction" is the first crucial step in the ASR process.
Specifically, feature extraction comprises extracting a predefined set of parameter values, most typically cepstral (i.e., frequency-related) coefficients, from the input speech to be recognized, and then using these parameter values for matching against corresponding sets of parameter values which have been extracted from a variety of training speech utterances and stored in the set of speech models. Based on the results of such a matching process, the input speech can be "recognized", that is, identified as being the particular utterance from which one of the speech models was derived.
Currently, there are two common approaches to feature extraction which are used in automatic speech recognition systems: modeling the human voice production mechanism (i.e., the vocal tract) and modeling the human auditory perception system (i.e., the human cochlea and its processing). For the first approach, one of the most commonly employed features comprises a set of cepstral coefficients derived from linear predictive coding techniques (LPCC). This approach uses all-pole linear filters which simulate the human vocal tract. A narrow band (e.g., 4 kHz) LPCC feature works fairly well in the recognition of speech produced in a "clean" noise-free environment, but experiments have shown that such an approach results in large distortions in noisy environments, thereby causing a severe degradation of the ASR system performance.
It is generally accepted that improved performance in an ASR system which needs to be robust in noisy environments can be better achieved with use of the second approach, wherein the human auditory perception system is modeled. For this class of techniques, the most common feature comprises the set of cepstral coefficients derived from the outputs of a bank of filters placed on a mel frequency scale (MFCC), familiar to those of ordinary skill in the art. The filters are typically triangular in shape and operate in the frequency domain. Note that the mel frequency scale is similar to the frequency response of the human cochlea. Like the LPCC feature, the MFCC feature works very well in "clean" environments, and although its performance in "adverse" (i.e., noisy) environments may be superior to that of LPCC, ASR systems which have been implemented using the MFCC feature have still not provided adequate performance under many adverse conditions.
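By way of illustration only, the placement of triangular filters on a mel frequency scale may be sketched as follows. The conversion formula shown is one commonly used approximation of the mel scale, and the function names, the number of filters, and the frequency range are assumptions chosen for the example rather than values taken from any particular MFCC implementation.

```python
import math
from typing import List, Tuple


def hz_to_mel(f_hz: float) -> float:
    """One commonly used approximation of the mel scale (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(n_filters: int, f_low: float, f_high: float) -> List[Tuple[float, float, float]]:
    """Return (left, centre, right) edges in Hz of triangular filters
    whose edges are equally spaced on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    points = [
        mel_to_hz(m_low + (m_high - m_low) * i / (n_filters + 1))
        for i in range(n_filters + 2)
    ]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(n_filters)]


# Hypothetical configuration: 10 filters covering 0 to 4 kHz.
edges = mel_filter_edges(10, 0.0, 4000.0)
widths_hz = [right - left for left, _, right in edges]
```

Although the filters in this sketch are equally spaced in mel, their widths in Hz grow with center frequency, so each triangle has a different shape when expressed on the linear frequency axis.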
Perceptual linear predictive (PLP) analysis is another auditory-based approach to feature extraction. It uses several perceptually motivated transforms including Bark frequency, equal-loudness pre-emphasis, masking curves, etc. In addition, the relative spectra processing technique (RASTA) has been further developed to filter the time trajectory in order to suppress constant factors in the spectral component. It has often been used together with the PLP feature, which is then referred to as the RASTA-PLP feature. Like techniques which use the MFCC feature, the use of these techniques in implemented ASR systems has often provided unsatisfactory results in many noisy environments.
Each of the above features is typically based on a Fast Fourier Transform (FFT) to convert speech waveforms from a time domain representation to a frequency domain representation. In particular, however, note that the FFT and other typical frequency transforms produce their results on a linear frequency scale. Thus, each of the above perception-based approaches necessarily must perform the filtering process essentially as does the human cochlea, that is, with a complex set of filters differentially spaced in frequency, for example, in accordance with a mel or Bark scale. Moreover, the filters must be individually shaped depending on the particular filter's location along the scale.
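The mismatch between the linear frequency scale of the FFT and a perceptual scale may be illustrated with the following minimal sketch. It assumes the Zwicker-Terhardt approximation of the Bark scale, a 256-point transform, and an 8 kHz sampling rate; these choices, and the function names, are illustrative assumptions only.

```python
import math
from typing import List


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale (an assumed choice)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def fft_bin_frequencies(n_fft: int, sample_rate: float) -> List[float]:
    """Frequencies in Hz of the non-negative FFT bins, which are linearly spaced."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


freqs = fft_bin_frequencies(n_fft=256, sample_rate=8000.0)
barks = [hz_to_bark(f) for f in freqs]

# Spacing between adjacent bins at the low and high ends of the spectrum.
hz_step_low = freqs[1] - freqs[0]
hz_step_high = freqs[-1] - freqs[-2]
bark_step_low = barks[1] - barks[0]
bark_step_high = barks[-1] - barks[-2]
```

In this sketch the bins are uniformly spaced in Hz, but adjacent bins are several times farther apart in Bark at low frequencies than at high frequencies. A filter of fixed perceptual width therefore spans a different number of FFT bins at each location, which is why filters defined directly on the linear scale must be shaped individually.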
Because of the high degree of complexity in developing filter sets for each of these approaches, it has proven to be very difficult to implement ASR systems which perform well in various noisy environments. In particular, such ASR systems cannot be easily modified (i.e., "tuned") to optimize their performance in different acoustic environments. As such, it would be advantageous to derive an auditory-based speech feature which includes a filter set of reduced overall complexity, thereby allowing for the design and implementation of a relatively easily tunable ASR system whose operation can be optimized in a variety of (e.g., noisy) acoustic environments.
In accordance with the principles of the present invention, an auditory-based speech feature is provided which advantageously includes a filtering scheme which can be easily tuned for use in ASR in a variety of acoustic environments. In particular, the present invention provides a method and apparatus for extracting speech features from a speech signal in which the linear frequency spectrum of the speech signal, as generated, for example, by a conventional frequency transform, is first converted to a logarithmic frequency spectrum having frequency data distributed on a substantially logarithmic (rather than linear) frequency scale. Then, a plurality of filters is applied to the resultant logarithmic frequency spectrum, each of these filters having a substantially similar mathematical shape, but centered at different points on the logarithmic frequency scale. Because each of the filters has a similar shape, an ASR system incorporating the feature extraction approach of the present invention advantageously can be modified or tuned easily, by adjusting each of the filters in a coordinated manner and requiring the adjustment of only a handful of filter parameters.
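The two steps described above may be sketched as follows, purely by way of example. The sketch assumes linear interpolation for the conversion to the logarithmic scale, the Zwicker-Terhardt approximation of the Bark scale, and a hypothetical five-point filter kernel; none of these particulars is prescribed by the foregoing description, and all function and parameter names are illustrative.

```python
import math
from typing import List, Sequence


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale (an assumed choice)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def resample_to_bark(power: Sequence[float], sample_rate: float, n_points: int) -> List[float]:
    """Linearly interpolate a linear-frequency spectrum onto a uniform Bark grid."""
    n_bins = len(power)
    nyquist = sample_rate / 2.0
    barks = [hz_to_bark(k * nyquist / (n_bins - 1)) for k in range(n_bins)]
    z_max = barks[-1]
    out = []
    j = 0
    for i in range(n_points):
        z = z_max * i / (n_points - 1)
        # Advance to the pair of input bins that brackets Bark value z.
        while j < n_bins - 2 and barks[j + 1] < z:
            j += 1
        w = min(1.0, (z - barks[j]) / (barks[j + 1] - barks[j]))
        out.append((1.0 - w) * power[j] + w * power[j + 1])
    return out


def apply_shared_filter(spectrum: Sequence[float], kernel: Sequence[float], n_filters: int) -> List[float]:
    """Apply one odd-length kernel shape at n_filters equally spaced centres."""
    half = len(kernel) // 2
    lo, hi = half, len(spectrum) - 1 - half
    outputs = []
    for m in range(n_filters):
        centre = round(lo + (hi - lo) * m / (n_filters - 1))
        window = spectrum[centre - half: centre + half + 1]
        outputs.append(sum(k * s for k, s in zip(kernel, window)))
    return outputs


# Hypothetical inputs: a flat 129-bin power spectrum sampled at 8 kHz.
power = [1.0] * 129
bark_spectrum = resample_to_bark(power, sample_rate=8000.0, n_points=64)
kernel = [0.25, 0.5, 1.0, 0.5, 0.25]  # one shape shared by every filter
outputs = apply_shared_filter(bark_spectrum, kernel, n_filters=8)
```

Because every filter in this sketch uses the same kernel, retuning the entire bank amounts to changing that one kernel together with the number and spacing of the centers, rather than redesigning each filter separately.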
In accordance with one illustrative embodiment of the present invention, the frequency transform is the FFT, the substantially logarithmic frequency scale is a Bark scale, and the plurality of filters are distributed (i.e., centered) at equal distances along the Bark scale. Also in accordance with this illustrative embodiment of the present invention, an outer and middle ear transfer function is applied to the frequency data prior to the conversion of the frequency spectrum from a linear frequency scale to the substantially logarithmic frequency scale, wherein the outer and middle ear transfer function advantageously approximates the signal processing performed by the combination of the human outer ear and the human middle ear. In addition, and also in accordance with this illustrative embodiment of the present invention, a logarithmic nonlinearity is advantageously applied to the outputs of the filters, and is followed by a discrete cosine transform (DCT) which advantageously produces DCT coefficients for use as speech features in an ASR system.
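The final two stages of this illustrative embodiment, namely the logarithmic nonlinearity and the DCT, may be sketched as follows. The sketch assumes a natural logarithm, a small floor value to avoid taking the logarithm of zero, and an unnormalized type-II DCT; these details, and the function and parameter names, are assumptions made for the example.

```python
import math
from typing import List, Sequence


def log_dct_features(filter_outputs: Sequence[float], n_coeffs: int, floor: float = 1e-10) -> List[float]:
    """Apply a logarithmic nonlinearity to the filter outputs, then an
    unnormalized type-II DCT, keeping the first n_coeffs coefficients."""
    log_energies = [math.log(max(x, floor)) for x in filter_outputs]
    m_total = len(log_energies)
    return [
        sum(
            v * math.cos(math.pi * n * (m + 0.5) / m_total)
            for m, v in enumerate(log_energies)
        )
        for n in range(n_coeffs)
    ]


# Hypothetical input: eight identical filter outputs, each equal to e.
features = log_dct_features([math.e] * 8, n_coeffs=4)
```

For the identical filter outputs used here, all of the energy falls in the zeroth coefficient and the higher coefficients are approximately zero, reflecting the fact that the DCT separates the overall level of the log spectrum from its shape.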