In recent years, automatic speech recognition (ASR) systems have been employed in a wide variety of areas, such as telephone dialing, directory assistance, order entry, home banking, database inquiry, and dictation. Cellular telephones, for example, commonly employ ASR systems to simplify the user interface. Using ASR systems, many cellular telephones recognize and execute commands to initiate an outgoing phone call or answer an incoming phone call. A cellular telephone having an ASR system may, for instance, recognize a spoken name from a phone book or a contact list and automatically initiate a phone call to the phone number associated with the spoken name.
In an ASR system, a user speaks into a microphone (i.e., inputs a speech signal). The input analog signal is digitized, and blocks of the digital data are then transformed from the time domain into the frequency domain using a digital signal processing (DSP) chip. Once the ASR system has digitized the signal and calculated certain parameters, it compares the signal to a library of known phrases and finds the closest match.
To extract features from the signal for comparison with data in the library, such ASR systems generally use short-term spectral features, such as mel-frequency cepstral (i.e., frequency-related) coefficients (MFCCs). MFCCs are based on a Fast Fourier Transform (FFT), which converts the input signal from a time-domain representation to a frequency-domain representation. The MFCC representation is one example of an approach that further analyzes the FFT of the signal: it is generated using a mathematical transformation called the cepstrum, which is the inverse Fourier transform of the log-spectrum of the speech signal.
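The cepstrum described above can be illustrated with a minimal sketch, assuming NumPy is available. The function name `real_cepstrum`, the 16 kHz sample rate, the 25 ms Hamming-windowed test tone, and the small `eps` floor are all illustrative assumptions and are not taken from any particular ASR system. A full MFCC front end typically also applies a mel-scaled filterbank before the logarithm and uses a discrete cosine transform in place of the inverse FFT; those steps are omitted here.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Return the inverse FFT of the log-magnitude spectrum of one frame."""
    spectrum = np.fft.rfft(frame)              # time domain -> frequency domain
    log_mag = np.log(np.abs(spectrum) + eps)   # log-spectrum (eps avoids log(0))
    return np.fft.irfft(log_mag, n=len(frame)) # back to the "quefrency" domain

# Hypothetical test frame: 25 ms of a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.hamming(400) * np.sin(2 * np.pi * 200 * t)
cepstrum = real_cepstrum(frame)
```

Because the log-magnitude spectrum is real, the resulting cepstrum is real and symmetric, which is one reason practical systems keep only the first few coefficients.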
These ASR systems uniformly employ short-time spectral analysis, usually over windows of about 10 to 30 milliseconds, as the basis for acoustic representations. However, the detailed time structure below this timescale is lost, and the time structure above it is only weakly represented in the form of deltas. The temporal structure in sub-10 millisecond transient segments contains important cues for both the perception of natural sounds and the understanding of stop bursts in speech. The gross temporal distribution of acoustic energy in windows of up to 1 second is a successful domain for the recognition of complete phonemes and the description of their dynamics. Thus, while the spectral structures resulting from the spectral analysis convey important linguistic information, they are only a partial representation of speech signals.
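The loss of fine time structure inside an analysis window can be seen in a short sketch, assuming NumPy. The helper `frame_signal`, the 25 ms window, the 10 ms hop, and the 16 kHz sample rate are hypothetical choices made only for illustration. The example splits a signal into overlapping frames, takes the magnitude spectrum of each, and then shows that a click early in a frame and a click late in the same frame yield identical magnitude spectra.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (assumes len(signal) >= window)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

sample_rate = 16000
signal = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of noise
frames = frame_signal(signal, sample_rate)
window = np.hamming(frames.shape[1])
magnitudes = np.abs(np.fft.rfft(frames * window, axis=1))

# Two unwindowed frames that differ only in when a click occurs.
early_click = np.zeros(400)
early_click[10] = 1.0
late_click = np.zeros(400)
late_click[300] = 1.0
same_magnitude = np.allclose(
    np.abs(np.fft.rfft(early_click)), np.abs(np.fft.rfft(late_click))
)
```

The timing of the click survives only in the phase of the spectrum, which magnitude-based short-time features discard.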
Other feature extraction techniques, such as dynamic (delta) features and relative spectral (RASTA) processing, have been adopted as post-processing techniques that operate on sequences of the short-term feature vectors. Such techniques provide a “locally-global” view in which the features used in classification are based upon a speech segment of about one syllable's length.
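As a rough illustration of dynamic (delta) features, the sketch below uses one common regression-based formulation, assuming NumPy. The function name `delta`, the `width` of two frames on each side, and the edge padding are assumptions for illustration and are not necessarily what any given ASR system uses. RASTA filtering is not shown.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based deltas for a (n_frames, n_coeffs) feature matrix."""
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has neighbors.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]   # frame t + n
        behind = padded[width - n : width - n + n_frames]  # frame t - n
        out += n * (ahead - behind)
    return out / denom

# Hypothetical feature tracks that rise by 1.0 and 2.0 per frame.
features = np.outer(np.arange(10), [1.0, 2.0])
deltas = delta(features)
```

Away from the padded edges, each delta equals the per-frame slope of its feature track, so the vector summarizes how the short-term features change across several neighboring frames.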
Accordingly, it is desirable to provide systems and methods that overcome these and other deficiencies of the prior art.