The present invention generally relates to an apparatus for extracting features from a speech signal and, in particular, relates to one such apparatus that employs a polyphase digital filterbank for extracting a spectral envelope from a speech signal.
In the field of speech recognition and/or speaker verification as opposed to, for example, any revocalization of a spoken word, a relatively small number of features are required for the desired identification. However, in order to provide a reliable system, the extraction of those features must be accomplished accurately and consistently.
The accurate and consistent extraction of spectral features is, to a very large degree, dependent on a filterbank. That is, an analog speech signal representing a spoken word has an amplitude that changes with both frequency and time. Such a signal is sampled in both the time and frequency domains. The frequency domain samples, at each sampling time, contain the primary spectral features of interest. Thus, in order to extract such features, for each time sampled signal, the frequency domain signal is formed by filtering.
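By way of illustration only, the extraction of a coarse spectral envelope from one time-sampled frame may be sketched as follows. This sketch is an assumption for explanatory purposes: it substitutes a direct discrete Fourier transform for the bank of band pass filters described above, and the names `band_energies`, `SAMPLE_RATE_HZ` and `FRAME_LEN`, as well as the 8 kHz rate and 64-sample frame, are hypothetical and do not appear in the present disclosure.

```python
import cmath
import math


def band_energies(frame, num_bands):
    """Return the energy in each of num_bands equal-width frequency bands.

    Illustrative stand-in for a filterbank: a direct O(n^2) DFT is used for
    clarity, and adjacent DFT bins are summed into bands.
    """
    n = len(frame)
    half = n // 2  # keep only bins below the Nyquist frequency
    power = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    width = half // num_bands  # DFT bins per band
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE_HZ = 8000
FRAME_LEN = 64
tone_hz = 1000.0
frame = [math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE_HZ)
         for i in range(FRAME_LEN)]

envelope = band_energies(frame, num_bands=4)
strongest = max(range(len(envelope)), key=envelope.__getitem__)
print("band energies:", [round(e, 3) for e in envelope])
print("strongest band index:", strongest)
```

With these assumed values each band spans 1 kHz, so the energy of the tone falls in the second band (index 1). Repeating the computation for each successive frame yields the frequency domain samples at each sampling time.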
Until recently, filterbanks for speech recognition systems have been implemented using analog filter theory and technology. Analog filterbanks usually perform somewhat poorly. This poor performance is primarily due to the inherent limitations of analog components, i.e., analog components are inherently very difficult to reproduce with the accuracy necessary for speech recognition applications. In addition, the values of analog components inherently vary over time and are susceptible to such factors as temperature changes, surrounding radiation and the like. Thus, to provide an analog filterbank of acceptable quality, very precise, and correspondingly expensive, components must be used.
The relatively recent development of high speed digital signal processors has allowed the design and implementation of filterbanks based on digital filter theory and technology. The very nature of digital technology results in high performance digital filterbanks having exact response predictability. The performance of such digital filterbanks directly depends on the binary word length of the digital signal processor hardware used in the implementation thereof.
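The dependence of performance on binary word length may be illustrated by the familiar rule that an N-bit fixed-point word spans an ideal dynamic range of 20 log10(2^N), or roughly 6 dB per bit. The following is a minimal sketch under that textbook idealization; it ignores rounding and overflow within the filter arithmetic, and the function name `dynamic_range_db` and the word lengths shown are hypothetical choices made for illustration.

```python
import math


def dynamic_range_db(word_length_bits: int) -> float:
    """Ideal dynamic range of an N-bit word: 20 * log10(2 ** N)."""
    return 20.0 * math.log10(2.0 ** word_length_bits)


# Hypothetical word lengths chosen for illustration.
for bits in (8, 12, 16):
    print(f"{bits:2d} bits -> {dynamic_range_db(bits):.1f} dB")
```

A practical filterbank generally achieves less than this ideal figure, since intermediate results within the filter computation also consume word length.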
Nevertheless, it is not a straightforward task to design a high performance digital filterbank. For example, a modern digital signal processor operating at full capacity, using a conventionally designed digital filter and conventional techniques, provides a filterbank having a dynamic range of about 45 dB and a 14 band spectral envelope. Since the human voice has a dynamic range of about 45 dB, such performance characteristics are barely adequate for a reasonably accurate speech recognition/speaker verification system. That is, the above performance characteristics would require a user to speak in a monotone to avoid loss of information. The number of bands extracted is directly related to the resolution of the filterbank. Thus, the more bands, the greater the accuracy and consistency of the features extracted.
In addition to the general filterbank design difficulties, conventional speech recognition/speaker verification systems usually exhibit poor performance due to other difficulties. One difficulty results from the fact that filterbanks are composed of a set of nonoverlapping band pass filters, each having a finite transition band. Due to the somewhat periodic nature of a speech signal, the speech spectrum manifests a relatively strong fundamental pitch frequency. When this fundamental pitch frequency falls between adjacent bands, important spectral information is lost and the results become less accurate.
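The loss that occurs between adjacent bands may be illustrated with a simplified model. In the following sketch, each band pass filter is assumed to have a trapezoidal magnitude response that rolls off to zero at its own band edges, so that adjacent bands do not overlap. The band edges, the 20 Hz transition width and the names `band_gain` and `total_captured` are hypothetical and are chosen only to show the effect.

```python
def band_gain(freq_hz, lo_hz, hi_hz, transition_hz):
    """Trapezoidal gain confined to [lo_hz, hi_hz] (assumed filter model).

    Gain is 1.0 in the middle of the band and falls linearly to 0.0 over
    transition_hz at each edge, so neighboring bands never overlap.
    """
    if freq_hz <= lo_hz or freq_hz >= hi_hz:
        return 0.0
    rise = (freq_hz - lo_hz) / transition_hz
    fall = (hi_hz - freq_hz) / transition_hz
    return min(1.0, rise, fall)


# Two hypothetical adjacent bands sharing an edge at 200 Hz.
bands = [(100.0, 200.0), (200.0, 300.0)]
TRANSITION_HZ = 20.0


def total_captured(freq_hz):
    """Sum of the gains that all bands apply to a tone at freq_hz."""
    return sum(band_gain(freq_hz, lo, hi, TRANSITION_HZ) for lo, hi in bands)


for pitch_hz in (150.0, 195.0, 200.0):
    print(f"pitch {pitch_hz:5.1f} Hz -> captured gain {total_captured(pitch_hz):.2f}")
```

Under these assumptions, a pitch at the center of a band is passed at full gain, a pitch 5 Hz below the shared edge is passed at one quarter gain, and a pitch exactly at the shared edge is not captured by either band.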