1. Field of the Invention
This invention relates generally to electronic voice processing systems, and relates more particularly to a system and method for voice signal classification based on statistical regularities in voice signals.
2. Description of the Background Art
Speech recognition systems may be used for interaction with a computer or other device. Speech recognition systems usually translate a voice signal into a text string that corresponds to instructions for the device. FIG. 1 is a block diagram of a speech recognition system of the prior art. The speech recognition system includes a microphone 110, an analog-to-digital (A/D) converter 115, a feature extractor 120, and a speech recognizer 125, and produces a text string 130. Microphone 110 receives sound energy via pressure waves (not shown). Microphone 110 converts the sound energy to an electronic analog voice signal and sends the analog voice signal to A/D converter 115. A/D converter 115 samples and quantizes the analog voice signal, converting it to a digital voice signal. Typical sampling frequencies are 8 kHz and 16 kHz. A/D converter 115 then sends the digital voice signal to feature extractor 120.
Typically, feature extractor 120 segments the digital voice signal into consecutive data units called frames, and then extracts features that are characteristic of the voice signal of each frame. Typical frame lengths are ten, fifteen, or twenty milliseconds. Feature extractor 120 performs various operations on the voice signal of each frame. Operations may include transforming the voice signal into a spectral representation by mapping it from the time domain to the frequency domain via a Fourier transform, suppressing noise in the spectral representation, converting the spectral representation to a spectral energy or power signal, and performing a second Fourier transform on the spectral energy or power signal to obtain cepstral coefficients. The cepstral coefficients represent characteristic spectral features of the voice signal. Typically, feature extractor 120 generates a set of feature vectors whose components are the cepstral coefficients. Feature extractor 120 sends the feature vectors to speech recognizer 125.
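By way of illustration only, the framing and cepstral feature extraction steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of feature extractor 120: the function names, the choice of twelve coefficients, and the logarithm applied to the power signal (a conventional step in cepstral analysis) are assumptions that do not appear in the description above, and the noise suppression step is omitted. A naive discrete Fourier transform is used so that the sketch needs only the standard library.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000  # one of the typical sampling frequencies named above
FRAME_MS = 10          # one of the typical frame lengths named above


def split_into_frames(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Segment a digital voice signal into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def dft(values):
    """Naive O(n^2) discrete Fourier transform, kept simple for clarity."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def cepstral_coefficients(frame, n_coeffs=12):
    """Map one frame to a feature vector of cepstral coefficients.

    n_coeffs=12 is a hypothetical choice made for this sketch.
    """
    # First transform: time domain -> frequency domain.
    spectrum = dft(frame)
    # Power signal, with a logarithm (assumption) and a small floor to avoid log(0).
    log_power = [math.log(abs(c) ** 2 + 1e-10) for c in spectrum]
    # Second transform applied to the (log) power signal.
    cepstrum = dft(log_power)
    # Keep the real parts of the lowest-order coefficients, scaled by frame length.
    return [c.real / len(frame) for c in cepstrum[:n_coeffs]]


# Usage: 30 ms of a synthetic 440 Hz tone yields three 10 ms frames at 8 kHz.
voice_signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE_HZ) for t in range(240)]
feature_vectors = [cepstral_coefficients(f) for f in split_into_frames(voice_signal)]
```

Each entry of `feature_vectors` corresponds to one frame. A practical system would use a fast Fourier transform rather than the quadratic-time transform shown here.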
Speech recognizer 125 includes speech models and performs a speech recognition procedure on the received feature vectors to generate the text string 130. For example, speech recognizer 125 may be implemented as a Hidden Markov Model (HMM) recognizer.
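By way of illustration only, the decoding step of an HMM recognizer can be sketched with the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. This is a toy sketch and not the implementation of speech recognizer 125: the two states, the two observation symbols, and all probabilities below are hypothetical values chosen for the example, whereas a practical recognizer would operate on the feature vectors and on trained speech models.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[s] holds (log-probability, path) of the best path ending in state s.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states, key=lambda p: best[p][0] + log_trans[p][s])
            score = best[prev][0] + log_trans[prev][s] + log_emit[s][obs]
            step[s] = (score, best[prev][1] + [s])
        best = step
    final = max(states, key=lambda s: best[s][0])
    return best[final][1]


def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: (to_log(v) if isinstance(v, dict) else math.log(v))
            for k, v in table.items()}


# Hypothetical toy model: silence versus speech, observed as low or high energy.
STATES = ["sil", "speech"]
LOG_START = to_log({"sil": 0.8, "speech": 0.2})
LOG_TRANS = to_log({"sil": {"sil": 0.7, "speech": 0.3},
                    "speech": {"sil": 0.2, "speech": 0.8}})
LOG_EMIT = to_log({"sil": {"lo": 0.9, "hi": 0.1},
                   "speech": {"lo": 0.2, "hi": 0.8}})

path = viterbi(["lo", "lo", "hi", "hi"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Log-probabilities are used so that long observation sequences do not underflow when many small probabilities are multiplied together.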
Speech recognition systems translate voice signals into text; however, speaker-independent speech recognition systems are generally rigid, inaccurate, and computationally intensive, and are not able to recognize true natural language. For example, typical speech recognition systems have a voice-to-text translation accuracy rate of 40% to 50% when processing true natural language voice signals. It is difficult to design a highly accurate natural language speech recognition system that generates unconstrained voice-to-text translation in real time, due to the complexity of natural language, the complexity of the language models used in speech recognition, and the limits on computational power.
In many applications, the exact text of a speech message is unimportant, and only the topic of the speech message needs to be recognized. It would be desirable to have a flexible, efficient, and accurate speech classification system that categorizes natural language speech based upon the topics that make up a speech message. In other words, it would be advantageous to implement a speech classification system that categorizes speech based upon what is talked about, without generating an exact transcript of what is said.