1. Technical Field
The present invention relates to the field of acoustic signal classification. The present invention economically classifies an unknown acoustic signal into one of a predetermined number of categories.
2. Background Art
A sound recognition system receives analog signals from an acoustic sensor such as a piezoelectric crystal or a microphone.
The analog signal is typically conditioned in three steps. The first step is digitization by a conventional analog-to-digital converter. The signal is next divided into short-time frames by Hamming windows. Finally, Linear Predictive Coding (LPC) analysis is performed using the autocorrelation method to provide a short-time spectral model, which is represented as an LPC vector.
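The windowing and LPC steps described above can be sketched as follows. This is a minimal illustration, not the conditioner of any particular system: the function names, the use of non-overlapping frames, and the choice of the Levinson-Durbin recursion to solve the autocorrelation equations are assumptions made for the example, and the frame length is given in samples because no sampling rate is stated here.

```python
import math

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc_autocorr(frame, order):
    """LPC vector [1, a1, ..., a_order] via the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predicted frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def condition(samples, frame_len, order):
    """One LPC vector per non-overlapping Hamming-windowed frame (assumed framing)."""
    w = hamming(frame_len)
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = [samples[start + i] * w[i] for i in range(frame_len)]
        vectors.append(lpc_autocorr(frame, order))
    return vectors
```

Each returned vector has order + 1 elements, the first of which is always 1.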
A typical approach to sound recognition after the signal conditioning has been the use of Vector Quantization (VQ) followed by Hidden Markov Models (HMMs). The VQ step provides data compression by converting each natural LPC vector into the closest one of a few previously generated artificial LPC vectors, each of which is given an integer index. The natural LPC vector for each short-time frame is replaced, for that time frame, by the appropriate index. Each sound category of interest is represented by a unique HMM. The sequence of indexes generated by VQ is used as an input to evaluate each HMM. The sound category corresponding to the HMM with the highest score is chosen as the identity of an unknown sound.
A typical sound recognition system receives an analog signal from an acoustic sensor. The signal is conditioned by performing analog-to-digital conversion followed by Hamming windowing, with windows typically 6.4 msec long for non-speech sounds and 20 msec long for speech sounds, and finally by LPC analysis. Any convenient signal conditioner may be used. For non-speech sounds the LPC order, that is, one less than the number of elements in the LPC vector, is typically 4, while for speech sounds the order is typically 10.
The typical sound recognition system operates in two modes: a training mode and a test mode. During training mode the sound recognition system learns characteristics of the sound categories of interest, that is, it makes an indexed list of the best artificial LPC vectors, and a list of the HMMs which best correlate these artificial vectors with the known categories of sound. Once trained, the system is ready to classify unknown sounds in the test mode, that is, to convert the sequence of LPC vectors under test to a sequence of artificial vector indexes, and to then select the HMM which best fits this sequence.
During the training mode, a data base called the training data base is used. The training data base is partitioned as follows. An occurrence of a sound is called a token and is typically a few seconds long. Each sound category is typically represented by a hundred tokens. For example, suppose that three categories of sound are to be distinguished: a drawer opening, a glass being filled with water, and a tool being dropped. We begin by opening a drawer a hundred times and recording each opening; each recording is a token. Likewise, a glass is filled (and emptied, between recordings) a hundred times, each recording being a token; and a tool is dropped a hundred times, each recording being a token. The tokens for a given category are called the training set for that category; thus there are as many training sets as categories. In the above example there are three training sets, one for each sound, each having a hundred tokens, and together they form a training data base of three hundred tokens.
In VQ, a prespecified, relatively small number of artificially generated LPC vectors are created to represent all of the LPC vectors (one for each short-time window of noise in each token) in all of the tokens in the training data base, regardless of sound category. That is, a window of noise may appear in more than one category of sound. The 20 milliseconds of noise of water first contacting a glass may be functionally identical to the 20 milliseconds of noise of a tool first contacting the floor. A typical LPC vector representing this common noise will be artificially generated if it (or the LPC vectors close enough to it) appears frequently enough in the various training sets. These few artificially generated LPC vectors can be considered averages of all the LPC vectors in the training data base. Each of these few LPC vectors is called a codeword; a collection of codewords is called a codebook. During VQ training, the codewords are selected by using the entire training data base as input to the signal conditioner, then processing the output of the signal conditioner, which is a sequence of LPC vectors, according to the well known Linde, Buzo, and Gray (LBG) clustering algorithm. The final result is a codebook of codewords, each codeword being an LPC vector. Each codeword is assigned an arbitrary index, for example 0 to 127 when the codebook has 128 codewords, as is typical. Each LPC vector has a number of dimensions equal to the LPC order plus 1. The large number of naturally occurring LPC vectors is thus quantized into a small number of codewords, and vector quantization training is complete.
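The codebook training described above can be illustrated with a simplified splitting form of LBG clustering. This is a sketch under stated assumptions rather than the procedure of any particular system: squared Euclidean distance is used as the distortion measure, the codebook size is assumed to be a power of two, and the names lbg, nearest, eps, and tol are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(codebook, vec):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vec))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def lbg(vectors, size, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook by repeated splitting; size is assumed to be a power of two."""
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into a slightly perturbed pair.
        codebook = [[x * (1 + s * eps) for x in cw]
                    for cw in codebook for s in (1, -1)]
        prev = float("inf")
        for _ in range(max_iter):
            cells = [[] for _ in codebook]
            total = 0.0
            for v in vectors:
                i = nearest(codebook, v)
                cells[i].append(v)
                total += sq_dist(codebook[i], v)
            # Move each codeword to the mean of its cell; keep it if the cell is empty.
            codebook = [centroid(c) if c else cw for c, cw in zip(cells, codebook)]
            if prev - total <= tol * max(total, 1e-12):
                break  # distortion has stopped improving
            prev = total
    return codebook
```

The position of each codeword in the returned list serves as its index, so a codebook of 128 codewords yields indexes 0 to 127.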
Once the VQ training is done, the HMM training begins. HMM training differs from VQ training in that only one training set at a time is used, not all training sets. That is, the HMM for each category of sound is trained independently of the HMM for each other category of sound. An HMM for a given sound category is trained as follows. The training set corresponding to that sound category is selected. Each token from the set is used as an input to the signal conditioner, which generates a sequence of LPC vectors, to be used as training vectors. This sequence of LPC training vectors is then compared, one vector at a time, with the codewords in the VQ codebook. The closest matching codeword to each LPC training vector is found and the index of the codeword is the output of the VQ process. Each token is thus reduced to a sequence of indexes. The sequence of the indexes generated from the entire training set of tokens is used as the input to the HMM training algorithm, according to the well known Baum-Welch re-estimation algorithm. HMMs for each sound category are trained in turn.
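The reduction of each token to a sequence of indexes, as described above, might look like the following. The sketch assumes squared Euclidean distance as the measure of closeness and represents a training set as a list of tokens, each token being a list of LPC vectors; the function names are hypothetical. The Baum-Welch re-estimation step itself is not shown.

```python
def quantize(lpc_vectors, codebook):
    """Replace each LPC vector by the index of its nearest codeword."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vec))
            for vec in lpc_vectors]

def training_sequences(training_set, codebook):
    """One index sequence per token in a single category's training set."""
    return [quantize(token_vectors, codebook) for token_vectors in training_set]
```

The resulting index sequences for one category would then be passed to the HMM training algorithm for that category.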
Once HMM training has also been completed, the sound recognition system is ready to accept an unknown signal as an input. An unknown signal, itself a token, is input to the signal conditioner and the output is a sequence of LPC vectors. The LPC vectors are used as the input to the VQ. The output of the VQ is a sequence of indexes, that is, the indexes of the previously generated artificial LPC vectors. While the unknown signal could be used to modify the artificially generated LPC vectors, it is preferred not to do so, since the unknown signal generally has considerable background noise, which is absent from the training tokens. The indexes are then input to the HMM comparison function. One HMM at a time is selected to be evaluated using the sequence of VQ indexes and the well known Viterbi algorithm. The output is a score, which is actually the probability, or, if desired, the log of the probability, that the HMM is the correct one. Each HMM is evaluated in turn, and the HMM with the highest score is selected. The sound category corresponding to that selected HMM is the classification of the unknown signal.
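The evaluation and selection steps described above can be sketched with a log-domain Viterbi recursion over a discrete HMM. The representation is an assumption made for illustration: each model is a tuple (pi, A, B) of initial probabilities, transition probabilities, and per-state emission probabilities over codeword indexes, and the score is taken to be the log probability of the best state path. The function names are hypothetical.

```python
import math

def log(x):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(x) if x > 0.0 else float("-inf")

def viterbi_log_score(obs, pi, A, B):
    """Log probability of the best state path for a non-empty index sequence obs."""
    n = len(pi)
    delta = [log(pi[s]) + log(B[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        delta = [max(delta[p] + log(A[p][s]) for p in range(n)) + log(B[s][o])
                 for s in range(n)]
    return max(delta)

def classify(obs, models):
    """Return the category whose HMM gives the highest score on obs."""
    return max(models, key=lambda name: viterbi_log_score(obs, *models[name]))
```

Here models is a dictionary mapping each sound category name to its trained (pi, A, B) tuple, so classify evaluates each HMM in turn and returns the category with the highest score.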