Accurate identification of statistically stationary units in a continuous signal can lead to a substantial reduction in computational costs while processing the signal. Statistically stationary units are discrete portions of the continuous signal that have characteristics which can statistically be described in a similar manner.
The identification of the stationary units requires the location of segment boundaries. If the segment boundaries are correctly hypothesized, then the effort required to correlate information related to the units is greatly reduced. Segmentation is particularly difficult where there is little prior knowledge about the underlying content of the signal.
For example, in a speech recognition system, a continuous signal is processed to determine what has been spoken. Segmentation of the signal into statistically stationary units is an important sub-process in a segment-based speech processing system. Segmentation identifies possible boundaries of portions of the signal which are likely to correspond to linguistic elements.
In such a system, if the segment boundaries are correctly hypothesized, then the time to search a database for corresponding linguistic elements is greatly reduced.
Most signal processing systems receive the signal in a continuous analog form. The analog signal is typically sampled at a fixed rate to produce a sequence of digital samples which can be processed by a computer system.
One prior art segmentation technique, as described by R. Andre-Obrecht in Automatic Segmentation of Continuous Speech Signals, Proceedings of IEEE-IECEJ-ASJ International Conference on Acoustics, Speech, and Signal Processing, Vol. 3, pp. 2275-2278, April 1986, uses a statistical approach to detect spectral changes in the continuous signal. The technique processes the signal sample-by-sample using three windows.
A first window is a growing window which starts at the first sample after the time of the last detected change and ends at the current measurement. Thus, the first window includes all of the measurements after the last detected change. A second window starts at the first sample after the time of the last detected change, and ends a fixed L samples before the current measurement. Thus, the second window overlaps the first window for all of the samples except the last L samples. A third window starts after the second window, and ends with the current measurement. Thus, the second window combined with the third window includes all of the measurements included in the first window without any overlapping.
The technique uses these three windows to compute a sequential likelihood ratio test on the samples within the windows. The likelihood that all of the measurements since the last detected change belong to one statistical unit is computed using the first window. This likelihood is compared with the likelihood that the measurements belong to two statistical units, with the change occurring L samples before the current measurement. In the likelihood ratio test, the first window encodes the null hypothesis of no change in the samples, while the second and third windows encode the change hypothesis.
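The three-window test described above can be illustrated with a minimal sketch. It is not the cited implementation: as a simplifying assumption, each window is modeled as independent Gaussian samples with its own mean and variance, and the names gaussian_log_likelihood, change_statistic, detect_changes, lag, and threshold are hypothetical. Here lag plays the role of the fixed length L.

```python
import math


def gaussian_log_likelihood(samples):
    """Maximized log-likelihood of samples under one i.i.d. Gaussian model."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    var = max(var, 1e-12)  # guard against log(0) on constant windows
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def change_statistic(samples, last_change, current, lag):
    """Log-likelihood ratio of 'change lag samples ago' against 'no change'.

    last_change is the index of the first sample after the last detected change.
    """
    first = samples[last_change:current + 1]         # growing window, null hypothesis
    second = samples[last_change:current + 1 - lag]  # before the hypothesized change
    third = samples[current + 1 - lag:current + 1]   # the last lag samples
    return (gaussian_log_likelihood(second)
            + gaussian_log_likelihood(third)
            - gaussian_log_likelihood(first))


def detect_changes(samples, lag=8, threshold=10.0):
    """Advance the windows sample by sample and record hypothesized boundaries."""
    boundaries = []
    last_change = 0
    for current in range(len(samples)):
        # Assumed rule: require at least lag samples in the second and third windows.
        if current + 1 - last_change < 2 * lag:
            continue
        if change_statistic(samples, last_change, current, lag) > threshold:
            boundary = current + 1 - lag
            boundaries.append(boundary)
            last_change = boundary  # samples before this point are never re-examined
    return boundaries
```

Because the boundary is placed a fixed lag behind the sample at which the threshold is first crossed, this sketch can report a boundary several samples away from the true change, and it never revisits that decision.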
The samples are sequentially processed in the temporal order of the signal by advancing each of the three windows. In a variant, the samples of the signal are processed both forward and backward in time, and the resulting segment boundaries are combined to form one segmentation.
In another variant, a plurality of windows can be used for the change hypothesis. In this case, each window corresponds to one of a plurality of lengths L. All variants of this technique tend to be computationally intensive since they work directly on the individual samples. Moreover, since the samples are processed in a temporal order, once samples have been identified with a particular segment, the samples are not re-examined. This sequential processing may generate erroneous boundaries.
In another segmentation approach, the samples of the signal are first grouped into a sequence of fixed-length overlapping frames. These frames are then converted to derived observation vectors by applying a windowing vector, typically a Hamming window, to each frame resulting in a sample vector. A fast Fourier transform is then applied to each sample vector to produce the final derived observation vectors. The overlapping of the frames results in substantial smoothing of spectral changes in the signal with time. This smoothing makes it more difficult to detect the changes. Furthermore, application of the windowing vector also results in a smoothing of the spectrum in the frequency domain. This also decreases the size of spectral changes.
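The frame-based front end described above can be sketched as follows. The frame length, the hop size, and the function names are assumptions chosen for illustration, and a direct discrete Fourier transform stands in for the fast Fourier transform so that the sketch needs no external library; it yields the same magnitudes, only more slowly.

```python
import cmath
import math


def hamming(n):
    """Hamming windowing vector of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(samples, frame_len, hop):
    """Fixed-length frames; consecutive frames overlap when hop < frame_len."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def dft_magnitudes(vector):
    """Magnitude spectrum for bins 0..n/2, computed by a direct DFT."""
    n = len(vector)
    return [
        abs(sum(vector[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def observation_vectors(samples, frame_len=32, hop=16):
    """Window each frame, then transform it into a derived observation vector."""
    window = hamming(frame_len)
    return [
        dft_magnitudes([w * x for w, x in zip(window, frame)])
        for frame in split_frames(samples, frame_len, hop)
    ]
```

With a hop of half the frame length, every sample away from the signal edges contributes to two consecutive observation vectors, which is one source of the temporal smoothing noted above.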
The parameters of the observation vectors can be Mel-frequency power spectral coefficients (MFSC), or Mel-frequency cepstral coefficients (MFCC) as described by P. Mermelstein and S. Davis in Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Trans ASSP, Vol. 23, No. 1, pages 67-72, February 1975.
The observation vectors can be combined using a hierarchical clustering technique, see for example, J. R. Glass, Finding Acoustic Regularities in Speech, Applications to Phonetic Recognition. Ph.D. Thesis. Department of Electrical Engineering and Computer Science, MIT. May 1988. In this technique, successive adjacent vectors are merged using some similarity metric. For example, the technique can determine the "difference" or distance between adjacent vectors. If the distance between any pair of adjacent vectors is less than some predetermined threshold, the vectors are merged to form a cluster. This process is repeated on the thus merged clusters until the distance between any two adjacent clusters is greater than the threshold. At this point the clusters can be identified with linguistic elements.
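The merging procedure can be sketched as below. Several details are assumptions made only for illustration and are not taken from the cited thesis: a plain Euclidean distance is used, each cluster is represented by the mean of its vectors, and the closest adjacent pair is merged first. The function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(vectors):
    """Component-wise mean, used here as the cluster representative."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_adjacent(vectors, threshold):
    """Merge adjacent clusters until every adjacent distance reaches threshold.

    Returns a list of (start, end) index ranges into vectors, end exclusive.
    """
    clusters = [(i, i + 1) for i in range(len(vectors))]
    while len(clusters) > 1:
        centroids = [mean_vector(vectors[s:e]) for s, e in clusters]
        distances = [
            euclidean(centroids[i], centroids[i + 1])
            for i in range(len(clusters) - 1)
        ]
        best = min(range(len(distances)), key=distances.__getitem__)
        if distances[best] >= threshold:
            break  # no adjacent pair is close enough to merge
        merged = (clusters[best][0], clusters[best + 1][1])
        clusters[best:best + 2] = [merged]
    return clusters
```

Only neighbors in time are ever compared, so each resulting cluster is a contiguous run of observation vectors and the cluster edges serve as the hypothesized segment boundaries.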
For observation vectors expressed with MFCCs, the measure of difference can be a normalized distance. For example, the normalized distance between two measurement vectors x and y is: ##EQU1##
Slightly better results can be obtained if a weighted Euclidean distance is measured between the logarithms of the MFSCs. The problem with this type of clustering is that some of the information present in the raw digital samples is lost in the derived observation vectors, leading to less than optimal segmentation results.
It is desired to directly segment a continuous signal without initially reducing the signal to a sequence of derived observation vectors using overlapping frames. Furthermore, it is desired to segment a signal without having prior knowledge about the content of the signal. In addition, it is desired to segment the signal such that transcription error rates are reduced.