(1) Field of the Invention
This invention generally relates to a method and system for identifying data segments within a signal by using naturally occurring boundaries in the signal and updating sample-by-sample.
More particularly, the invention is directed to solving the problem of dividing an input signal, such as an acoustic data signal or a speech signal, consisting of multiple “events” into frames where the signal within each frame is statistically “consistent”. Once the data has been segmented, detection and classification of events is greatly facilitated. In speech signals, for example, the data becomes segmented into phonetically constant frames or frames in which there are an integer number of pitch periods. This makes determination of pitch more accurate and reliable.
(2) Description of the Prior Art
Prior to this invention, it has not been known how to divide a time-series (signal) into segments at a resolution fine enough to resolve individual pitch interval boundaries. The current art for optimally segmenting a time-series consists of first segmenting the data into fixed-size segments, then performing a second stage of segmentation to group together numbers of the fixed-size segments into larger blocks. This approach has a resolution no finer than the size of the fixed-size segments.
Because speech signals contain features that are very short in duration, it would be preferable to segment the data to a finer resolution, such as to a resolution of one sample. The current art cannot be used to segment the data to a resolution of one sample because it requires first segmenting to fixed-size segments large enough to extract meaningful features. Furthermore, the existing dynamic-programming solution is computationally impractical because the data has to be processed at each delay and at each segment length.
Thus, a problem exists in the art whereby it is necessary to develop a computationally efficient and practical method of segmenting a signal containing multiple events into frames to a resolution of one sample, which is the resolution necessary to identify individual pitch intervals.
By way of example of the state of the art, reference is made to the following papers, which are incorporated herein by reference:
[1] Euler, S. A.; Juang, B. H.; Lee, C. H.; Soong, F. K., Statistical Segmentation and Word Modeling Techniques in Isolated Word Recognition, 1990 International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 745-748.
[2] Svendsen, F. Soong, On the Automatic Segmentation of Speech Signals, 1987 International Conference on Acoustics, Speech, and Signal Processing, pp. 77-80, Dallas, 1986.
[4] R. Kenefic, An Algorithm to Partition DFT Data into Sections of Constant Variance, IEEE Trans AES, July 1998.
Referring further to the current state of the art as developed in the field to date, it should be understood that detection and classification of short signals is a high priority for the Navy. Segmentation of a time series is a method that facilitates detection and classification.
In segmentation of short signals, the following is an illustration of the current state of the art. Let there be N samples x = [x_1 . . . x_N]. One would like to divide these samples into a number of segments, for example:
x = [x_1 . . . x_a][x_{a+1} . . . x_b][x_{b+1} . . . x_c][x_{c+1} . . . x_N],
such that the total score, Q, where:
Q = Q(x_1 . . . x_a) + Q(x_{a+1} . . . x_b) + Q(x_{b+1} . . . x_c) + Q(x_{c+1} . . . x_N),
is as high as possible.
To do this, the score function, Q(n,t), must be known for a segment of length n ending at time t. Assuming it is known, the problem is to find the best number of segments and their start times {a, b, c, d . . . }. The standard dynamic-programming approach disclosed in Bellman and also Soong, above, is to first compute the score for all possible segment lengths at all possible end-times. In other words, compute Q(n,t) for t = n_min . . . N and n = n_min . . . n_max, where n_min and n_max bound the range of allowed segment lengths. The problem is solved by starting at sample n_min, because the best solution for segmenting the data up to sample n_min is immediately known: it is just the value of the score function Q when n = n_min and t = n_min, that is, Q(n_min, n_min). Let this be called Q_b(n_min). The best solutions for later samples are then easily found as follows:
Q_b(t) = Q(n,t) + Q_b(t−n), maximized over n.
Since Q_b(t−n) was already computed, all of the necessary information is available. The value of n for this solution is also saved and is called n_b(t). This process proceeds until Q_b(N) and n_b(N) have been obtained. The problem is then solved. The maximum total score is Q_b(N) and the length of the last segment is n_b(N). The other segment lengths are found by working backwards. For example, the length of the next-to-last segment is n_b(N−n_b(N)), which was previously stored. This is the standard approach taught in the prior art.
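The recursion and the backward trace described above can be sketched as follows. This is a minimal illustration only, not the implementation of any cited reference. The function names `segment_dp` and `example_score` are hypothetical, and the score used here (negative squared deviation from the segment mean, less a fixed per-segment penalty) is an assumed stand-in for Q(n,t). The sketch initializes the best score of the empty prefix to zero, which gives the same result as starting from Q(n_min, n_min).

```python
import math

def example_score(seg, penalty=1.0):
    # Hypothetical stand-in for Q(n, t): rewards segments whose samples
    # are close to their own mean, and charges a fixed cost per segment.
    mean = sum(seg) / len(seg)
    return -sum((v - mean) ** 2 for v in seg) - penalty

def segment_dp(x, score, n_min, n_max):
    """Return (best total score, list of segment lengths) for x."""
    N = len(x)
    neg_inf = -math.inf
    qb = [neg_inf] * (N + 1)  # qb[t]: best total score for the first t samples
    nb = [0] * (N + 1)        # nb[t]: length of the last segment in that solution
    qb[0] = 0.0               # empty prefix: nothing to score
    for t in range(n_min, N + 1):
        for n in range(n_min, min(n_max, t) + 1):
            if qb[t - n] == neg_inf:
                continue      # the first t-n samples cannot be segmented
            q = score(x[t - n:t]) + qb[t - n]
            if q > qb[t]:
                qb[t], nb[t] = q, n
    if qb[N] == neg_inf:
        raise ValueError("no valid segmentation for these length limits")
    # Work backwards from the end to recover every segment length.
    lengths = []
    t = N
    while t > 0:
        lengths.append(nb[t])
        t -= nb[t]
    lengths.reverse()
    return qb[N], lengths

x = [0, 0, 0, 0, 5, 5, 5, 5, 5, 1, 1, 1]
best, lengths = segment_dp(x, example_score, n_min=2, n_max=8)
print(best, lengths)
```

Note that the score function is evaluated for every allowed length at every end-time, which is the cost that makes this approach expensive at fine resolution.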
In many problems, it is necessary not only to find the best segments but also to pick the best model for each segment. In speech, for example, it may be necessary to know if a segment is voiced or unvoiced speech, or it might be necessary to choose the best model order. Let p be an index that ranges over all possible models. To find the best combination of segment lengths and model indexes, first the score function Q(p,n,t) must be known. A slight modification is then made to the above procedure by carrying out the maximizations at each time over both n and p jointly.
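The joint maximization over segment length n and model index p can be sketched in the same way. Again this is only an illustrative sketch: `segment_dp_models` and `model_score` are hypothetical names, and the two models used ("zero", which assumes a zero-mean segment, and "mean", which fits the segment mean at a higher fixed cost) are assumed stand-ins for choices such as voiced versus unvoiced or different model orders.

```python
import math

def model_score(p, seg):
    # Hypothetical stand-in for Q(p, n, t).
    if p == "zero":
        # Model with no free parameter: samples assumed to be near zero.
        return -sum(v * v for v in seg) - 1.0
    # Model with one free parameter (the mean), charged a larger penalty.
    mean = sum(seg) / len(seg)
    return -sum((v - mean) ** 2 for v in seg) - 2.0

def segment_dp_models(x, score, models, n_min, n_max):
    """Return (best total score, list of (segment length, model) pairs)."""
    N = len(x)
    neg_inf = -math.inf
    qb = [neg_inf] * (N + 1)   # best total score for the first t samples
    choice = [None] * (N + 1)  # (n, p) of the last segment in that solution
    qb[0] = 0.0
    for t in range(n_min, N + 1):
        for n in range(n_min, min(n_max, t) + 1):
            if qb[t - n] == neg_inf:
                continue
            for p in models:   # maximize over n and p jointly
                q = score(p, x[t - n:t]) + qb[t - n]
                if q > qb[t]:
                    qb[t], choice[t] = q, (n, p)
    if qb[N] == neg_inf:
        raise ValueError("no valid segmentation for these length limits")
    segments = []
    t = N
    while t > 0:
        n, p = choice[t]
        segments.append((n, p))
        t -= n
    segments.reverse()
    return qb[N], segments

x = [0, 0, 0, 0, 5, 5, 5, 5, 5, 1, 1, 1]
print(segment_dp_models(x, model_score, ["zero", "mean"], n_min=2, n_max=8))
```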
What has been described so far is the standard approach taught by the Bellman and Soong references. The problem with applying the method to speech processing and other fields is that computing the score function is time-consuming and the method is not practical to apply sample-by-sample as data is acquired. Instead, it is necessary to apply the method at a coarse resolution defined by the frame-processing interval taught by Soong. Features of the data finer than the frame-processing interval are filtered out of the data.
As mentioned, sample-by-sample processing is normally impractical. If the score function is computed on samples [x_{t−n+1} . . . x_t], and it is desired to move over one sample to [x_{t−n+2} . . . x_{t+1}], it is necessary to re-compute the entire score function. The re-computation is required because the state of the art in signal processing in speech and other fields uses the Fast Fourier Transform (FFT) and a “window” function such as a Hanning window. Window functions are necessary to smooth transitions in the data and eliminate edge effects, because the data is processed in “chunks” which are not always aligned with the naturally occurring event boundaries.
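The effect of the window on sample-by-sample updating can be illustrated with a minimal sketch. It assumes a simple energy feature as a stand-in for the score function, which the text above does not specify, and all function names are hypothetical. Without a window, moving over one sample only requires dropping the oldest sample and adding the newest one. With a window, every retained sample receives a different weight after the shift, so the same shortcut fails and the feature must be computed again over the whole segment.

```python
import math

def hann(n):
    # Symmetric Hanning window of length n (n >= 2).
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def rect_energy(x, start, n):
    # Unwindowed energy of x[start : start + n].
    return sum(v * v for v in x[start:start + n])

def rect_energy_slide(prev, x, start, n):
    # Energy of x[start + 1 : start + 1 + n], obtained from the energy of
    # x[start : start + n] in constant time.
    return prev - x[start] ** 2 + x[start + n] ** 2

def windowed_energy(x, start, n, w):
    # Windowed energy: each sample is weighted by its position in the segment.
    return sum((w[k] * x[start + k]) ** 2 for k in range(n))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = 4
w = hann(n)

# Unwindowed: the constant-time update matches a full recomputation.
e0 = rect_energy(x, 0, n)
print(rect_energy_slide(e0, x, 0, n), rect_energy(x, 1, n))

# Windowed: the same drop-one/add-one shortcut does not match.
w0 = windowed_energy(x, 0, n, w)
shortcut = w0 - (w[0] * x[0]) ** 2 + (w[n - 1] * x[n]) ** 2
print(shortcut, windowed_energy(x, 1, n, w))
```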
It should be understood that the present invention would in fact enhance the functionality of the above cited art by the combined effect of eliminating the window function previously used, and providing sample-by-sample updates.