The invention generally relates to the field of speech signal processing, particularly speech recognition and voice output.
In speech output, individual short speech segments are generated which, in time sequence, yield a speech signal closely resembling an expression uttered in natural speech. For generating the individual speech segments, it is desirable to use a minimal number of parameters which nevertheless model a speech segment as accurately as possible. These parameters are based on the natural vocal tract, which has different resonance frequencies with generally different bandwidths for generating different sounds. The resonance frequencies in the speech signal are called formant frequencies, and specifying them together with their bandwidths is then sufficient for generating different sounds. These parameters may advantageously be derived from a natural speech signal.
However, deriving these parameters from a natural speech signal may also be used for speech recognition. In this case, a speech signal is divided into short periods, and characteristic values are derived from each period and compared with reference values which correspond to given sounds. By further processing the results of the comparison, it can be determined which expression was most probably uttered. The characteristic values may be, for example, the energies in successive frequency segments. However, good results can also be achieved when the formant frequencies are used as characteristic values. With these frequencies, many deviations of actually uttered expressions from the reference values used for the recognition can be better taken into account.
It is an object of the invention to provide a method with which the formant frequencies or the characteristic values indicating these formant frequencies can be determined from a speech signal in a reliable manner and with a relatively small number of computations so that, essentially, real-time processing is possible.
According to the invention, this object is achieved in that initially the power density spectrum is formed at discrete frequencies for consecutive periods of the speech signal. For a predetermined, first number of consecutive segments of the power density spectrum, the first three autocorrelation coefficients are formed in each of these periods. For this purpose, the boundary frequencies of the segments must be determined, which are optimal for an approximation by a model function with a number of formant frequencies corresponding to the number of segments. For this determination of the boundary frequencies, an error value is formed from the autocorrelation coefficients for each segment, and the error values of all segments are summed. The formation of the autocorrelation coefficients and the error values is repeated for different boundary frequencies between the segments until the minimum of the sum of the error values and the associated optimum boundary frequencies have been determined. Finally, at least one characteristic value is derived for each segment from the autocorrelation coefficients of the segments with the optimum boundary frequencies. These values may be prediction coefficients, which can be directly determined from the autocorrelation coefficients, or the resonance frequencies and possibly the bandwidths, which in turn follow unambiguously from the prediction coefficients.
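The step from three autocorrelation coefficients to an error value and a resonance frequency may be illustrated by the following minimal sketch. It assumes, beyond what is stated above, that each segment is modeled by a second-order linear predictor, so that the error value is the residual of the 2x2 normal equations and the resonance frequency is the pole angle of the resulting all-pole model. The function names `second_order_predictor` and `resonance_angle` are hypothetical and chosen for illustration only.

```python
import math

def second_order_predictor(r0, r1, r2):
    """Assumed model: solve [r0 r1; r1 r0] [a1 a2]^T = [r1 r2]^T.

    Returns the two prediction coefficients and the residual error
    r0 - a1*r1 - a2*r2 for one frequency segment.
    """
    det = r0 * r0 - r1 * r1
    if det <= 0.0:
        raise ValueError("degenerate autocorrelation values")
    a1 = r1 * (r0 - r2) / det
    a2 = (r0 * r2 - r1 * r1) / det
    error = r0 - a1 * r1 - a2 * r2
    return a1, a2, error

def resonance_angle(a1, a2):
    """Pole angle in radians of 1 / (1 - a1 z^-1 - a2 z^-2).

    Returns None when the poles are real, i.e. when the segment
    shows no resonance under this assumed model.
    """
    if a1 * a1 + 4.0 * a2 >= 0.0:
        return None
    return math.acos(a1 / (2.0 * math.sqrt(-a2)))

# Usage: autocorrelation values of an undamped oscillation at 0.7 rad
a1, a2, err = second_order_predictor(1.0, math.cos(0.7), math.cos(1.4))
angle = resonance_angle(a1, a2)
```

For the undamped oscillation in the usage lines the residual error vanishes and the recovered angle equals the oscillation frequency, which is a convenient sanity check of the formulas.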
The direct formation of the autocorrelation coefficients for given frequency segments of the power density spectrum requires some computation. In accordance with an embodiment of the invention, a simpler way of forming such autocorrelation coefficients from the power density spectrum is to determine a group of auxiliary values from the power density spectrum for each period, which auxiliary values represent the autocorrelation coefficients from the lowest frequency up to a given higher frequency. These auxiliary values are stored in a Table and associated with the respective higher frequency. An autocorrelation coefficient for a given frequency segment is then determined from the difference between two values in the Table. The latter process only requires a simple computation, while the Table is determined only once in each period with a limited computation time.
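The Table lookup may be sketched as follows. The sketch assumes, since no formula is given here, that the autocorrelation coefficient of lag n is a cosine-weighted sum of the power density values, and that segments are half-open index ranges `[lo, hi)`. The names `build_autocorr_table` and `segment_autocorr` are hypothetical.

```python
import math

def build_autocorr_table(power, n_lags=3):
    """Cumulative auxiliary values, computed once per period.

    table[n][i] holds the lag-n coefficient accumulated over the
    discrete frequencies 0 .. i-1 (assumed cosine weighting).
    """
    size = len(power)
    table = [[0.0] * (size + 1) for _ in range(n_lags)]
    for n in range(n_lags):
        for i, p in enumerate(power):
            weight = math.cos(math.pi * n * i / size)
            table[n][i + 1] = table[n][i] + p * weight
    return table

def segment_autocorr(table, lo, hi):
    """Coefficients for the segment [lo, hi): one subtraction per lag."""
    return [row[hi] - row[lo] for row in table]

# Usage with an arbitrary illustrative spectrum
power = [1.0, 4.0, 9.0, 2.0, 0.5, 3.0, 7.0, 1.5]
table = build_autocorr_table(power)
r0, r1, r2 = segment_autocorr(table, 2, 6)
```

Building the Table costs one pass over the spectrum per lag, after which every candidate segment is evaluated with three subtractions, whatever its width.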
The optimum boundary frequencies, at which the sum of the error values is minimal, are essentially determined in accordance with the principle of dynamic programming. For this purpose, a further auxiliary value is used which represents the error for the optimum division of the frequency range from the lowest frequency up to a higher frequency into a given number of segments. For each successive higher frequency, the range up to this frequency is subdivided into two frequency intervals, with the interval boundary assuming all frequencies in steps. Whenever this auxiliary value is larger than the sum of the auxiliary value obtained at the interval boundary for the preceding segment and the error value for the range between the interval boundary and the current higher frequency, the auxiliary value is set to this sum and the associated interval boundary is stored at the same time. When this has been effected for all higher frequencies up to the maximum frequency, the absolute, optimum segment boundaries are then obtained by way of traceback.
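The dynamic programming search with traceback may be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: segments are half-open index ranges, no minimum or maximum segment width is enforced, and the per-segment error is supplied as a callable `seg_error(lo, hi)`. The names `optimal_boundaries` and `make_variance_error` are hypothetical, and the variance-based error merely stands in for the error value derived from the autocorrelation coefficients.

```python
def optimal_boundaries(n_freqs, n_segments, seg_error):
    """Split the indices [0, n_freqs) into n_segments contiguous segments
    so that the summed segment error is minimal."""
    inf = float("inf")
    # best[k][j]: minimal error for dividing [0, j) into k segments
    best = [[inf] * (n_freqs + 1) for _ in range(n_segments + 1)]
    # back[k][j]: interval boundary at which that minimum was reached
    back = [[0] * (n_freqs + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n_freqs + 1):        # current higher frequency
            for b in range(k - 1, j):          # candidate interval boundary
                cand = best[k - 1][b] + seg_error(b, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    back[k][j] = b
    # Traceback from the maximum frequency down to the lowest one
    bounds = []
    j = n_freqs
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        bounds.append(j)
    bounds.reverse()
    return best[n_segments][n_freqs], bounds[1:]  # drop the leading 0

def make_variance_error(values):
    """Toy stand-in error: squared deviation from the segment mean."""
    def seg_error(lo, hi):
        seg = values[lo:hi]
        mean = sum(seg) / len(seg)
        return sum((v - mean) ** 2 for v in seg)
    return seg_error

# Usage: three flat plateaus are separated at indices 3 and 6
values = [1, 1, 1, 5, 5, 5, 9, 9]
total, inner = optimal_boundaries(len(values), 3, make_variance_error(values))
```

Because each auxiliary value reuses the optimum already found for one segment fewer, the search avoids enumerating every combination of boundary frequencies; combined with a Table lookup for the segment error, each candidate costs only a few arithmetic operations.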