This invention relates to a method of determining the start-point and end-point of a word signal corresponding to an isolated utterance in a speech signal by establishing an extreme value in a sequence of digital values derived from the speech signal, taking into account values surrounding the extreme value of the signal variation and a threshold value.
Methods of this type for determining the start-point and end-point in a speech signal are used in particular when the speech signal consists of isolated utterances or very short word groups that are to be recognized automatically. In almost all applications, the actual word signal in the speech signal is accompanied by interference, noise, and pauses, and also by extraneous sounds such as loud breathing. To recognize the word or words in the speech signal as reliably as possible, it is however important that the identification begin precisely at the portion of the speech signal that represents the start of the word to be recognized.
Several methods of determining start-points and end-points are already known. ICASSP 84 Proceedings, 19 to 21 Mar. 1984, San Diego, California, describes on pp. 18B.7.4 a method of detecting end-points in a speech signal which operates on the autocorrelation matrix of the speech signal. Obtaining such a matrix requires significant computational cost and design effort, and the results are not satisfactory in all conditions. U.S. Pat. No. 4,821,325 (4/11/89) uses an end-point detector which subdivides the speech signal into overlapping blocks. These blocks are, however, fixed, independently of the variation of the speech signal. The block having the maximum energy is determined, together with the preceding block whose energy level is below a threshold value located a predetermined extent below the maximum energy. By means of further expensive steps, a number of such maxima and their durations are established, and energy maxima of a longer duration are calculated therefrom. Furthermore, reliable end-point recognition is difficult when high-level interference is superimposed on the speech signal.
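For illustration only, the fixed-block energy search described for the prior-art detector can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of U.S. Pat. No. 4,821,325: the function names `block_energies` and `find_start_block`, the parameters `block_len`, `hop` and `drop_ratio`, and the expression of the threshold as a fixed fraction of the maximum energy are all hypothetical choices made here. The further steps concerning several maxima and their durations are not modelled.

```python
from typing import List, Optional, Sequence, Tuple


def block_energies(samples: Sequence[float], block_len: int, hop: int) -> List[float]:
    """Energy (sum of squares) of fixed-length blocks.

    A hop smaller than block_len makes consecutive blocks overlap.
    Block positions are fixed and do not depend on the signal variation.
    """
    energies = []
    for start in range(0, len(samples) - block_len + 1, hop):
        block = samples[start:start + block_len]
        energies.append(sum(x * x for x in block))
    return energies


def find_start_block(energies: Sequence[float],
                     drop_ratio: float) -> Tuple[int, Optional[int]]:
    """Locate the maximum-energy block and the nearest preceding quiet block.

    Assumption: the threshold lies below the maximum by a fixed factor,
    i.e. threshold = drop_ratio * max_energy with 0 < drop_ratio < 1.
    Returns (peak_index, preceding_index), where preceding_index is None
    if no earlier block falls below the threshold.
    """
    if not energies:
        raise ValueError("no blocks to search")
    # First block with the maximum energy.
    peak = max(range(len(energies)), key=energies.__getitem__)
    threshold = drop_ratio * energies[peak]
    # Walk backwards from the peak to the first block below the threshold.
    for i in range(peak - 1, -1, -1):
        if energies[i] < threshold:
            return peak, i
    return peak, None


if __name__ == "__main__":
    # Synthetic signal: silence, a burst, silence.
    signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8
    e = block_energies(signal, block_len=4, hop=2)
    print(e)
    print(find_start_block(e, drop_ratio=0.25))
```

Because the block grid is fixed in advance, the start estimate in such a scheme can only be as fine as the hop size, and a loud interference burst can take the place of the word as the energy maximum, which is consistent with the drawbacks noted above.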