This invention relates to a method of recognizing at least one word string in a speech signal, from which test signals characteristic of consecutive time intervals are derived. These test signals are compared with reference signals of a plurality of given words stored in a first memory in order to form difference values, which are summed. Each difference sum is stored in a second memory together with a pointer to the memory address at which the sequence of difference sums thus obtained started at the beginning of a word. At least at word boundaries, a pointer to the word just ended and to the point where said word begins is stored in a third memory. At the end of the speech signal, at least one word string is determined by starting from at least that word for which the smallest difference sum has been obtained and proceeding, via the stored beginning of this word, to the pointer stored there to the preceding word and to its beginning, and so on. The invention further relates to an arrangement for carrying out the method.
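The trace-back through the third memory described above can be sketched as follows. This is an illustrative sketch only. The dictionary `boundary`, the function name `trace_back`, the example words, and the convention that the stored beginning of a word equals the boundary time at which the preceding word ended (with 0 marking the start of the speech signal) are assumptions made for the example and are not taken from the description.

```python
from typing import Dict, List, Tuple

# Hypothetical model of the "third memory": for each word-boundary time t,
# the word that ended at t and the time at which that word began.
BoundaryMemory = Dict[int, Tuple[str, int]]


def trace_back(boundary: BoundaryMemory, end: int) -> List[str]:
    """Follow the stored word beginnings backwards from time `end` to time 0."""
    words: List[str] = []
    t = end
    while t > 0:
        word, start = boundary[t]
        if start >= t:
            # A word must begin before it ends; otherwise the chain never terminates.
            raise ValueError("inconsistent boundary entry at time %d" % t)
        words.append(word)
        t = start  # continue with the entry stored for the preceding word
    words.reverse()  # words were collected last-to-first
    return words


# Invented example entries (assumption, for illustration only).
boundary = {
    3: ("call", 0),
    4: ("tall", 0),
    7: ("home", 3),
    8: ("hole", 4),
}
print(trace_back(boundary, 7))
```

With these example entries, starting from boundary time 7 leads first to the entry for "home", then via its stored beginning 3 to the entry for "call", whose beginning 0 ends the trace-back.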
Such a method is known from DE-OS 32 15 868. In this known method the speech signal is compared with different words by means of dynamic time adaptation, so that, as recognition proceeds over the course of the speech signal, a plurality of parallel word strings bearing a resemblance to the speech signal are obtained, the degree of resemblance being given by the accumulated difference sum within the relevant word string. Finally, at the end of the speech signal, a plurality of word strings have been completed, and the word string yielding the smallest accumulated difference sum is supplied to the output as the sole recognized word string.
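As a rough illustration of how dynamic time adaptation accumulates a difference sum for a single word, the following sketch may help. It is not the procedure of the cited publication. The function name `dtw_difference_sum`, the use of scalar signals, the absolute difference as local difference value, and the three permitted transitions are all assumptions chosen to keep the example short.

```python
from typing import List


def dtw_difference_sum(test: List[float], ref: List[float]) -> float:
    """Smallest accumulated difference sum aligning `test` with `ref`.

    Assumptions: scalar signals, absolute difference as local difference
    value, and transitions from (i-1, j), (i, j-1) and (i-1, j-1).
    """
    inf = float("inf")
    n, m = len(test), len(ref)
    # acc[i][j]: best difference sum aligning test[:i] with ref[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(test[i - 1] - ref[j - 1])
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Invented values: the second reference repeats one sample, which the
# time adaptation absorbs without adding to the difference sum.
print(dtw_difference_sum([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
print(dtw_difference_sum([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

In a recognizer for word strings, a sum of this kind would be continued across word boundaries rather than restarted at zero for each word; that extension is omitted here.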
However, as a result of different pronunciations, for example the partial suppression of word endings, the word string thus obtained is not always the string corresponding to the uttered speech signal. Therefore, in order to improve recognition, it has been proposed to employ speech models which, in conformity with the rules of natural speech, restrict the choice of the word or words which can follow a word just finished. Generally, this enables the recognition reliability to be improved. Nevertheless, it is not unlikely that ultimately, as a result of similar-sounding words whose sequence in each case complies with the rules of natural speech, a word sequence is supplied to the output as the recognized sentence which is very similar to, but not an accurate representation of, the sentence uttered, while a word sequence reaching a slightly larger accumulated difference sum at the end of the speech signal is actually the correct sentence. In many cases it is therefore useful to output not only the word sequence, i.e. the sentence, with the best similarity but also further sentences of next-best similarity, in particular if the word sequence found to be the best appears to be incorrect, for example on the basis of other sources of knowledge which, for example for reasons of complexity, have to be ignored in the recognition process.
By means of the known method this is not readily possible because for every compared word at the end of the speech signal only a single preceding word string is stored, so that it is not possible to determine different word strings whose similarity to the speech signal differs only slightly and which end with the same word.
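The limitation can be made concrete with a small sketch. It assumes, purely for illustration, that each candidate is held as a pair of accumulated difference sum and word string and that only the best candidate per final word is retained. The name `recombine_single_best`, the example words, and the numerical values are invented and do not come from the known method.

```python
from typing import Dict, List, Tuple

# (accumulated difference sum, word string); hypothetical representation.
Hypothesis = Tuple[float, List[str]]


def recombine_single_best(hyps: List[Hypothesis]) -> Dict[str, Hypothesis]:
    """Keep, for every final word, only the word string with the smallest sum."""
    kept: Dict[str, Hypothesis] = {}
    for score, words in hyps:
        final = words[-1]
        if final not in kept or score < kept[final][0]:
            kept[final] = (score, words)
    return kept


# Invented example: two strings end with the same word and differ only slightly.
hyps = [
    (41.0, ["call", "home"]),
    (41.5, ["tall", "home"]),
    (47.0, ["call", "hole"]),
]
kept = recombine_single_best(hyps)
for final, (score, words) in sorted(kept.items()):
    print(final, score, words)
```

Here the string "tall home" is discarded even though its difference sum is closer to the best one than that of "call hole", which survives only because it ends with a different word.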