The invention relates to a method of deriving at least one sequence of words from a speech signal which represents an expression spoken in natural speech, the individual words of the expression not necessarily being separated by pauses. Only a vocabulary is given, so that only words of this vocabulary can be derived. The derived sequence of words should correspond exactly to the spoken sequence of words. Therefore, such methods are also called speech recognition methods. The invention also relates to an arrangement for deriving at least one sequence of words.
In a method of this kind which is known from published European patent application EP 614 172 A2, to which U.S. Pat. No. 5,634,083 corresponds, individual words are derived by comparison with reference signals where each reference signal corresponds to a spoken word in the vocabulary, the derived words being combined so as to form a word graph. Because complete correspondence to the reference signals hardly ever occurs in practice, a plurality of similarly sounding words is derived, simultaneously or with a time overlap, each of said words being assigned a respective score in conformity with the degree of correspondence to the respective reference signals. The sequence of consecutive words for which the smallest sum of scores occurs is output as the most probably spoken word sequence.
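The selection of the word sequence having the smallest sum of scores can be illustrated by the following minimal sketch. It is not taken from the cited application; it assumes a hypothetical representation in which the word graph is a list of edges `(from_node, to_node, word, score)`, the nodes are integers ordered in time (so that `from_node < to_node`), and a smaller score denotes a better correspondence to the reference signals. The function name `best_word_sequence` and the example graph `WORD_GRAPH` are likewise illustrative assumptions.

```python
def best_word_sequence(edges, start, end):
    """Return (words, total) for the path with the smallest sum of scores.

    edges: iterable of (from_node, to_node, word, score) tuples.
    Assumption: nodes are integers ordered in time, so from_node < to_node.
    Returns None if the end node cannot be reached from the start node.
    """
    # best[node] = (smallest score sum found so far, edge used to reach node)
    best = {start: (0, None)}
    # Because from_node < to_node, handling edges in order of their source
    # node guarantees that a node's best sum is final before it is extended.
    for edge in sorted(edges, key=lambda e: e[0]):
        src, dst, _word, score = edge
        if src not in best:
            continue  # source node is not reachable from the start node
        total = best[src][0] + score
        if dst not in best or total < best[dst][0]:
            best[dst] = (total, edge)
    if end not in best:
        return None
    # Follow the stored edges backwards from the end node to the start node.
    words = []
    node = end
    while node != start:
        src, _dst, word, _score = best[node][1]
        words.append(word)
        node = src
    words.reverse()
    return words, best[end][0]


# Hypothetical word graph with similarly sounding, overlapping words.
WORD_GRAPH = [
    (0, 1, "two", 10),
    (0, 1, "to", 14),
    (1, 3, "tickets", 20),
    (1, 2, "tick", 8),
    (2, 3, "its", 15),
]

print(best_word_sequence(WORD_GRAPH, 0, 3))  # (['two', 'tickets'], 30)
```

In this sketch only one predecessor edge is kept per node, which suffices for the single best sequence but discards the information needed to recover alternative sequences.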
However, it often occurs that, because of a non-optimum pronunciation, the sequence having the smallest sum of scores is not exactly the actually spoken word sequence, since the latter has a higher sum of scores in the word graph. In order to enable such an actually spoken word sequence to be output nevertheless, it is known in principle to derive a plurality of word sequences from a speech signal for which the probability of correspondence with the speech signal is step-wise lower for successive sequences. For example, an operator can then select the actually spoken word sequence from this plurality of word sequences. A further application for the output of different word sequences with a decreasing probability of correspondence to the speech signal concerns dialogue systems in which the word sequence output is used for an automatic database enquiry. Therein, the word sequences recognized as being most probable could lead to meaningless or non-interpretable database enquiries whereas a word sequence of lower probability leads to a useful database enquiry; therefore, it may be assumed that such a word sequence best corresponds to the actually spoken sentence.
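The derivation of a plurality of word sequences in order of increasing sum of scores, and the selection of the first sequence that yields a useful database enquiry, can be sketched as follows. This is a simple best-first enumeration chosen for illustration only; it is neither the method of the invention nor that of any cited document. It assumes the same hypothetical edge representation `(from_node, to_node, word, score)` with `from_node < to_node` and non-negative scores. The names `n_best_word_sequences` and `first_interpretable`, as well as the predicate passed to the latter, are assumptions.

```python
import heapq
from collections import defaultdict
from itertools import count


def n_best_word_sequences(edges, start, end, n):
    """Return up to n (words, total) pairs in order of increasing score sum.

    Assumptions: scores are non-negative and from_node < to_node, so partial
    paths can be expanded best-first and the search always terminates.
    """
    outgoing = defaultdict(list)
    for src, dst, word, score in edges:
        outgoing[src].append((dst, word, score))
    tie = count()  # tie-breaker so the heap never has to compare word lists
    heap = [(0, next(tie), start, [])]
    results = []
    while heap and len(results) < n:
        total, _, node, words = heapq.heappop(heap)
        if node == end:
            # Every remaining partial path already has a sum >= total.
            results.append((words, total))
            continue
        for dst, word, score in outgoing[node]:
            heapq.heappush(heap, (total + score, next(tie), dst, words + [word]))
    return results


def first_interpretable(hypotheses, is_interpretable):
    """Return the best-scoring hypothesis accepted by the given predicate."""
    for words, total in hypotheses:
        if is_interpretable(words):
            return words, total
    return None


# Hypothetical word graph with similarly sounding, overlapping words.
WORD_GRAPH = [
    (0, 1, "two", 10),
    (0, 1, "to", 14),
    (1, 3, "tickets", 20),
    (1, 2, "tick", 8),
    (2, 3, "its", 15),
]

for words, total in n_best_word_sequences(WORD_GRAPH, 0, 3, 3):
    print(total, " ".join(words))
```

Each partial path is stored separately in this sketch, so the number of stored paths can grow rapidly with the size of the word graph.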
The generation of a plurality of word sequences of different probability of correspondence to the speech signal, however, is generally computationally very complex. The Proceedings ICASSP-91, pp. 701 to 704, Toronto 1991, describe a method for finding multiple sentence hypotheses in which the steps enabling backtracking of the various sentence hypotheses are complex.