This invention relates to a method of recognizing a speech signal which is derived from continuously spoken words and consists of a temporal sequence of speech values, each of which represents a section of the speech signal. The speech values are compared with given stored comparison values, each group of comparison values representing a word of a given vocabulary, and the comparison results are summed over different sequences of combinations of comparison values and speech values to form a distance sum per sequence.
Such a method is known from DE-OS No. 3,215,868 and from the journal "I.E.E.E. Transactions on Acoustics, Speech and Signal Processing", Vol. ASSP-32, No. 2, Apr. 1984, pp. 263 to 271. In these methods, a comparatively large number of different sequences of comparison values, and hence of words, is followed continuously, because a sequence which at a given instant does not exhibit the smallest distance sum may nevertheless prove in the end to be the most suitable once the further speech values have been compared. The known methods predominantly use, as speech values, sample values of the speech signal which were obtained at 10 ms intervals and decomposed into their spectral values. However, other measures for processing the sampled speech signal may also be used. Likewise, the speech values may be obtained from several sample values and may represent, for example, diphones or phonemes or even larger units, which does not essentially change the method.
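The summation of comparison results to a distance sum can be illustrated by a minimal sketch. It is not the procedure of the cited documents: it assumes isolated words, a Euclidean local distance and a simple dynamic-programming alignment in which each new speech value may stay on the same comparison value, advance by one or skip one. The names `local_distance`, `distance_sum` and `rank_words`, as well as the toy vocabulary, are hypothetical.

```python
import math


def local_distance(speech_value, comparison_value):
    """Euclidean distance between one speech value and one comparison value."""
    return math.sqrt(sum((s - c) ** 2 for s, c in zip(speech_value, comparison_value)))


def distance_sum(speech_values, comparison_values):
    """Smallest distance sum over all monotonic alignments of the two sequences.

    Assumption: the alignment starts at the first and ends at the last
    comparison value of the word.
    """
    inf = float("inf")
    # column[j]: best distance sum that ends at comparison value j
    # after the speech values processed so far.
    column = [inf] * len(comparison_values)
    for i, speech_value in enumerate(speech_values):
        new_column = [inf] * len(comparison_values)
        for j, comparison_value in enumerate(comparison_values):
            if i == 0:
                best_previous = 0.0 if j == 0 else inf
            else:
                best_previous = min(
                    column[j],                          # stay
                    column[j - 1] if j >= 1 else inf,   # advance by one
                    column[j - 2] if j >= 2 else inf,   # skip one
                )
            new_column[j] = best_previous + local_distance(speech_value, comparison_value)
        column = new_column
    return column[-1] if column else inf


def rank_words(speech_values, vocabulary):
    """Return every word with its distance sum, best first.

    All candidates are kept rather than only the best one, mirroring the
    idea that several sequences are followed in parallel.
    """
    scored = [(word, distance_sum(speech_values, ref)) for word, ref in vocabulary.items()]
    return sorted(scored, key=lambda item: item[1])


# Toy data: one-dimensional "spectral" values, purely illustrative.
vocabulary = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [5.0]],
}
speech = [[0.0], [0.0], [1.0], [2.0]]
ranking = rank_words(speech, vocabulary)
```

In a recognizer for continuously spoken words the same accumulation would run across word boundaries, with the distance sum at the end of one word carried into the start of the next.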
In the second of the aforementioned documents, it is indicated that it is effective to provide syntactical limitations, more particularly with a large vocabulary, in order to increase the certainty and reliability of recognition. These limitations take effect at the word transitions and essentially consist in taking into account a speech model in the form of a network, i.e. a regular grammar in the sense of formal language theory. Such a speech model is comparatively rigid and inflexible, however, with the result that only sentences constructed in a given manner can be recognized unless the number of sentence constructions provided in the speech model is allowed to become excessively large.
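A speech model in the form of a network can be pictured as a finite-state automaton that is consulted only at word transitions. The following sketch is an illustration under assumed names: the network `NETWORK`, its states and its four-word vocabulary are hypothetical and are not taken from the cited document.

```python
# Hypothetical word network: state -> {word: next state}.
NETWORK = {
    "start": {"call": "object", "dial": "object"},
    "object": {"home": "end", "office": "end"},
}
FINAL_STATES = {"end"}


def allowed_words(state):
    """Words that may follow at a word transition in the given state."""
    return sorted(NETWORK.get(state, {}))


def accepts(words):
    """True if the word sequence is a complete path through the network."""
    state = "start"
    for word in words:
        transitions = NETWORK.get(state, {})
        if word not in transitions:
            return False  # the network has no such word transition
        state = transitions[word]
    return state in FINAL_STATES
```

The rigidity mentioned above is visible here: every additional sentence construction has to be added as explicit states and transitions, so the network grows with each pattern that is to be recognized.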
Formal language theory also provides a further class, the context-free grammar, which is more flexible and can capture the structure of real spoken sentences more satisfactorily.
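For illustration, a context-free grammar can be written as a set of rewriting rules and tested against a word sequence with a table-filling recognizer. The sketch below assumes a grammar in Chomsky normal form and uses the well-known CYK procedure; the rules, the lexicon and the function name `derivable` are hypothetical examples and do not describe the method of the invention.

```python
# Hypothetical grammar in Chomsky normal form.
# Binary rules: (left symbol, right symbol) -> set of parent symbols.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("DET", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexicon: word -> set of symbols that produce it directly.
LEXICON = {
    "the": {"DET"},
    "dog": {"N"},
    "cat": {"N"},
    "sees": {"V"},
}


def derivable(words, start="S"):
    """CYK recognition: True if the start symbol derives the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # table[i][length - 1]: symbols that derive words[i:i + length].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][0] = set(LEXICON.get(word, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[i][split - 1]:
                    for right in table[i + split][length - split - 1]:
                        table[i][length - 1] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]
```

Because a symbol such as NP may be reused wherever a rule calls for it, a few rules cover many sentence constructions that a network would have to enumerate path by path.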