The invention relates to a method for producing reference segments describing speech modules and a method for modeling speech units of a spoken test model in voice recognition systems.
Earlier, commonly used speech recognition systems are based on the dynamic time warping (DTW) principle. In this situation, for each word a complete sequence of characteristic vectors, obtained from a training utterance for this word, is saved as a reference model and compared in the operational phase, by non-linear mapping, with a test model of a voice signal to be recognized. The comparison serves to determine the minimum distance between the respective reference model and the test model, and the reference model having the smallest distance from the test model is selected as the reference model that suitably describes the test model.
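By way of illustration only, the DTW comparison described above can be sketched as follows. The sketch assumes a Euclidean distance between characteristic vectors and the simplest symmetric step pattern; the function names `dtw_distance` and `recognize` are hypothetical and are not taken from any of the cited systems.

```python
import math

def euclidean(a, b):
    """Distance between two characteristic vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(reference, test, dist=euclidean):
    """Minimum accumulated distance of a non-linear mapping of test onto reference."""
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i reference
    # vectors with the first j test vectors
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference advances
                                 cost[i][j - 1],      # test advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(references, test):
    """Select the word whose reference model has the smallest distance."""
    return min(references, key=lambda word: dtw_distance(references[word], test))

# Toy one-dimensional characteristic vectors, one reference model per word
references = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [4.0]],
}
test_model = [[0.0], [0.0], [1.0], [2.0], [2.0]]
best_word = recognize(references, test_model)
```

The dictionary `references` holds one complete vector sequence per word, which is exactly the storage requirement that grows with the vocabulary.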
The disadvantage of this method is that a reference model needs to be stored for every word to be recognized. As a result, the code book containing the reference models is extremely extensive, and the effort involved in training such a voice recognition system, in which a reference model is saved for every word, is correspondingly great. Moreover, it is not possible to generate reference models for words that are not part of the learned vocabulary. The present publication explains the characteristics representing the reference models: the auto-correlation characteristics, which are obtained by the auto-correlation function in each case for successive analysis windows spaced, for example, 10 ms apart, and the spectral characteristics.
The auto-correlation characteristics describe the voice signal contained in the analysis window in the time domain, and the spectral characteristics, which are obtained by a Fourier transformation, describe the voice signal in the frequency domain. In addition, several different distance measures for determining a distance between two characteristic vectors are explained. In order to improve speaker-independent recognition, this known method produces a plurality of reference models for each word, whereby the reference models are in turn ascertained as an averaged value from a plurality of training signals. In this situation, both the time structure of the entire reference model and the characteristic structure can be ascertained as averaged values. In order to produce groups of reference models which are assigned to a word in each case and exhibit an averaged time structure, training models are mapped in non-linear fashion onto an averaged model assigned to this word or word class, and then a clustering of the characteristic vectors of the training model and of the reference models already present in the class is carried out separately for each analysis window.
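Purely as a sketch, the two kinds of characteristics can be computed per analysis window as shown below. The 10 ms window spacing is the example value given above; the 20 ms window length, the number of lags and all function names are assumptions made for illustration, and a plain discrete Fourier transformation stands in for whatever transformation the cited publication uses.

```python
import cmath
import math

def frames(signal, sample_rate, window_ms=20.0, hop_ms=10.0):
    """Cut the signal into analysis windows spaced hop_ms apart."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]

def autocorrelation(frame, max_lag):
    """Time-domain characteristics: auto-correlation for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]

def magnitude_spectrum(frame):
    """Frequency-domain characteristics: magnitudes of a plain DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# Toy signal: 100 Hz sine sampled at 1000 Hz for 50 ms
sample_rate = 1000
signal = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(50)]

windows = frames(signal, sample_rate)
ac = autocorrelation(windows[0], max_lag=4)
spectrum = magnitude_spectrum(windows[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Each window thus yields one characteristic vector in either representation, and a sequence of such vectors forms a reference model or test model.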
By using this special method it is possible to achieve an extremely good recognition rate; the method is, however, subject to the disadvantages of the DTW method already described above.
More recent voice recognition systems are based on the HMM method (hidden Markov modeling). In this situation, in the training phase voice segments (for example phonemes or syllables) are collected from a large number of voice signals from different words and are subdivided into nodes (for example one node each per word-initial/word-internal/word-final sound). The characteristic vectors describing the voice signals are assigned to the nodes and stored in a code book.
During speech recognition, the test model is mapped by a non-linear mapping process (for example with the aid of the Viterbi algorithm) onto a sequence of nodes defined by the transcription (for example a phonetic description) of the word. Since the nodes only describe word segments, reference models for practically any desired word of a language can be produced by concatenation of the nodes or segments. Since a language as a rule has distinctly fewer phonemes or syllables than words, the number of nodes is significantly smaller than the number of reference models describing complete words that need to be stored for the DTW method. As a result, the training effort required for the voice recognition system is significantly reduced when compared with the DTW method.
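The mapping of a test model onto a node sequence defined by a transcription can be sketched as follows. This is a simplified, distance-based Viterbi-style alignment and not a full probabilistic HMM: it assumes that each node is an unordered set of code book vectors, that the local distance is the distance to the nearest vector of the node, and that each test vector either stays in the current node or advances to the next one. The names `viterbi_align` and `code_book` are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def node_distance(vector, node_vectors):
    """Distance of a test vector to the nearest code book vector of a node."""
    return min(euclidean(vector, v) for v in node_vectors)

def viterbi_align(test, transcription, code_book):
    """Map test vectors onto the concatenated nodes named by the transcription.

    Returns (total distance, node index per test vector). The total is
    infinite if the test model has fewer vectors than there are nodes.
    """
    nodes = [code_book[name] for name in transcription]
    n, t_len = len(nodes), len(test)
    inf = float("inf")
    cost = [[inf] * n for _ in range(t_len)]
    back = [[0] * n for _ in range(t_len)]
    cost[0][0] = node_distance(test[0], nodes[0])  # path must start in node 0
    for t in range(1, t_len):
        for s in range(n):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else inf
            if advance < stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best < inf:
                cost[t][s] = best + node_distance(test[t], nodes[s])
    # Trace the best path back from the last node
    path = [n - 1]
    for t in range(t_len - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return cost[t_len - 1][n - 1], path

# Toy code book: three nodes shared by all words that contain them
code_book = {"a": [[0.0], [0.2]], "b": [[1.0]], "c": [[2.0]]}
test_model = [[0.0], [0.1], [1.0], [2.0], [2.0]]
total, path = viterbi_align(test_model, ["a", "b", "c"], code_book)
```

Because the reference model is assembled from the transcription at recognition time, a word such as `["c", "a", "b"]` can be scored with the same code book without any additional training.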
A disadvantage of this method is, however, that the timing sequence of the characteristic vectors can no longer be ascertained within a node. This is a problem particularly in the case of long segments, such as an extended German “a”, where a very large number of characteristic vectors frequently fit similar nodes even though the timing sequence of the vectors does not match. As a result, the recognition rate can be seriously impaired.
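The loss of the timing sequence within a node can be illustrated with a small sketch. It assumes, for illustration only, scalar characteristic values and a node that is scored by the distance of each test value to its nearest stored value; `node_score` is a hypothetical helper.

```python
def node_score(segment, node_values):
    """Accumulated distance of a segment to a node holding an unordered set."""
    return sum(min(abs(x - v) for v in node_values) for x in segment)

# Values collected for one long node, stored without their order
node_a = [0.0, 0.5, 1.0]

rising = [0.0, 0.5, 1.0]          # timing sequence as seen in training
falling = list(reversed(rising))  # same values, opposite timing sequence

score_rising = node_score(rising, node_a)
score_falling = node_score(falling, node_a)
```

Both segments obtain the same score, although only one of them exhibits the timing sequence that occurred in training.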
In Aibar P. et al.: “Multiple template modeling of sublexical units”, in: “Speech Recognition and Understanding”, pp. 519 to 524, Springer Verlag, Berlin, 1992, and also in Castro M. J. et al.: “Automatic selection of sublexic templates by using dynamic time warping techniques”, in: “Proceedings of the European Signal Processing Conference”, Vol. 5, No. 2, pp. 1351 to 1354, Barcelona, 1990, a segmentation of a training voice signal into speech modules and an analysis for obtaining a characteristic vector are described. In this situation, averaging is also carried out. In Ney H.: “The use of a one-stage dynamic programming algorithm for connected word recognition”, in: “IEEE Transactions on Acoustics, Speech, and Signal Processing”, Vol. ASSP-32, No. 2, pp. 263 to 271, 1984, the recognition of words in a continuously uttered sentence is disclosed, whereby a reference template is used for each word.