In speech recognition systems based on a statistical language model rather than on a knowledge-based approach, for example the English speech recognition system TANGORA developed by F. Jelinek et al. at the IBM Thomas J. Watson Research Center in Yorktown Heights, USA, and described in "The development of an experimental discrete dictation recognizer", Proceedings of the IEEE 73(1985)11, pp. 1616-1624, the recognition process can be subdivided into several steps. The tasks of these steps, depicted in FIG. 1 (from the article by K. Wothke, U. Bandara, J. Kempf, E. Keppel, K. Mohr, and G. Walch (IBM Scientific Center Heidelberg), "The SPRING Speech Recognition System for German", in Proceedings of Eurospeech 89, Paris, 26.-28.IX.1989), are:
- extraction of a sequence of so-called acoustic labels from the speech signal by a signal processor;
- a fast and a detailed acoustic match to find those words which are most likely to have produced the observed label sequence;
- computation, for a sequence of words, of the probability of its occurrence in the language by means of a statistical language model.
The whole system can be implemented either on a digital computer, for example a personal computer (PC), or on a portable dictaphone or a telephone device. The speech signal is amplified and digitized, and the digitized data are then read into a buffer memory contained, for example, in the signal processor. From the resulting frequency spectrum a vector of a number of elements is taken, and the spectrum is adjusted to account for an ear model.
Each vector is compared with a number of (say 200) speaker-dependent prototype vectors. The identification number of the most similar prototype vector, called an acoustic label, is taken and sent to the subsequent processing stages. The speaker-dependent prototype vectors are generated from language-specific prototype vectors during a training phase of the system with a speech sample.
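The labeling step described above can be sketched as a nearest-neighbor search over the prototype vectors. The following is a minimal illustration, not the actual implementation: the distance measure (plain squared Euclidean distance) and the prototype values are assumptions for the example, and a real system would use about 200 prototypes rather than three.

```python
# Minimal sketch of acoustic labeling: each spectral vector is mapped
# to the identification number (acoustic label) of the most similar
# speaker-dependent prototype vector.  Distance measure and prototype
# values are invented for illustration.

def acoustic_label(vector, prototypes):
    """Return the index of the closest prototype (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(prototypes)), key=lambda i: dist2(vector, prototypes[i]))

prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # say 200 in practice
labels = [acoustic_label(v, prototypes) for v in [(0.1, 0.1), (0.9, 0.2)]]
```

The resulting label sequence (here `[0, 1]`) is what the fast match and detailed match subsequently score against their word models.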
The fast acoustic match determines, for every word of a reference vocabulary, the probability with which it would have produced the sequence of acoustic labels observed from the speech signal. The probability of a word is calculated until either the end of the word is reached or the probability drops below a pre-specified level. As reference units for the determination of this probability, the fast match uses a so-called phonetic transcription of each word in the reference vocabulary, including relevant pronunciation variants, and a hidden Markov model for each allophone used in the phonetic transcriptions. The phonetic transcriptions are generated by means of a set of phoneticization rules (l.c.).
The hidden Markov model of an allophone describes the probability with which a substring of the sequence of acoustic labels corresponds to that allophone. The Markov models are language specific, and their output and transition probabilities are trained for individual speakers. The Markov model of the phonetic transcription of a word is the chain of the Markov models of its allophones.
The statistical language model is one of the most essential parts of a speech recognizer. It is complementary to the acoustic model in that it supplies additional language-based information to the system in order to resolve the uncertainty associated with the word hypotheses proposed by the acoustic side. In practice, the acoustic side proposes a set of possible word candidates with a probability attached to each candidate. The language model, on the other hand, predicts the possible candidates with corresponding probabilities. The system applies maximum likelihood techniques to find the most probable candidate out of these two sets of candidates.
For the purpose of supplying this language-based information, the language model uses a priori computed relative frequencies for word sequences which, for practical reasons, usually consist of three words, i.e. trigrams, that is word triplets `w1 w2 w3`. It is hereby assumed that the probability of `w3` occurring depends on the relative frequencies of `w3` (unigrams), `w2 w3` (bigrams), and `w1 w2 w3` (trigrams) in a given text corpus. For the computation of these frequencies a very large authentic text corpus from the application domain, e.g. real radiological reports or business correspondence, is needed.
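The n-gram counting described above can be sketched as follows. The toy corpus is an assumption for illustration; a real model is estimated from a very large domain corpus.

```python
from collections import Counter

# Sketch of n-gram frequency estimation from a text corpus
# (toy corpus; real training uses millions of words from the domain).
corpus = "the report shows the report is normal".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def rel_freq(w1, w2, w3):
    """Relative frequency of w3 following the word pair (w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
```

In this toy corpus the bigram `the report` occurs twice and the trigram `the report shows` once, so `rel_freq("the", "report", "shows")` is 0.5.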
The language model receives from the fast acoustic match a set of word candidates. For each of these candidates it determines the probability with which it follows the words which have already been recognized. For this purpose the language model uses probabilities of single words, word pairs, and word triples. These probabilities are estimated for all words in the vocabulary using large text corpora. The word candidates with the highest combined probabilities supplied by the fast match and the language model are selected and passed to the detailed match.
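The combination of acoustic and language-model probabilities can be sketched as a sum of log-probabilities, the standard way to multiply small probabilities without underflow. All candidate words and probability values below are invented for illustration; the example reuses the `right`/`write` ambiguity mentioned later in this text.

```python
import math

# Sketch of candidate selection: the fast-match acoustic score and the
# language-model score are combined in the log domain and the best
# candidate wins (all probabilities invented for illustration).
acoustic = {"write": 0.40, "right": 0.35, "rite": 0.25}
language = {"write": 0.10, "right": 0.70, "rite": 0.01}

def combined_score(word):
    """log P_acoustic(word) + log P_language(word)."""
    return math.log(acoustic[word]) + math.log(language[word])

best = max(acoustic, key=combined_score)
```

Here the acoustic side alone slightly prefers `write`, but the language model's strong preference for `right` decides the combined ranking, illustrating how the two knowledge sources complement each other.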
The detailed acoustic match computes for each word received from the language model the probability with which it would produce the acoustic label sequence observed. In contrast to the fast acoustic match, the detailed match does not perform this process for all words of the reference vocabulary but only for those received from the language model, and it does not use phonetic transcriptions and hidden Markov models of allophones as reference units. Instead, the detailed match uses hidden Markov models of so-called fenemic phones, artificial sound units which usually correspond to one acoustic label.
The three probabilities supplied by the fast match, the language model, and the detailed match are then combined for the most likely word sequences. At the end of each hypothesis the fast match, the language model, and the detailed match are started again.
In the field, the foregoing approaches use a vocabulary of about 20,000 words to cover at least 95% of the words uttered. A large text corpus from the domain is analyzed to obtain the relative frequencies of all occurring unigrams, bigrams, and trigrams. The number of theoretically possible trigrams for a vocabulary of 20,000 words is 20,000^3 = 8x10^12. Only a small fraction of this amount is actually observed. Even then, about 170 MB of disk capacity is required by the speech recognizer to store a language model file which contains all the observed trigrams and their corresponding frequencies. This file is used at run time.
There are three adverse effects due to the large size of the language model file:
1. the required disk capacity is large, and thus the hardware cost of the recognizer unit is high;
2. the speed performance of the recognizer becomes increasingly poor due to the long retrieval delay when searching in a large file;
3. it is difficult to port the speech recognizer software to smaller and cheaper computers with relatively slow processors, e.g. laptops.
For the above reasons, the size of the language model used in this prior art speech recognition technology is a trade-off between retrieval delay and recognition accuracy. In these approaches the language model file is compressed by discarding trigrams which occur less often than a given threshold, e.g. fewer than three times. The underlying assumption is that if a certain trigram occurs very seldom in the corpus, then it will most likely not be uttered by the speaker. This approach squeezes the size of the language model to achieve high processing speed, but at a potential loss of recognition accuracy.
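The prior-art compression step just described amounts to a simple threshold filter over the trigram counts. The counts below are invented for illustration; the point is that singleton trigrams are exactly the entries such a filter discards.

```python
# Sketch of the prior-art compression step: trigrams observed fewer
# times than a cut-off are discarded (counts invented for illustration).
trigram_counts = {
    ("the", "report", "shows"): 12,
    ("report", "is", "normal"): 1,   # a "singleton" trigram
    ("no", "focal", "lesion"): 7,
}

def prune(counts, threshold):
    """Keep only trigrams seen at least `threshold` times."""
    return {t: c for t, c in counts.items() if c >= threshold}

compact = prune(trigram_counts, 3)  # the singleton trigram is lost
```

This shrinks the model file, but any discarded trigram that a speaker later utters must be recognized on acoustic evidence alone, which is exactly the weakness discussed next.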
In real field applications it is observed that the above assumption is not realistic. In many cases, trigrams were observed only once not because they are rare, but because the size of the evaluated text corpus was very limited. The speakers nevertheless do utter those so-called singleton trigrams.
There are further prior art techniques, for example a method for compressing a fast match table to save memory space, as described in an article by M. Nishimura in IBM Technical Disclosure Bulletin (TDB), No. 1, June 1991, pp. 427-429, entitled "Method for Compressing a Fast Match Table". A further approach relates to a method for compressing a library containing a large number of model utterances for a speech recognition system, published by H. Crepy in IBM TDB, No. 2, February 1988, pp. 388-389. The first article discloses a solution based on a binary tree coding algorithm, the second one based on common data compression techniques. Notably, both approaches concern compression of the acoustic part of a speech recognizer, not of the language model.
The above approaches for the compression of language model files have led to compact models in the past, but the resulting recognition error rate was considerably high, because the users uttered the discarded trigrams, which were then not supported by the language model. Those systems had to rely solely on the acoustic side. This immediately led to recognition errors for acoustically identical or similar words, e.g. `right`/`write` or `daß`/`das`.
The problem underlying the present invention, therefore, is to provide a mechanism for speech recognizers of the above kind which allows a strong reduction of the size of the language model while avoiding the disadvantages discussed above.