The invention relates to a method and a configuration for forming classes for a language model based on linguistic classes using a computer.
A method for speech recognition is known from the reference by G. Ruske, titled “Automatische Spracherkennung - Methoden der Klassifikation und Merkmalsextraktion” [“Automatic Speech Recognition - Methods of Classification and Feature Extraction”], Oldenbourg Verlag, Munich 1988, ISBN 3-486-20877-2, pages 1-10. It is customary in this case to specify the usability of a sequence of at least one word as a component of word recognition. A probability is one measure of this usability.
A statistical language model is known from the reference by L. Rabiner, B.-H. Juang, titled “Fundamentals of Speech Recognition”, Prentice Hall 1993, pages 447-450. In speech recognition, preferably with large vocabularies, the probability P(W) of a word sequence W generally characterizes a (statistical) language model. The probability P(W) (known as the word sequence probability) is approximated by an N-gram language model $P_N(W)$:

$$P_N(W) = \prod_{i=0}^{n} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1}), \tag{0-1}$$
where
$w_i$ denotes the ith word of the sequence W with (i=1 . . . n), and
n denotes the number of words $w_i$ in the sequence W.
For N=2, equation (0-1) yields what are called bigrams.
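By way of illustration only, the bigram case (N=2) of equation (0-1) can be sketched as follows. The relative-frequency estimate, the start marker `<s>`, the function names and the toy corpus are assumptions made for this sketch and are not taken from the cited reference.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (no smoothing)."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words  # "<s>" is a hypothetical start marker
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {pair: count / history_counts[pair[0]]
            for pair, count in bigram_counts.items()}

def sequence_probability(model, words):
    """Product of bigram probabilities; an unseen bigram yields 0."""
    prob = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded, padded[1:]):
        prob *= model.get((prev, cur), 0.0)
    return prob

# Invented toy corpus of two sentences.
corpus = [["the", "debate", "continues"], ["the", "session", "continues"]]
model = train_bigram(corpus)
p = sequence_probability(model, ["the", "debate", "continues"])
print(p)
```

The sketch also shows why a large training text is needed: any bigram that does not occur in the corpus receives probability zero unless some form of smoothing is added.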
It is also known in speech recognition, preferably in the commercial field, to use an application field (domain) of limited vocabulary. Texts from various domains differ from one another not only with regard to their respective vocabulary, but also with regard to their respective syntax. Training a language model for a specific domain requires a correspondingly large set of texts (text material, text body), which is, however, only rarely present in practice, or can be obtained only with an immense outlay.
A linguistic lexicon is known from the reference by F. Guethner, P. Maier, titled “Das CISLEX-Wörterbuchsystem” [“The CISLEX Dictionary System”], CIS-Bericht [CIS report] 94-76-CIS, University of Munich, 1994. Such a lexicon is a collection, available on a computer, of as many words as possible in a language, for the purpose of looking up linguistic properties with the aid of a search program. For each word entry (“word full form”), it is possible to extract the linguistic features relevant to this word full form and the appropriate assignments, that is to say the linguistic values.
The use of linguistic classes is known from the reference by P. Witschel, titled “Constructing Linguistic Oriented Language Models for Large Vocabulary Speech Recognition”, 3rd EUROSPEECH 1993, pages 1199-1202. Words in a sentence can be assigned in different ways to linguistic features and linguistic values. Various linguistic features and the associated values are illustrated by way of example in Table 1 (further examples are specified in this reference).
On the basis of linguistic features

$$(f_1, \ldots, f_m) \tag{0-2}$$

and linguistic values

$$(v_{11}, \ldots, v_{1j}) \ldots (v_{m1}, \ldots, v_{mj}) \tag{0-3}$$

each word is allocated at least one linguistic class, the following mapping rule F being applied:

$$(C_1, \ldots, C_k) = F\big((f_1, v_{11}, \ldots, v_{1j}) \ldots (f_m, v_{m1}, \ldots, v_{mj})\big) \tag{0-4}$$

where
$f_m$ denotes a linguistic feature,
m denotes the number of linguistic features,
$v_{m1}, \ldots, v_{mj}$ denote the linguistic values of the linguistic feature $f_m$,
j denotes the number of linguistic values,
$C_i$ denotes the linguistic class with i=1 . . . k,
k denotes the number of linguistic classes, and
F denotes a mapping rule (classifier) of linguistic features and linguistic values onto linguistic classes.
The class of the words with linguistic properties which are unknown or cannot be otherwise mapped constitutes a specific linguistic class in this case.
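A mapping rule F in the sense of equation (0-4) can be pictured, purely as a sketch, as a function that assigns a class index to each distinct combination of feature values. The toy lexicon entries, the feature names, the function names and the use of class 0 for unmappable words are assumptions made for this illustration.

```python
def make_classifier():
    """Build a mapping rule F: feature/value combination -> class index."""
    class_ids = {}

    def F(reading):
        # A reading is a dict {feature: value}; None means "properties unknown".
        if reading is None:
            return 0  # class 0 collects words that cannot otherwise be mapped
        key = tuple(sorted(reading.items()))
        if key not in class_ids:
            class_ids[key] = len(class_ids) + 1
        return class_ids[key]

    return F

# Invented toy lexicon: each word lists one dict per linguistic reading.
lexicon = {
    "der": [
        {"category": "article", "number": "sg", "gender": "masc", "case": "nom"},
        {"category": "article", "number": "sg", "gender": "fem", "case": "gen"},
    ],
    "Bundestag": [
        {"category": "noun", "number": "sg", "gender": "masc", "case": "nom"},
    ],
}

F = make_classifier()

def classes_of(word):
    """Return the class indices allocated to a word (at least one)."""
    readings = lexicon.get(word, [None])
    return [F(r) for r in readings]

article_classes = classes_of("der")
noun_classes = classes_of("Bundestag")
unknown_classes = classes_of("xyz")
print(article_classes, noun_classes, unknown_classes)
```

A word with several readings is thereby allocated several classes, while a word missing from the lexicon falls into the single class reserved for unknown properties.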
An example is explained below for the purpose of illustrating the linguistic class, the linguistic feature, the linguistic value and the class bigram probability.
The starting point is the German sentence (given here in English translation):
“the Bundestag is continuing its debate”.
The article “der” (English: “the”), that is to say the first word, can be subdivided in German into six linguistic classes (from now on simply: classes), the classes being distinguished by number, gender and case. The following Table 2 illustrates this correlation:
Table 3 follows similarly for the German noun “Bundestag” (second word in the above example sentence):
It now follows in this example with regard to class bigrams, that is bigrams applied to linguistic classes, that the class C1, followed by the class C7, constitutes the correct combination of category, number, case and gender with reference to the example sentence. If frequencies of actually occurring class bigrams are determined from prescribed texts, it follows that the above class bigram C1-C7 occurs repeatedly, since this combination is frequent in the German language, whereas other class bigrams, for example the combination C2-C8, are not permissible in the German language because of different genders. The class bigram probabilities resulting from the frequencies found in this way are correspondingly high (in the event of frequent occurrence) or low (if not permissible).
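The determination of class bigram probabilities from observed frequencies can be sketched as follows. It is assumed here, for simplicity, that each word of the prescribed text has already been assigned exactly one class; the class sequences and the labels C9, C20 and C21 are invented for the illustration.

```python
from collections import Counter

def class_bigram_probabilities(class_sequences):
    """Estimate P(C_i | C_{i-1}) as a relative frequency of class pairs."""
    pair_counts, history_counts = Counter(), Counter()
    for seq in class_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    # Keys are (previous class, current class).
    return {(prev, cur): n / history_counts[prev]
            for (prev, cur), n in pair_counts.items()}

# Invented class sequences standing in for a class-tagged text.
tagged = [["C1", "C7", "C20"], ["C1", "C7", "C21"], ["C1", "C9", "C20"]]
probs = class_bigram_probabilities(tagged)

frequent = probs[("C1", "C7")]               # occurs in two of three sentences
impermissible = probs.get(("C2", "C8"), 0.0)  # never observed
print(frequent, impermissible)
```

A frequently observed pair such as C1-C7 receives a high probability, while a pair that never occurs, such as C2-C8, receives none.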
The reference by S. Martin, J. Liermann, H. Ley, titled “Algorithms for Bigram and Trigram Word Clustering”, Speech Communication 24, 1998, pages 19-37, proceeds from statistical properties in forming classes. Such classes have no specific linguistic properties which can be appropriately used in the language model.
The conventional formation of classes is performed manually by employing linguists who sort a language model in accordance with linguistic properties. Such a process is extremely time-consuming and, because it depends on experts, very expensive.
It is accordingly an object of the invention to provide a method and a configuration for forming classes for a language model based on linguistic classes which overcome the above-mentioned disadvantages of the prior art methods and devices of this general type, permitting classes to be formed automatically and without the use of expert knowledge for a language model based on linguistic classes.
With the foregoing and other objects in view there is provided, in accordance with the invention, a method for forming classes for a language model based on linguistic classes using a computer. The method includes the steps of using a first mapping rule to determine N classes using a prescribed vocabulary with associated linguistic properties, determining K classes from the N classes by minimizing a language model entropy, and using the K classes to represent a second mapping rule for forming the classes of language models onto the linguistic classes.
In order to achieve the object, a method is specified for forming classes for a language model based on linguistic classes using a computer, in which a first mapping rule is used to determine a number N of classes by means of a prescribed vocabulary with associated linguistic properties. K classes are determined from the N classes (K < N) by minimizing a language model entropy. A second mapping rule, the formation of classes of the language model, is represented by the K classes.
It is advantageous in this case that classes can be formed in a completely automated fashion. No laborious manual assignment is undertaken by specifically trained experts, nor is the linguistic significance of the classes undermined by statistical measures. The condition that K be less than N substantially reduces the number of classes, and thus yields an effective language model.
In one development, the N classes are determined by forming all possible combinations of linguistic features and associated linguistic values, each of the combinations leading to a dedicated linguistic class. The number N is therefore determined by the maximum possible number of classes (with reference to the basic text).
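The enumeration of all combinations of features and values can be sketched as a Cartesian product. The feature names and values below are invented for the illustration, and the sketch enumerates every combination without checking whether it occurs in the basic text.

```python
from itertools import product

# Invented feature inventory: feature name -> possible linguistic values.
features = {
    "category": ["article", "noun", "verb"],
    "number": ["singular", "plural"],
    "case": ["nominative", "genitive", "dative", "accusative"],
}

def enumerate_classes(features):
    """Return one class (a dict of feature -> value) per combination."""
    names = list(features)
    value_lists = [features[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

classes = enumerate_classes(features)
N = len(classes)  # 3 * 2 * 4 combinations
print(N, classes[0])
```

Because N grows multiplicatively with the number of values per feature, a subsequent reduction to K classes keeps the number of class bigram probabilities to be estimated manageable.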
Another development is to use a linguistic lexicon to determine the linguistic values. Such a linguistic lexicon is available, inter alia, for the German language (see the reference by F. Guethner, P. Maier, titled “Das CISLEX-Wörterbuchsystem” [“The CISLEX Dictionary System”], CIS-Bericht [CIS report] 94-76-CIS, University of Munich, 1994).
Also specified for achieving the object is a method for forming classes for a language model based on linguistic classes by a computer in which a first mapping rule is used to prescribe N classes. K classes are determined from the N classes by minimizing a language model entropy. The K classes are used to represent a second mapping rule for forming classes of language models which are based on linguistic classes.
The K classes are determined in an additional development by carrying out the following steps:
a) a number M of the most probable of the N classes are determined as base classes; and
b) one of the remaining (N−M) classes is merged with that base class for which the language model entropy is minimized.
In this case, the M most probable classes (with reference to the basic text) are determined. The above steps can also be carried out iteratively for a plurality of, or all of, the remaining (N−M) classes.
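Steps a) and b) can be sketched as a greedy procedure. The sketch assumes a text in which every word carries exactly one of the N classes, estimates a class bigram model by relative frequencies on that same text, and evaluates the entropy on it; these simplifications, the function names and the toy data are assumptions of the illustration and are not prescribed by the method.

```python
import math
from collections import Counter

def entropy(tagged, mapping):
    """-(1/n) log P(W) under a class bigram model with relative-frequency
    estimates; the first word only serves as history."""
    words = [(w, mapping[c]) for w, c in tagged]
    class_count = Counter(c for _, c in words)
    word_count = Counter(words)
    pairs = list(zip(words, words[1:]))
    bigram = Counter((p[1], q[1]) for p, q in pairs)
    history = Counter(p[1] for p, _ in pairs)
    log_p = 0.0
    for (_, prev_c), (w, c) in pairs:
        log_p += math.log(word_count[(w, c)] / class_count[c])
        log_p += math.log(bigram[(prev_c, c)] / history[prev_c])
    return -log_p / len(pairs)

def merge_classes(tagged, M):
    """Keep the M most frequent classes as base classes and merge each
    remaining class into the base class that gives the lowest entropy."""
    counts = Counter(c for _, c in tagged)
    ranked = [c for c, _ in counts.most_common()]
    base, rest = ranked[:M], ranked[M:]
    mapping = {c: c for c in ranked}
    for c in rest:
        best = min(base, key=lambda b: entropy(tagged, {**mapping, c: b}))
        mapping[c] = best
    return mapping

# Invented class-tagged toy text with N = 4 classes.
tagged = [("der", "ART"), ("Bundestag", "NOUN"), ("setzt", "VERB"),
          ("die", "ART"), ("Debatte", "NOUN"), ("heute", "ADV"),
          ("der", "ART"), ("Bundestag", "NOUN")]
mapping = merge_classes(tagged, M=2)
print(mapping)
```

The resulting dictionary plays the role of the second mapping rule: it maps each of the N original classes onto one of the K = M remaining classes.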
One embodiment consists in that the language model entropy is determined by the equation

$$H(LM) = -\frac{1}{n} \cdot \log P(W), \tag{1}$$
where
H(LM) denotes the language model entropy of the language model,
n denotes the number of words in the text,
W denotes a chain of words $w_0, w_1, \ldots, w_n$, and
P(W) denotes a probability of the occurrence of a sequence of at least two words.
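As a small numerical sketch of equation (1), the entropy can be accumulated in the logarithmic domain so that the product P(W) never has to be formed explicitly. The use of the base-2 logarithm (entropy in bits) and the per-word probabilities are assumptions of this illustration; the text does not fix the base of the logarithm.

```python
import math

def language_model_entropy(word_probabilities):
    """H(LM) = -(1/n) * log2 P(W), where P(W) is the product of the
    given per-word (conditional) probabilities."""
    n = len(word_probabilities)
    log_p = sum(math.log2(p) for p in word_probabilities)
    return -log_p / n

# Invented per-word probabilities for a text of n = 4 words.
h = language_model_entropy([0.5, 0.25, 0.5, 0.25])
print(h)
```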
Another embodiment consists in that the method described for determining a probability of the occurrence of a sequence of at least two words is used in speech recognition. A language has linguistic classes
$$(C_1, \ldots, C_k) \tag{2}$$

in accordance with

$$(C_1, \ldots, C_k) = F\big((f_1, v_{11}, \ldots, v_{1j}) \ldots (f_m, v_{m1}, \ldots, v_{mj})\big) \tag{3}$$

where
$f_m$ denotes a linguistic feature,
m denotes the number of linguistic features,
$v_{m1}, \ldots, v_{mj}$ denote the linguistic values of the linguistic feature $f_m$,
j denotes the number of linguistic values,
$C_i$ denotes the linguistic class with i=1 . . . k,
k denotes the number of linguistic classes, and
F denotes a mapping rule (classifier) of linguistic features and linguistic values onto linguistic classes.
At least one of the linguistic classes is assigned to a word in this case. A probability P(W) of the occurrence of the sequence of at least two words is yielded using bigrams as

$$P(W) \approx \prod_{i=1}^{n} \sum_{C_i} \sum_{C_{i-1}} P(w_i \mid C_i) \times P(C_i \mid C_{i-1}) \times P(C_{i-1} \mid w_{i-1}) \tag{4}$$
where
W denotes the sequence of at least two words,
$w_i$ denotes the ith word of the sequence W with (i=1 . . . n),
n denotes the number of words $w_i$ in the sequence W,
$C_i$ denotes a linguistic class which belongs to a word $w_i$,
$C_{i-1}$ denotes a linguistic class which belongs to a word $w_{i-1}$,
$\sum_{C_i}$ denotes the sum over all linguistic classes C which belong to a word $w_i$,
$P(w_i \mid C_i)$ denotes the conditional word probability,
$P(C_i \mid C_{i-1})$ denotes the probability of bigrams (also: class bigram probability), and
$P(C_{i-1} \mid w_{i-1})$ denotes the conditional class probability.
It may be noted here that the term $C_i$ relates to one of the at least one linguistic class which is assigned to the word $w_i$ from the word sequence W. The same holds correspondingly for the term $C_{i-1}$. For example, the class bigram probability is the probability that the word $w_i$ belongs to a first linguistic class under the condition that the preceding word $w_{i-1}$ belongs to a second linguistic class (see the introductory example with explanation on this point).
The probabilities $P(w_i \mid C_i)$ and $P(C_i \mid C_{i-1})$, which yield a so-called basic language model when inserted into equation (4), can be determined from a text body, that is to say from a prescribed text of prescribed size.
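Equation (4) can be evaluated, by way of a sketch, with three lookup tables and a table of the classes assigned to each word. The table layout, the function name and all numerical values below are invented for the illustration; the word list is taken to run from $w_0$ to $w_n$, so that every factor has a preceding word.

```python
def sequence_probability(words, classes_of, p_word_given_class,
                         p_class_bigram, p_class_given_word):
    """Evaluate equation (4) for words = [w_0, ..., w_n]."""
    prob = 1.0
    for prev_word, word in zip(words, words[1:]):
        total = 0.0
        for c in classes_of[word]:                 # sum over C_i
            for c_prev in classes_of[prev_word]:   # sum over C_{i-1}
                total += (p_word_given_class[(word, c)]
                          * p_class_bigram[(c, c_prev)]
                          * p_class_given_word[(c_prev, prev_word)])
        prob *= total
    return prob

# Invented toy tables; keys follow the order of the conditional notation.
classes_of = {"der": ["C1", "C2"], "Bundestag": ["C7"]}
p_word_given_class = {("Bundestag", "C7"): 0.01}          # P(w_i | C_i)
p_class_bigram = {("C7", "C1"): 0.4, ("C7", "C2"): 0.0}   # P(C_i | C_{i-1})
p_class_given_word = {("C1", "der"): 0.6, ("C2", "der"): 0.4}

p = sequence_probability(["der", "Bundestag"], classes_of,
                         p_word_given_class, p_class_bigram,
                         p_class_given_word)
print(p)
```

In this toy case only the reading C1 of the article contributes, because the class bigram C2 followed by C7 has probability zero.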
Language models which are based on linguistic classes offer decisive advantages, in particular for adaptation. The method presented here uses the linguistic properties contained in the language models.
One development consists in that, for a new text, a predetermined basic language model is used in order to take over the probability $P(C_i \mid C_{i-1})$ into the basic language model for the new text.
Probabilities for class bigrams of the basic language model (see the reference by P. Witschel, titled “Constructing Linguistic Oriented Language Models for Large Vocabulary Speech Recognition”, 3rd EUROSPEECH 1993, pages 1199-1202 and the explanation in the introduction) represent the grammatical structure of the training text, and are independent of the vocabulary. Assuming that the new domain has a text structure (grammatical structure) similar to that of the original training text for the basic language model, it is expedient to take over the probability for the class bigrams $P(C_i \mid C_{i-1})$ unchanged from the basic language model.
The vocabulary for the new domain, for which a language model is determined, is processed with the aid of a prescribed linguistic lexicon and employing a classifier F in accordance with equation (3). At least one linguistic class is automatically determined for each new word from the text. See the reference by P. Witschel, titled “Constructing Linguistic Oriented Language Models for Large Vocabulary Speech Recognition”, 3rd EUROSPEECH 1993, pages 1199-1202 for a detailed description of linguistic classes, linguistic features and linguistic values, and the reference by F. Guethner, P. Maier, titled “Das CISLEX-Wörterbuchsystem” [“The CISLEX Dictionary System”], CIS-Bericht [CIS report] 94-76-CIS, University of Munich, 1994 for the linguistic lexicon, and/or the introduction, in each case.
Another development relates to determining the probability $P(w_i \mid C_i)$ according to at least one of the following possibilities:
a) the probability $P(w_i \mid C_i)$ is determined with the aid of the text;
b) the probability $P(w_i \mid C_i)$ is determined for a word $w_i$ with the aid of a prescribed probability $P(w_i)$; and
c) the probability $P(w_i \mid C_i)$ is determined by using a word list.
An additional development consists in that the determined probability $P(w_i \mid C_i)$ is used to adapt the basic language model. This is preferably performed in such a way that these determined probabilities $P(w_i \mid C_i)$ are adopted into the basic language model.
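The adaptation can be sketched as follows for possibility a): the class bigram probabilities are taken over unchanged from the basic language model, and only the conditional word probabilities are re-estimated from the new text. The dictionary layout of the model, the assumption of exactly one class per word, and the toy data are inventions of this illustration.

```python
from collections import Counter

def estimate_word_given_class(tagged_text):
    """Estimate P(w | C) by relative frequency from (word, class) pairs."""
    class_counts = Counter(c for _, c in tagged_text)
    pair_counts = Counter(tagged_text)
    return {(w, c): n / class_counts[c] for (w, c), n in pair_counts.items()}

def adapt(basic_model, new_tagged_text):
    """Return a model for the new domain."""
    return {
        # Grammatical structure: taken over unchanged.
        "class_bigram": basic_model["class_bigram"],
        # Vocabulary: re-estimated from the (small) new text.
        "word_given_class": estimate_word_given_class(new_tagged_text),
    }

# Invented basic model and invented new-domain text.
basic_model = {
    "class_bigram": {("NOUN", "ART"): 0.4},
    "word_given_class": {("Bundestag", "NOUN"): 0.01},
}
new_text = [("the", "ART"), ("court", "NOUN"),
            ("the", "ART"), ("verdict", "NOUN")]
adapted = adapt(basic_model, new_text)
print(adapted["word_given_class"])
```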
A further development is to determine the probability $P(C_{i-1} \mid w_{i-1})$ with the aid of the probability $P(w_i \mid C_i)$ as follows (written here for the index i; the same relation holds correspondingly for the index i−1):

$$P(C_i \mid w_i) = K \times P(w_i \mid C_i) \times P(C_i) \tag{5}$$

where

$$K = \left( \sum_{C_i} P(w_i \mid C_i) \times P(C_i) \right)^{-1} \tag{6}$$

denotes a normalization factor.
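Equations (5) and (6) amount to an application of Bayes' rule, which can be sketched as follows. The function name, the table layout and the numerical values are assumptions of the illustration.

```python
def class_given_word(word, p_word_given_class, p_class):
    """Return P(C | word) for every class C according to equations (5), (6)."""
    joint = {c: p_word_given_class.get((word, c), 0.0) * p_class[c]
             for c in p_class}
    norm = sum(joint.values())  # this sum is 1/K in equation (6)
    if norm == 0.0:
        raise ValueError("word has zero probability in every class")
    return {c: value / norm for c, value in joint.items()}

# Invented toy probabilities.
p_class = {"C1": 0.1, "C2": 0.2}
p_word_given_class = {("der", "C1"): 0.6, ("der", "C2"): 0.1}
posterior = class_given_word("der", p_word_given_class, p_class)
print(posterior)
```

The normalization ensures that the conditional class probabilities of a word sum to one over all classes to which the word can belong.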
Another development relates to recognizing an appropriate sequence of at least one word if the probability P(W) is above a prescribed bound. A prescribed action is carried out if this is not the case. The prescribed action is, for example, outputting an error message or stopping the method.
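The comparison with a prescribed bound can be sketched as follows; the function name and the callback used for the prescribed action are hypothetical.

```python
def accept_sequence(words, probability, bound, on_reject):
    """Return the word sequence if P(W) exceeds the bound; otherwise
    carry out the prescribed action supplied as a callback."""
    if probability > bound:
        return words
    return on_reject(words, probability)

def report_error(words, probability):
    # Example of a prescribed action: produce an error message.
    return "rejected: P(W) = {} is not above the bound".format(probability)

accepted = accept_sequence(["der", "Bundestag"], 0.0024, 0.001, report_error)
rejected = accept_sequence(["der", "Bundestag"], 0.0002, 0.001, report_error)
print(accepted, rejected)
```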
In another development, the text relates to a prescribed application field, what is termed a (language, application) domain.
It is particularly advantageous in this case that the method presented requires a new text of only small size to determine a language model of a new domain.
It is also advantageous that lists of new words (with or without specification of the probability $P(w_i)$) can be used. Domain-referred speech recognition plays an important role in practice. The method therefore meets a real demand and has proved in experiments to be suitable and extremely useful. By falling back on the basic language model, the number of probabilities to be estimated anew is substantially reduced (only $P(w_i \mid C_i)$ needs to be estimated).
Furthermore, in order to achieve the object, a configuration for forming classes for a language model based on linguistic classes is specified which has a processor unit, which processor unit is set up or programmed in such a way that:
a) using a first mapping rule, a number N of classes can be determined by use of a prescribed vocabulary with associated linguistic properties;
b) K classes are determined from the N classes by minimizing a language model entropy; and
c) the K classes are used to produce a second mapping rule for forming classes of language models into linguistic classes.
Also specified for the purpose of achieving the object is a configuration for forming classes for a language model based on linguistic classes in the case of which a processor unit is provided which is set up or programmed in such a way that:
a) N classes can be prescribed using a first mapping rule;
b) K classes are determined from the N classes by minimizing a language model entropy; and
c) the K classes are used to produce a second mapping rule for forming classes of language models into linguistic classes.
These configurations are particularly suitable for carrying out the method according to the invention or one of its previously explained developments.
Other features which are considered as characteristic for the invention are set forth in the appended claims.
Although the invention is illustrated and described herein as embodied in a method and a configuration for forming classes for a language model based on linguistic classes, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made therein without departing from the spirit of the invention and within the scope and range of equivalents of the claims.
The construction and method of operation of the invention, however, together with additional objects and advantages thereof will be best understood from the following description of specific embodiments when read in connection with the accompanying drawings.