1. Field of the Invention
The present invention relates to an apparatus for generating a statistical class sequence model, called a class bi-multigram model, from input strings of discrete-valued units, where bigram dependencies are assumed between adjacent sequences including first and second sequences, the first sequence consisting of a variable number of units $N_1$ and the second sequence consisting of a variable number of units $N_2$, and where class labels are assigned to the sequences regardless of their length. The present invention also relates to an apparatus for generating a statistical class language model, which is an application of the apparatus for generating the statistical class sequence model, and further relates to a speech recognition apparatus using the apparatus for generating the statistical language model.
2. Description of the Prior Art
The role of a statistical language model in the context of speech recognition is to provide an estimate of the likelihood of any string of words. This likelihood value is used within a speech recognition system to help select the sentence most likely uttered by the speaker. A statistical language model specifies the kind of dependencies assumed between the words in a sentence. Based on the model's assumptions, the likelihood of a sentence can be expressed as a parametric function of the words forming the sentence. The language model is fully specified once all the parameters of the likelihood function have been estimated using a given optimization criterion. So far the most widely used language model is the so-called N-gram model (N being a natural number), where the assumption is made that each word in a sentence depends on the (N-1) preceding words. As a result of this assumption, the likelihood function of a sentence $W = w_1^L = w_1, w_2, \ldots, w_L$ is computed as follows:
$$\mathcal{L}(W) = \prod_{t=1}^{L} p(w_t \mid w_{t-N+1}, \ldots, w_{t-1})$$
where $w_t$ is the t-th word in the sentence $W$. The parameters of the N-gram model are the conditional probabilities $p(w_{i_N} \mid w_{i_1}, w_{i_2}, \ldots, w_{i_{N-1}})$ for all the words $w_{i_1}, \ldots, w_{i_N}$ of the vocabulary. These conditional probabilities are usually estimated according to a Maximum Likelihood criterion, as the relative frequencies of all the N-tuples of words observed in some large training database. However, the size of the training database being necessarily limited, not all possible combinations of words can be observed a sufficient number of times for reliable statistics to be collected.
The Prior Art References cited in this description are as follows:
(a) Prior Art Reference 1: Klaus Ries et al., "Class phrase models for language modeling", Proceedings of ICSLP 96, 1996;
(b) Prior Art Reference 2: Hirokazu Masataki et al., "Variable-order n-gram generation by word class splitting and consecutive word grouping", Proceedings of ICASSP 96, 1996;
(c) Prior Art Reference 3: Shoichi Matsunaga et al., "Variable-length language modeling integrating global constraints", Proceedings of EUROSPEECH 97, 1997; and
(d) Prior Art Reference 4: Sabine Deligne et al., "Introducing statistical dependencies and structural constraints in variable length sequence models", in Grammatical Inference: Learning Syntax from Sentences, Lecture Notes in Artificial Intelligence 1147, pp. 156-167, Springer 1996.
One limitation of the N-gram model is the assumption of a fixed-length dependency between the words, which obviously is not a valid assumption for natural language data. Moreover, increasing the value of N to capture longer-spanning dependencies between the words, and thus to increase the predictive capability of the model, considerably increases the size of the model in terms of the number of N-gram entries. This makes it difficult to obtain reliable estimates for all the N-gram probabilities, and it also increases the complexity of the search algorithm during speech recognition.
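As an illustration of the relative-frequency estimate and of the sparsity problem described above, the following is a minimal sketch for the bigram case (N = 2). The toy training sentences, the boundary markers `<s>` and `</s>`, and all function names are assumptions made for this example and do not come from the cited references.

```python
from collections import Counter

# Hypothetical sentence boundary markers, added so that the first word
# also has a history and the sentence end is modeled.
BOS, EOS = "<s>", "</s>"


def train_bigram(sentences):
    """Count bigrams and their one-word histories in the training sentences."""
    bigrams = Counter()
    histories = Counter()
    for sentence in sentences:
        words = [BOS] + sentence.split() + [EOS]
        for previous, word in zip(words, words[1:]):
            bigrams[(previous, word)] += 1
            histories[previous] += 1
    return bigrams, histories


def bigram_probability(word, previous, bigrams, histories):
    """Relative-frequency (Maximum Likelihood) estimate of p(word | previous)."""
    if histories[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / histories[previous]


def sentence_likelihood(sentence, bigrams, histories):
    """Product of the bigram probabilities over the words of the sentence."""
    words = [BOS] + sentence.split() + [EOS]
    likelihood = 1.0
    for previous, word in zip(words, words[1:]):
        likelihood *= bigram_probability(word, previous, bigrams, histories)
    return likelihood


# Toy training database (illustrative only).
training = ["thank you", "thank you very much", "thank you for coming"]
bigrams, histories = train_bigram(training)

# The number of possible N-gram entries grows as |V| ** N.
vocabulary = {word for sentence in training for word in sentence.split()}
possible = {n: len(vocabulary) ** n for n in (1, 2, 3)}

seen_likelihood = sentence_likelihood("thank you very much", bigrams, histories)
unseen_likelihood = sentence_likelihood(
    "thank you very much for coming", bigrams, histories
)

print(possible)           # {1: 6, 2: 36, 3: 216}
print(len(bigrams))       # 9 distinct bigrams observed
print(unseen_likelihood)  # 0.0
```

With only six vocabulary words, 216 trigrams are already possible while the training data contains just 9 distinct bigrams. The second test sentence receives a likelihood of zero solely because the bigram "much for" never occurred in training, which is the unreliable-statistics problem in miniature.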
As far as modeling assumptions are concerned, phrase based models can be either deterministic or stochastic. In a deterministic model, there is no ambiguity on the parse of a sentence into phrases, whereas in a stochastic model various ways of parsing a sentence into phrases remain possible. For this reason, stochastic models can be expected to exhibit better generalization capabilities than deterministic models. For example, assuming that the sequence [bcd] is in the inventory of sequences of the model, then, in the context of a deterministic model, the string "b c d" will be parsed as being a single sequence "[bcd]". On the other hand, in the context of a stochastic model, the possibility of parsing the string "b c d" as "[b] [c] [d]", "[b] [cd]" or "[bc] [d]" also remains. Class versions of phrase based models can be defined in a way similar to the way class versions of N-gram models are defined, i.e., by assigning class labels to the phrases. In the prior art, this consists of first assigning word class labels to the words, and then defining a phrase class label for each distinct phrase of word class labels. A drawback of this approach is that only phrases of the same length can be assigned the same class label. For example, the phrases "thank you" and "thank you very much" cannot be assigned the same class label because, being of different lengths, they lead to different sequences of word class labels.
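The difference between the two kinds of parse can be sketched as follows. The inventory below contains [bcd] as in the example above; the shorter sequences [b], [c], [d], [bc] and [cd] are assumed to be in the inventory as well, since the alternative parses require them. The longest-match rule used for the deterministic case is only one hypothetical way of obtaining a single unambiguous parse; the text does not specify how a deterministic model selects its parse.

```python
def segmentations(units, inventory, max_len):
    """Return every split of `units` into consecutive sequences of the inventory."""
    if not units:
        return [[]]
    parses = []
    for length in range(1, min(max_len, len(units)) + 1):
        head = tuple(units[:length])
        if head in inventory:
            for rest in segmentations(units[length:], inventory, max_len):
                parses.append([head] + rest)
    return parses


def longest_match_parse(units, inventory, max_len):
    """Single parse obtained by always taking the longest matching sequence
    (an assumed stand-in for a deterministic model)."""
    parse, start = [], 0
    while start < len(units):
        for length in range(min(max_len, len(units) - start), 0, -1):
            candidate = tuple(units[start:start + length])
            if candidate in inventory:
                parse.append(candidate)
                start += length
                break
        else:
            raise ValueError(f"no sequence of the inventory starts at position {start}")
    return parse


def show(parse):
    """Format a parse in the bracket notation used in the text."""
    return " ".join("[" + "".join(sequence) + "]" for sequence in parse)


# Assumed inventory of sequences (illustrative only).
inventory = {("b",), ("c",), ("d",), ("b", "c"), ("c", "d"), ("b", "c", "d")}
units = ["b", "c", "d"]

stochastic = [show(p) for p in segmentations(units, inventory, max_len=3)]
deterministic = show(longest_match_parse(units, inventory, max_len=3))

print(stochastic)     # ['[b] [c] [d]', '[b] [cd]', '[bc] [d]', '[bcd]']
print(deterministic)  # [bcd]
```

A stochastic model keeps all four parses and weights them by their probabilities, whereas the deterministic stand-in commits to "[bcd]" alone.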
Prior Art References 1 to 4 listed above disclose phrase based models and/or class phrase based models, as follows.
Prior Art References 1, 2 and 3 disclose deterministic models, wherein there is only one way to parse the input strings of units. This approach can be expected to demonstrate little generalization ability, because unseen test strings are forced to be parsed in the same way as the training strings were parsed, and this parsing may not be an optimal one.
Prior Art References 1, 2 and 3 disclose models, the parameters of which are estimated with heuristic procedures, namely greedy algorithms where words are incrementally grouped into sequences, for which monotone convergence towards an optimum cannot be theoretically guaranteed.
Prior Art Reference 4 discloses a stochastic sequence model, where no means of classification is provided to assign class labels to the sequences.
Prior Art Reference 1 discloses a class sequence model, where the estimation of the sequence distribution and the assignment of the class labels to the sequences are performed independently, so that there is no guarantee that the estimation of the sequence distribution and the assignment of the class labels to the sequences are optimal with respect to each other.
Prior Art Reference 1 discloses a class sequence model, where only sequences of the same length can be assigned the same sequence class label. For example, the sequences "thank you for" and "thank you very much for" cannot share a common class model, because they are of different lengths.
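The length restriction follows directly from labeling a phrase by its sequence of word class labels, as the short sketch below suggests. The word class lexicon and its class names are hypothetical and serve only to make the example concrete; they are not taken from Prior Art Reference 1.

```python
# Hypothetical word class lexicon; the class names are illustrative only.
WORD_CLASS = {
    "thank": "VERB", "you": "PRON", "them": "PRON",
    "very": "ADV", "much": "ADV", "for": "PREP",
}


def phrase_class_label(phrase, word_class=WORD_CLASS):
    """Label a phrase by the sequence of the class labels of its words."""
    return tuple(word_class[word] for word in phrase.split())


short = phrase_class_label("thank you for")
variant = phrase_class_label("thank them for")
longer = phrase_class_label("thank you very much for")

print(short)             # ('VERB', 'PRON', 'PREP')
print(longer)            # ('VERB', 'PRON', 'ADV', 'ADV', 'PREP')
print(short == variant)  # True: same length, same word classes
print(short == longer)   # False: labels of different lengths can never match
```

Because the label has exactly one entry per word, two phrases of different lengths necessarily receive different labels, whatever word classes are chosen.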