The invention relates to a method for creating a vocabulary and/or statistical language model from a textual training corpus for subsequent use by a pattern recognition system.
The invention further relates to a system for creating a vocabulary and/or a statistical language model for subsequent use by a pattern recognition system; the system comprising means for creating the vocabulary and/or statistical language model from a textual training corpus.
The invention also relates to a pattern recognition system for recognising a time-sequential input pattern using a vocabulary and/or statistical language model; the pattern recognition system comprising the system for creating a vocabulary and/or statistical language model from a textual training corpus.
Pattern recognition systems, such as large vocabulary continuous speech recognition systems or handwriting recognition systems, typically use a vocabulary to recognise words and a language model to improve the basic recognition result. FIG. 1 illustrates a typical large vocabulary continuous speech recognition system 100 [see L. Rabiner, B-H. Juang, "Fundamentals of speech recognition", Prentice Hall 1993, pages 434 to 454]. The system 100 comprises a spectral analysis subsystem 110 and a unit matching subsystem 120. In the spectral analysis subsystem 110 the speech input signal (SIS) is spectrally and/or temporally analysed to calculate a representative vector of features (observation vector, OV). Typically, the speech signal is digitised (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames, corresponding to, for instance, 32 msec. of speech signal. Successive frames partially overlap, for instance, by 16 msec. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector). The feature vector may, for instance, have 24, 32 or 63 components. In the unit matching subsystem 120, the observation vectors are matched against an inventory of speech recognition units. A speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit. A word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references. For systems in which a whole word is represented by a speech recognition unit, a direct relationship exists between the word model and the speech recognition unit.
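As a rough illustration of the blocking step described above, the following Python sketch groups samples into partially overlapping frames. The function name `block_into_frames` and its parameters are hypothetical; the default figures (6.67 kHz sampling, 32 msec frames, 16 msec overlap) are the example values given in the text. Pre-emphasis and feature extraction such as LPC analysis are not shown.

```python
def block_into_frames(samples, sample_rate_hz=6670, frame_ms=32, overlap_ms=16):
    """Group consecutive samples into partially overlapping frames.

    All names and defaults are illustrative assumptions taken from the
    example figures in the text. Trailing samples that do not fill a
    complete frame are dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)       # samples per frame
    overlap_len = int(sample_rate_hz * overlap_ms / 1000)   # samples shared by neighbours
    step = frame_len - overlap_len                          # shift between frame starts
    if step <= 0:
        raise ValueError("overlap must be shorter than the frame")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]


# One second of dummy samples, blocked with the example values.
frames = block_into_frames(list(range(6670)))
```

With these values each frame holds 213 samples and consecutive frames start 107 samples apart, so roughly half of each frame is shared with its successor.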
Other systems, in particular large vocabulary systems, may use for the speech recognition unit linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. For such systems, a word model is given by a lexicon 134, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 132, describing sequences of acoustic references of the involved speech recognition unit. A word model composer 136 composes the word model based on the sub-word model 132 and the lexicon 134. FIG. 2A illustrates a word model 200 for a system based on whole-word speech recognition units, where the speech recognition unit of the shown word is modelled using a sequence of ten acoustic references (201 to 210). FIG. 2B illustrates a word model 220 for a system based on sub-word units, where the shown word is modelled by a sequence of three sub-word models (250, 260 and 270), each with a sequence of four acoustic references (251, 252, 253, 254; 261 to 264; 271 to 274). The word models shown in FIG. 2 are based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech and handwriting signals. Using this model, each recognition unit (word model or sub-word model) is typically characterised by an HMM, whose parameters are estimated from a training set of data. For large vocabulary speech recognition systems involving, for instance, 10,000 to 60,000 words, usually a limited set of, for instance, 40 sub-word units is used, since it would require a lot of training data to adequately train an HMM for larger units. An HMM state corresponds to an acoustic reference (for speech recognition) or an allographic reference (for handwriting recognition). Various techniques are known for modelling a reference, including discrete or continuous probability densities.
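The role of the word model composer can be sketched in Python as a simple concatenation: the lexicon gives the sub-word units of a word, and each sub-word model gives a sequence of acoustic references. The function name, the unit labels and the use of the reference numerals of FIG. 2B as placeholder identifiers are assumptions made for illustration only; HMM transition and emission parameters are left out.

```python
def compose_word_model(word, lexicon, subword_models):
    """Concatenate the acoustic references of a word's sub-word units.

    lexicon maps a word to its sequence of sub-word units;
    subword_models maps a unit to its sequence of acoustic references.
    Both structures are hypothetical stand-ins for items 134 and 132.
    """
    references = []
    for unit in lexicon[word]:
        references.extend(subword_models[unit])
    return references


# Placeholder data mirroring FIG. 2B: three sub-word units, four references each.
lexicon = {"word": ["u250", "u260", "u270"]}
subword_models = {
    "u250": [251, 252, 253, 254],
    "u260": [261, 262, 263, 264],
    "u270": [271, 272, 273, 274],
}
word_model = compose_word_model("word", lexicon, subword_models)
```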
A word level matching system 130 matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vector and a sequence. If sub-word units are used, constraints are placed on the matching by using the lexicon 134 to limit the possible sequence of sub-word units to sequences in the lexicon 134. This reduces the outcome to possible sequences of words. A sentence level matching system 140 uses a language model (LM) to place further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model. In this way, the outcome of the unit matching subsystem 120 is a recognised sentence (RS). The language model used in pattern recognition may include syntactical and/or semantical constraints 142 of the language and the recognition task. A language model based on syntactical constraints is usually referred to as a grammar 144.
Similar systems are known for recognising handwriting. The language model used for a handwriting recognition system may specify character sequences in addition to, or as an alternative to, word sequences.
The grammar 144 used by the language model provides the probability of a word sequence W = w_1 w_2 w_3 ... w_q, which in principle is given by:
P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_q | w_1 w_2 w_3 ... w_{q-1}).
Since in practice it is infeasible to reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language, N-gram word models are widely used. In an N-gram model, the term P(w_j | w_1 w_2 w_3 ... w_{j-1}) is approximated by P(w_j | w_{j-N+1} ... w_{j-1}). In practice, bigrams or trigrams are used. In a trigram, the term P(w_j | w_1 w_2 w_3 ... w_{j-1}) is approximated by P(w_j | w_{j-2} w_{j-1}).
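The N-gram approximation amounts to truncating each word's history to the preceding N-1 words before looking up its conditional probability. The Python sketch below shows this for a whole word sequence. The function `sequence_probability` and the callable `cond_prob` are hypothetical names; how the conditional probabilities are obtained is deliberately left open here.

```python
def sequence_probability(words, cond_prob, n=3):
    """Approximate P(W) as a product of N-gram conditional probabilities.

    cond_prob(word, history) is a hypothetical callable returning
    P(word | history), where history is a tuple holding at most the
    n-1 words that precede the word.
    """
    prob = 1.0
    for j, word in enumerate(words):
        # Keep only the last n-1 words of the full history w_1 ... w_{j-1}.
        history = tuple(words[max(0, j - (n - 1)):j])
        prob *= cond_prob(word, history)
    return prob
```

For a trigram model (n=3) and the sequence a b c d, the histories passed to `cond_prob` are (), (a), (a, b) and (b, c).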
The invention relates to recognition systems which use a vocabulary and/or a language model which can, preferably automatically, be built from a textual training corpus. A vocabulary can be simply retrieved from a document by collecting all different words in the document. The set of words may be reduced, for instance, to words which occur frequently in the document (in absolute terms or in relative terms, for example relative to other words in the document, or relative to a frequency of occurrence in default documents).
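Collecting a vocabulary and reducing it by absolute frequency can be sketched as follows. The function name, the `min_count` threshold and the simple regular-expression tokenisation are assumptions for illustration; the text does not prescribe a particular tokeniser or threshold.

```python
import re
from collections import Counter


def build_vocabulary(text, min_count=1):
    """Collect the distinct words of a text, keeping those seen at least min_count times.

    Tokenisation (lower-casing, letters and apostrophes only) is a
    simplifying assumption, not a method stated in the text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {word for word, count in counts.items() if count >= min_count}


vocabulary = build_vocabulary("The cat saw the dog. The dog ran.", min_count=2)
```

Here only "the" and "dog" occur at least twice, so they form the reduced vocabulary.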
A way of automatically building an N-gram language model is to estimate the conditional probabilities P(w_j | w_{j-N+1} ... w_{j-1}) by a simple relative frequency: F(w_{j-N+1} ... w_{j-1} w_j) / F(w_{j-N+1} ... w_{j-1}), in which F is the number of occurrences of the string in its argument in the given textual training corpus. For the estimate to be reliable, F(w_{j-N+1} ... w_{j-1} w_j) has to be substantial in the given corpus. One way of achieving this is to use an extremely large training corpus, which covers most relevant word sequences. This is not a practical solution for most systems, since the language model becomes very large (resulting in a slow or degraded recognition and high storage requirements). Another approach is to ensure that the training corpus is representative of many words and word sequences used for a specific recognition task. This can be achieved by manually collecting documents relevant for a specific category of user, such as a radiologist, a surgeon or a legal practitioner. However, such an approach is not possible for recognition systems targeted towards users whose specific interests are not known in advance. Moreover, if a user develops a new interest, a default provided vocabulary and language model will not reflect this, resulting in a degraded recognition result.
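The relative-frequency estimate can be written down directly from the counts F. The Python sketch below is a minimal version under stated assumptions: the corpus is already tokenised into a list of words, the function names are hypothetical, and no smoothing is applied, so unseen histories simply yield 0.0.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (the function F in the text)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequency(tokens, n=2):
    """Return an estimator for P(word | history) = F(history word) / F(history).

    Assumes n >= 2 and a pre-tokenised corpus; unseen histories give 0.0.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    full = ngram_counts(tokens, n)
    hist = ngram_counts(tokens, n - 1)

    def estimate(word, history):
        history = tuple(history)
        denominator = hist[history]
        return full[history + (word,)] / denominator if denominator else 0.0

    return estimate


estimate = relative_frequency("a b a b a c".split(), n=2)
p_b_given_a = estimate("b", ("a",))   # F(a b) / F(a) = 2 / 3
```

With so few occurrences the estimate is clearly unreliable, which is the point made above about F needing to be substantial.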
It is an object of the invention to create a vocabulary and/or language model which is better tailored to specific user interests. A further object is to create a vocabulary and/or language model which allows improved or faster recognition.
To achieve the object, the method comprises the steps of determining at least one context identifier; deriving at least one search criterion from the context identifier; selecting documents from a set of documents based on the search criterion; and composing the training corpus from the selected documents. By searching for documents based on a search criterion derived from a context identifier, pertinent documents are collected in an effective way, ensuring that pertinent language elements are covered. This increases the quality of recognition. Moreover, many irrelevant language elements are excluded, allowing the creation of a relatively small vocabulary or language model. This in turn can lead to a faster recognition or, alternatively, improve the recognition rate by adding more elements, such as acoustic data or allographic data, in other parts of the recognition system.
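The four steps of the method can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the context identifier is taken to be a string of keywords, the search criterion is "contains at least one keyword", and the set of documents is an in-memory list of strings. The function name `compose_training_corpus` is hypothetical.

```python
def compose_training_corpus(context_identifier, documents):
    """Build a training corpus from documents matching the context identifier.

    Assumes a keyword-string context identifier and a whole-word keyword
    match as the search criterion; both are illustrative choices.
    """
    # Steps 1 and 2: derive the search criterion (a set of lower-cased keywords).
    keywords = {keyword.lower() for keyword in context_identifier.split()}
    # Step 3: select the documents that contain at least one keyword.
    selected = [doc for doc in documents if keywords & set(doc.lower().split())]
    # Step 4: compose the training corpus from the selected documents.
    return "\n".join(selected)


documents = [
    "The radiology report shows a fracture",
    "Stock prices fell today",
    "Fracture healing takes weeks",
]
corpus = compose_training_corpus("fracture radiology", documents)
```

The unrelated second document is left out, so its words do not enlarge the resulting vocabulary or language model.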
In an embodiment according to the invention, the context identifier comprises a keyword, which acts as the search criterion. For instance, the (prospective) user of a pattern recognition system specifies one or more keywords, based on which the documents are selected.
In another embodiment according to the invention, the context identifier indicates a sequence of words, such as a phrase or a text. From this sequence of words, one or more keywords are extracted, which act as the search criterion. For instance, the (prospective) user of a pattern recognition system specifies one or more documents representative of his interests. Keywords are extracted from the documents, and additional documents are selected based on the keywords. In this way, the user is relieved from choosing keywords.
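One possible way to extract keywords from a user-supplied text is to take its most frequent words after discarding common function words. The text does not state how keywords are extracted, so the frequency-based approach, the function name `extract_keywords` and the small stop-word list below are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real system would need a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def extract_keywords(text, max_keywords=3):
    """Return up to max_keywords of the most frequent non-stop-words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(max_keywords)]


keywords = extract_keywords("The scan of the liver and the scan of the kidney", max_keywords=1)
```

The extracted keywords can then serve as the search criterion for selecting additional documents.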
In another embodiment according to the invention, the set of documents is formed by a document database or document file system. As an example, a large volume storage system, such as a CD-ROM or DVD, containing a large and diverse set of documents may be supplied with the pattern recognition system, allowing the (prospective) user to select pertinent documents from this set.
In another embodiment according to the invention, the set of documents is formed by documents in a distributed computer system. This allows for centrally storing (e.g. in a server) a larger set of documents than would normally be feasible to store or provide to a client computer on which the pattern recognition system is to be executed. Alternatively, a very large set of documents may be distributed over several servers. A good example of this last situation is the Internet. Particularly if a system like the Internet is used, many of the selected documents will reflect the language used at that moment, allowing for an up-to-date vocabulary and/or language model to be created.
In another embodiment according to the invention, a network search engine, like those commonly used on the Internet, is used to identify relevant documents based on the search criteria supplied to the search engine.
In another embodiment according to the invention, a network search agent, which autonomously searches the distributed computer system based on the search criterion, is used to identify relevant documents and, optionally, to retrieve them.
In another embodiment according to the invention, the training corpus is updated at a later moment by selecting at least one further document from the set of documents and combining the further document with at least one previously selected document to form the training corpus. Particularly, if such updating is based on recent documents (e.g. retrieved via the Internet), the language model can be kept up-to-date.
To achieve the object, the pattern recognition system is characterised in that the system comprises: means for determining at least one context identifier; means for deriving at least one search criterion from the context identifier; means for selecting documents from a set of documents based on the search criterion; and means for composing the training corpus from the selected documents. These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings.