The invention relates to a small vocabulary pattern recognition system for recognizing a sequence of words; the vocabulary storing a representation of a plurality of reference words; the system comprising:
input means for receiving a time-sequential input pattern representative of a spoken or written word sequence;
a pattern recognizer comprising a word-level matching unit for generating a plurality of sequences of words by statistically comparing the input pattern to the representations of the reference words of the vocabulary.
Increasingly, use is made of small vocabulary pattern recognition systems for recognizing sequences of words, such as digit strings or command sequences. Such systems are, for instance, used for voice control of communication, computer or audio/video equipment. As an example, a user can make a phone call by speaking a telephone number, possibly followed by a spoken “dial” instruction. Also a computer operating system and the various application programs can be operated via voice commands. Besides being used for recognizing speech representative input, the invention also covers small vocabulary character/word recognition systems, like handwriting recognition systems, where the input signal represents a written or printed character/word. The system may, for instance, be used for recognizing written/typed digit strings like account numbers. A small vocabulary system typically has a vocabulary in the range of up to a few hundred entries, referred to as words. In fact, such a word may represent a single character, like a digit for digit string recognition, or a command, which actually may be formed by more than one spoken/printed word (such as “save file”), for recognition of command sequences. Normally recognition of an input pattern, such as sampled speech or handwriting, takes place in two steps. In the first step, a segment of the input signal which represents a word is compared against trained material. Since variations occur in speaking, writing, or printing of words, the first step comparison results in identifying several possible words of the vocabulary which in a statistical sense match the input signal segment. Consequently, the first step recognition of an input signal results in identifying several sequences of candidate words. These sequences may be represented using a graph. Usually, the sequences have been given a statistical likelihood reflecting how well the input pattern matches the individual reference words.
In a second step, a most likely sequence is selected based on the likelihood of the sequence (in combination with the already established likelihood of the individual word-matching). For large vocabulary systems, the second step is usually based on a statistical language model, which provides statistical information on occurrence of a word or word sequence in a typical text. Such a system is disclosed in L. Rabiner, B.-H. Juang, “Fundamentals of speech recognition”, Prentice Hall 1993, pages 434 to 454. Frequently, so-called bigrams are used which specify the likelihood of occurrence of a word pair. The language model is built up-front by analyzing large text corpora with several millions of words representative of the word sequences to be recognized. In some systems, the built-in language model can be updated during use of the system.
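The second step described above can be illustrated by a minimal sketch, assuming that the word-level matching delivers one log-likelihood per candidate word and that the language model is a simple table of bigram probabilities. The function name `sequence_score`, the `floor` value for unseen word pairs and all numbers are hypothetical and serve only as illustration.

```python
import math


def sequence_score(words, word_log_likelihoods, bigram_prob, floor=1e-6):
    """Add bigram log-probabilities to the word-matching log-likelihoods.

    `floor` is an assumed probability for word pairs missing from the table.
    """
    score = sum(word_log_likelihoods)
    for prev, cur in zip(words, words[1:]):
        score += math.log(bigram_prob.get((prev, cur), floor))
    return score


# Two candidate sequences from the first step (illustrative values only).
candidates = [
    (["save", "file"], [-1.2, -0.9]),
    (["safe", "file"], [-1.0, -0.9]),
]
bigrams = {("save", "file"): 0.2, ("safe", "file"): 0.001}

best_words, _ = max(
    candidates, key=lambda c: sequence_score(c[0], c[1], bigrams)
)
```

In this sketch the acoustically slightly better candidate is overruled because its word pair is far less likely according to the bigram table.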
For small vocabulary systems, the initial identification of word candidates is simpler than for large vocabulary systems since the vocabulary and the amount of trained material is smaller. For instance, for recognizing a digit string, such as a telephone number, the vocabulary can be as small as the ten digits. However, the second step of selecting and filtering between possible sequences is difficult to perform for many applications. The number of different telephone numbers occurring in a country or even worldwide is huge. Moreover, besides a few frequently used numbers, many numbers are used with a similar frequency, resulting in a low level of statistical discrimination. Similarly, for command and control of a computer a user can select between a very large number of valid command sequences and hardly any a-priori knowledge exists of frequently used sequences. Therefore, it is difficult to create and use a conventional large vocabulary language model for most small vocabulary systems. Instead, small vocabulary systems may use finite state models, where a state corresponds to a word, to restrict the possible word sequences to transitions of the model. Typically, all words have been assigned an equal likelihood and no distinction in likelihood is made between word sequences allowed according to the finite state model.
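A finite state model of the kind mentioned above may be sketched as a table of allowed transitions, assuming one state per word and a hypothetical start state named `<start>`. The command names and the function `is_allowed` are illustrative assumptions. Note that the model only accepts or rejects a sequence; it does not rank the accepted sequences.

```python
# Hypothetical transition table: each state (word) maps to its allowed successors.
TRANSITIONS = {
    "<start>": {"open explorer", "save file"},
    "open explorer": {"favorites"},
    "favorites": {"stock", "news"},
}


def is_allowed(words, transitions, start="<start>"):
    """Return True if every word follows its predecessor via an allowed transition."""
    state = start
    for word in words:
        if word not in transitions.get(state, set()):
            return False
        state = word
    return True
```

Both `["open explorer", "favorites", "stock"]` and `["open explorer", "favorites", "news"]` are accepted and receive no difference in likelihood, which is the limitation addressed by the invention.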
It is an object of the invention to provide a small vocabulary pattern recognition system of the kind set forth which is better capable of selecting between candidate sequences of words.
To meet the object of the invention, the system includes a cache for storing a plurality of most recently recognized words; and the pattern recognizer comprises a sequence-level matching unit for selecting a word sequence from the plurality of sequences of words in dependence on a statistical language model which provides a probability of a sequence of M words, M≥2; the probability depending on a frequency of occurrence of the sequence in the cache. By using a cache, the system keeps track of the most recent behavior of the user. Although the total number of word sequences, such as telephone numbers, may be huge and it may be difficult to statistically discriminate between the numbers in a general manner, this tends not to be the case for individual users. For instance, usually the set of telephone numbers used by an individual is limited to fewer than a hundred. Moreover, some numbers are used much more frequently than others. Similarly, for command and control it may be difficult to establish generally used sequences of commands. However, many individual users have certain preferred ways of operating equipment. This typical user behavior can be ‘captured’ effectively in the cache. For instance, a user who regularly views a web page on stocks will probably regularly issue the command sequence “‘open explorer’, ‘favorites’, ‘stock’”. By storing this sequence of three commands in the cache, this sequence can be selected as being more likely than most other 3-command sequences. By using the data stored in the cache for the language model, a language model is used which adapts to the individual user and to recent behavior of the user. Preferably, a word sequence is only stored in the cache if the word sequence has been ‘successfully’ recognized, e.g. the recognized telephone number resulted in a telephone connection being made.
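By way of illustration, the cache may be sketched as a bounded store of recently recognized sequences from which the frequency of occurrence of an M-word sequence is counted. The class `SequenceCache`, its method names, the `successful` flag and the choice to store whole sequences rather than a flat list of words are assumptions made for this sketch only.

```python
from collections import deque


class SequenceCache:
    """Hypothetical cache of the most recently recognized word sequences."""

    def __init__(self, max_sequences=100):
        # The oldest sequence is dropped automatically once the cache is full.
        self._sequences = deque(maxlen=max_sequences)

    def add(self, words, successful=True):
        """Store a sequence only if it was successfully recognized."""
        if successful:
            self._sequences.append(tuple(words))

    def count(self, ngram):
        """Count how often the M-word sequence `ngram` occurs in the cache."""
        ngram = tuple(ngram)
        m = len(ngram)
        return sum(
            1
            for seq in self._sequences
            for i in range(len(seq) - m + 1)
            if seq[i:i + m] == ngram
        )


cache = SequenceCache(max_sequences=100)
cache.add(["open explorer", "favorites", "stock"])
cache.add(["open explorer", "favorites", "stock"])
cache.add(["open explorer", "favorites", "sports"], successful=False)
```

Here the three-command sequence ending in ‘stock’ is counted twice, whereas the unsuccessful sequence never enters the cache.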
In an embodiment as defined in the dependent claim 2, a backing-off strategy is used where the language model provides a non-zero probability for both cache-hits and cache-misses. In this way, word sequences which result in a cache miss still have a reasonable chance of being selected and not being suppressed by a word sequence which in the first recognition step was identified as less likely (e.g. phonetically less similar) but is present in the cache (and consequently gets an increased likelihood by using the language model). This also allows the use of a relatively small cache.
In an embodiment as defined in the dependent claim 3, a normalized value is used for cache-misses. Moreover, the likelihood for cache hits converges to the normalized value as the number of occurrences in the cache decreases. This provides a smooth transition in probability between cache-hits and cache-misses.
In an embodiment as defined in the dependent claim 4, a discounting parameter is used to reduce the impact of cache-hits on the probability, smoothing the probabilities further.
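The backing-off, normalization and discounting described for these embodiments may be illustrated by a sketch based on absolute discounting. This is a standard smoothing technique chosen here as an assumption; it is not necessarily the formula of the claims. The function name `discounted_prob`, the parameter `d` and the uniform value `1 / num_possible` used for cache misses are hypothetical.

```python
from collections import Counter


def discounted_prob(item, cache_items, num_possible, d=0.5):
    """Probability of `item` given the cache contents, with absolute discounting.

    Each cached item gives up a discount `d` from its count; the freed
    probability mass is spread uniformly over all `num_possible` items,
    so a cache miss still receives a non-zero probability.
    """
    uniform = 1.0 / num_possible
    n = len(cache_items)
    if n == 0:
        return uniform  # empty cache: every item is equally likely
    counts = Counter(cache_items)
    backoff_mass = d * len(counts) / n
    return max(counts[item] - d, 0.0) / n + backoff_mass * uniform


recent = ["1", "1", "2", "3"]  # illustrative cache contents
p_hit_frequent = discounted_prob("1", recent, num_possible=10)
p_hit_rare = discounted_prob("2", recent, num_possible=10)
p_miss = discounted_prob("9", recent, num_possible=10)
```

In this sketch the probabilities over all ten possible items sum to one, a cache miss keeps a small non-zero value, and an item seen only once lies close to that value, so that the transition between cache hits and cache misses is gradual. A larger `d` reduces the impact of cache hits.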
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings.
In an embodiment as defined in the dependent claim 5, a simple language model is used for selecting between strings (or sub-strings) by comparing the entire (sub-)string to individual words in the cache. The relative number of cache-hits, in combination with smoothing operations, provides the probability of the (sub-)string.
In an embodiment as defined in the dependent claim 6, an M-gram language model is used, allowing comparison of only M words to the cache (or fewer than M if the sequence is still shorter) instead of the entire sequence. Advantageously, in case of a cache miss for the M-word sequence, backing off to a shorter sequence (of M-1 words) is used. Particularly for telephone numbers this allows better recognition of local numbers starting with the same digit sequence, even if the specific number is not yet in the cache.
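Backing off from an M-word sequence to shorter sequences may be sketched as follows, assuming a simple fixed penalty `alpha` per back-off step and a uniform value for a complete miss. The function names, the penalty scheme and the default values are assumptions for illustration; the resulting scores are relative and are not normalized probabilities.

```python
from collections import Counter


def ngram_counts(cache, n):
    """Count all n-word sequences occurring in the cached sequences."""
    counts = Counter()
    for seq in cache:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def backoff_score(cache, history, word, m=3, alpha=0.4, vocab_size=10):
    """Score `word` after `history`, trying the longest context first."""
    history = tuple(history)
    penalty = 1.0
    # Use fewer than m words if the sequence so far is still shorter.
    for n in range(min(m, len(history) + 1), 0, -1):
        context = history[len(history) - (n - 1):]
        counts = ngram_counts(cache, n)
        hits = counts[context + (word,)]
        if hits:
            total = sum(c for gram, c in counts.items() if gram[:n - 1] == context)
            return penalty * hits / total
        penalty *= alpha  # cache miss: back off to a shorter sequence
    return penalty / vocab_size  # word never seen in the cache


# Two cached numbers sharing the local prefix 0-4-0 (illustrative values).
numbers = [("0", "4", "0", "7", "1"), ("0", "4", "0", "2", "2")]
```

With this cache, a new number starting with the same local prefix still scores highly on its first digits, for instance `backoff_score(numbers, ("0", "4"), "0")`, even though the complete number has not been cached.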
In an embodiment as defined in the dependent claim 7, a special symbol is used (and preferably also stored in the cache for each recognized sequence) to separate sequences. For instance, if a special beginning-of-sequence symbol is used, a new sequence (with that special symbol and some more following words) automatically will result in hits only if the words actually occur at the same place in the sequence.
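The effect of a beginning-of-sequence symbol may be illustrated by a short sketch. The symbol `<s>`, the helper names and the digit strings are hypothetical choices made for this example.

```python
BOS = "<s>"  # hypothetical beginning-of-sequence symbol


def with_boundary(digits):
    """Prefix a recognized sequence with the boundary symbol before caching."""
    return (BOS,) + tuple(digits)


def count_ngram(cache, ngram):
    """Count occurrences of `ngram` at any position in the cached sequences."""
    ngram = tuple(ngram)
    m = len(ngram)
    return sum(
        1
        for seq in cache
        for i in range(len(seq) - m + 1)
        if seq[i:i + m] == ngram
    )


cache = [with_boundary("0401234"), with_boundary("1204")]

anywhere = count_ngram(cache, ("0", "4"))       # matches at any position
at_start = count_ngram(cache, (BOS, "0", "4"))  # matches only at the start
```

The digit pair 0-4 occurs in both cached numbers, but only one of them begins with it, so a sequence containing the boundary symbol hits only where the digits occupy the same place.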
Preferably, at least a trigram is used, allowing for good discrimination of the possible word sequences. Advantageously, a four-gram or five-gram is used, which provides a good balance between accurate selection and correctness of the language model using a relatively small cache of, for instance, 100 entries.