The invention relates to a huge vocabulary speech recognition system for recognizing a sequence of spoken words, the system comprising input means for receiving a time-sequential input pattern representative of the sequence of spoken words; and a large vocabulary speech recognizer operative to recognize the input pattern as a sequence of words from the vocabulary using a large vocabulary recognition model associated with the speech recognizer.
From U.S. Pat. No. 5,819,220 a system is known for recognizing speech in an Internet environment. The system is particularly targeted towards accessing information resources on the World Wide Web (WWW) using speech. Building a speech recognition system as an interface to the Web faces very different problems from those encountered in traditional speech recognition domains. The primary problem is the huge vocabulary which the system needs to support, since a user can access virtually any document on any topic. It is very difficult, if not impossible, to build an appropriate recognition model, such as a language model, for those huge vocabularies. In the known system a predetermined recognition model, including a statistical n-gram language model and an acoustic model, is used. The recognition model is dynamically altered using a web-triggered word set. An HTML (HyperText Mark-up Language) document contains links, such as hypertext links, which are used to identify a word set to be included in the final word set for probability boosting the word recognition search. In this way the word set used for computing the speech recognition scores is biased by incorporating the web-triggered word set.
The known system requires a suitable huge vocabulary model as a starting model to be able to obtain a biased model after adaptation. In fact, the biased model can be seen as a conventional large vocabulary model optimized for the current recognition context. As indicated before, it is very difficult to build a suitable huge vocabulary model, even if it is only used as a starting model. A further problem occurs for certain recognition tasks, such as recognizing input for particular Web sites or HTML documents, like those present on search engines or large electronic shops, such as book stores. In such situations the number of words which can be uttered is huge. A conventional large vocabulary model will in general not be able to effectively cover the entire range of possible words. Biasing a starting model with relatively few words will not result in a good recognition model. Proper biasing would require a huge additional word set and a significant amount of processing, assuming the starting model was already reasonably good.
It is an object of the invention to provide a recognition system which is better capable of dealing with huge vocabularies.
To achieve the object, the system is characterized in that the system comprises a plurality of N large vocabulary speech recognizers, each being associated with a respective, different large vocabulary recognition model; each of the recognition models being targeted to a specific part of the huge vocabulary; and the system comprises a controller operative to direct the input pattern to a plurality of the speech recognizers and to select a recognized word sequence from the word sequences recognized by the plurality of speech recognizers.
By using several recognizers each with a specific recognition model targeted at a part of the huge vocabulary, the task of building a recognition model for a huge vocabulary is broken down into the manageable task of building large vocabulary models for specific contexts. Such contexts may include health, entertainment, computer, arts, business, education, government, science, news, travel, etc. It will be appreciated that each of those contexts will normally overlap in vocabulary, for instance in the general words of the language. The contexts will differ in statistics of those common words as well as in the jargon specific for those contexts. By using several of those models to recognize the input, a wider range of utterances can be recognized using properly trained models. A further advantage of using several models is that this allows a better discrimination during the recognition. If one huge vocabulary was used, certain utterances would only be recognized in one specific meaning (and spelling). As an example, if a user pronounces a word sounding like ‘color’, most of the recognized word sequences will include the very common word ‘color’. It will be less likely that the word ‘collar’ (of a fashion context) is recognized, or ‘collar’ of collared herring (food context), or collar-bone (health context). Those specific words do not have much chance of being recognized in a huge vocabulary which inevitably will be dominated by frequently occurring word sequences of general words. By using several models, each model will identify one or more candidate word sequences from which then a selection can be made. Even if in this final selection a word sequence with ‘color’ gets selected, the alternative word sequences with ‘collar’ in it can be presented to the user.
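The controller behaviour described above can be illustrated with a minimal sketch. It is not the claimed implementation: the recognizers are hypothetical stubs that map a "sounds-like" token to a word and a score, the names `Hypothesis`, `make_stub_recognizer` and `select_word_sequence` are invented for illustration, and the sketch assumes that scores from different recognizers are directly comparable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Hypothesis:
    context: str        # context of the model that produced the sequence
    words: List[str]    # recognized word sequence
    score: float        # higher is better (assumed comparable across models)


Recognizer = Callable[[str], Hypothesis]


def make_stub_recognizer(context: str,
                         lexicon: Dict[str, Tuple[str, float]]) -> Recognizer:
    """Hypothetical stand-in for a large vocabulary recognizer."""
    def recognize(pattern: str) -> Hypothesis:
        words: List[str] = []
        total = 0.0
        for token in pattern.split():
            # Unknown tokens get a placeholder word and a zero score.
            word, score = lexicon.get(token, ("<unk>", 0.0))
            words.append(word)
            total += score
        return Hypothesis(context, words, total / max(len(words), 1))
    return recognize


def select_word_sequence(pattern: str, recognizers: List[Recognizer]):
    """Direct the pattern to every recognizer; return the best hypothesis
    and the remaining ones as alternatives that could be shown to the user."""
    if not recognizers:
        raise ValueError("at least one recognizer is required")
    hypotheses = [recognize(pattern) for recognize in recognizers]
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    return ranked[0], ranked[1:]


# Illustrative data only: both models hear the same sound differently.
general = make_stub_recognizer(
    "general", {"KOL-er": ("color", 0.9), "blue": ("blue", 0.8)})
fashion = make_stub_recognizer(
    "fashion", {"KOL-er": ("collar", 0.7), "blue": ("blue", 0.8)})

best, alternatives = select_word_sequence("blue KOL-er", [general, fashion])
print(best.words, [alt.words for alt in alternatives])
```

In this sketch the sequence containing ‘color’ is selected, while the sequence containing ‘collar’ remains available as an alternative.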
Preferably, the recognizers operate in parallel in the sense that the user does not experience a significant delay in the recognition. This may be achieved using separate recognition engines each having its own processing resources. Alternatively, this may be achieved using a sufficiently powerful serial processor which operates the recognition tasks in ‘parallel’ using conventional time slicing techniques.
It should be noted that using parallel speech recognition engines is known. U.S. Pat. No. 5,754,978 describes using recognition engines in parallel. All of the engines have a relatively high accuracy of, e.g. 95%. If the 5% inaccuracy of the engines does not overlap, the accuracy of recognition can be improved. To ensure that the inaccuracies do not fully overlap, the engines may be different. Alternatively, the engines may be identical in which case the input signal to one of the engines is slightly perturbed or one of the engines is slightly perturbed. A comparator compares the recognized text and accepts or rejects the text based on the degree of agreement between the output of the engines. Since this system requires accurate recognition engines, which do not exist for huge vocabularies, this system provides no solution for huge vocabulary recognition. Neither does the system use different models targeted towards specific parts of a huge vocabulary.
WO 98/10413 describes a dialogue system with an optional number of speech recognition modules which can operate in parallel. The modules are targeted towards a specific type of speech recognition, such as isolated digit recognition, continuous number recognition, small vocabulary word recognition, isolated large vocabulary recognition, continuous word recognition, keyword recognition, word sequence recognition, alphabet recognition, etc. The dialogue system knows up front which type of input the user will supply and accordingly activates one or more of the specific modules. For instance, if the user needs to speak a number, the dialogue engine will enable the isolated digit recognition and the continuous number recognition, allowing the user to speak the number as digits or as a continuous number. The system provides no solution for dealing with huge vocabularies.
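One possible way to run several recognition tasks concurrently is sketched below, using a thread pool from the Python standard library. The function name `recognize_in_parallel` and the stub recognizers are hypothetical. In standard CPython, threads running pure Python code are effectively time-sliced on one interpreter, so the sketch is closer to the serial, time-sliced alternative; engines with their own processing resources would more likely run as separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# A recognizer is modelled as a callable returning (context, recognized text).
Recognizer = Callable[[str], Tuple[str, str]]


def recognize_in_parallel(pattern: str,
                          recognizers: List[Recognizer]) -> List[Tuple[str, str]]:
    """Submit the same input pattern to every recognizer and collect the
    results in the order in which the recognizers were given."""
    with ThreadPoolExecutor(max_workers=max(1, len(recognizers))) as pool:
        futures = [pool.submit(recognize, pattern) for recognize in recognizers]
        return [future.result() for future in futures]


# Hypothetical stubs standing in for real recognition engines.
stub_recognizers: List[Recognizer] = [
    lambda pattern: ("general", pattern.upper()),
    lambda pattern: ("fashion", pattern[::-1]),
]

results = recognize_in_parallel("abc", stub_recognizers)
print(results)
```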
The recognition models of the system according to the invention may be predetermined. Preferably, as defined in dependent claim 2, a model selector is used to dynamically select at least one of the models actively used for recognition. The selection depends on the context of the user input, like the query or dictation subject. Preferably, the model selector selects many of the recognition models. In practice, at least one of the models will represent the normal day-to-day vocabulary on general subjects. Such a model will normally always be used.
In an embodiment as defined in the dependent claim 3, a document defines the recognition context. As defined in the dependent claim 5, this may be done by scanning the words present in the document and determining the recognition model(s) which are best suited to recognize those words (e.g. those models which have most words or word sequences in common with the document).
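A minimal sketch of such a document scan follows. It assumes that each recognition model exposes its vocabulary as a set of words and ranks models by the number of distinct document words they share; the function names, the tokenization rule and the example vocabularies are all hypothetical, and word sequences are not considered.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_models_by_overlap(document: str,
                           model_vocabularies: Dict[str, Set[str]],
                           top_k: int = 2) -> List[str]:
    """Return the names of the top_k models sharing most words with the
    document; ties are broken alphabetically."""
    document_words = set(tokenize(document))
    overlap = {name: len(document_words & vocabulary)
               for name, vocabulary in model_vocabularies.items()}
    ranked = sorted(overlap, key=lambda name: (-overlap[name], name))
    return ranked[:top_k]


# Illustrative vocabularies only.
vocabularies = {
    "health": {"doctor", "clinic", "collar", "bone"},
    "fashion": {"collar", "shirt", "fabric"},
    "travel": {"hotel", "flight"},
}

selected = rank_models_by_overlap(
    "A shirt with a stiff collar in soft fabric.", vocabularies)
print(selected)
```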
In an embodiment as defined in the dependent claim 4, the context (or contexts) is indicated in a Web page, e.g. using an embedded tag identifying the context. The page may also indicate the context (or context identifier), for instance, via a link.
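How a page might carry such an indication is not specified here. Purely as an assumption, the sketch below reads a hypothetical `<meta name="recognition-context" content="...">` tag with a comma-separated list of context identifiers, using the `html.parser` module from the Python standard library.

```python
from html.parser import HTMLParser
from typing import List


class ContextTagParser(HTMLParser):
    """Collects context identifiers from a hypothetical meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.contexts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The tag name "recognition-context" is an assumption for illustration.
        if attributes.get("name") == "recognition-context":
            content = attributes.get("content") or ""
            self.contexts.extend(
                part.strip() for part in content.split(",") if part.strip())


def contexts_from_page(html: str) -> List[str]:
    parser = ContextTagParser()
    parser.feed(html)
    return parser.contexts


page = ('<html><head>'
        '<meta name="recognition-context" content="health, travel">'
        '</head><body>Health farms in the mountains</body></html>')
page_contexts = contexts_from_page(page)
print(page_contexts)
```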
In an embodiment as defined in the dependent claim 6, the system actively tries to identify those recognition models which are suitable for the current recognition task. In addition to the recognition models which are at that moment actively used for the recognition, the other models are tested for their suitability. This testing may be performed as a background task by using one or more additional recognizers which check whether the not-used models would have given a better result than one of the actively used models. Alternatively, the actual recognizers may be used to test the not-used models at moments when the recognizers have sufficient capacity left over, e.g. when the user is not speaking. The testing may include all input of the user. Particularly if the user has already supplied a lot of speech input, preferably the testing is limited to the most recent input. In this way, whenever the user changes subject, more suitable models can quickly be selected. A criterion for determining which models are best suited, i.e. offer the highest accuracy of recognition, is preferably based on performance indications of the recognition like scores or confidence measures.
In an embodiment as defined in the dependent claim 7, the recognition models are hierarchically arranged. This simplifies selecting suitable models. Preferably, recognition is started with a number of relatively generic models. If a certain generic model proves to provide good recognition results, more specific models can be tested to improve the recognition even further. Some of the more specific models may be shared by several more generic models. If at a certain moment the recognition results of a specific model become worse, several of the more generic models hierarchically above the specific model may be tried. This allows smooth transition from one context to another. As an example, a user may start with providing input on the generic context of health. At a certain moment it may be detected that the user is primarily focussing on the more specific context of medical centers or institutes, and even goes down to the most specific context of health farms. Particularly if the health farm is located in an attractive area, this may inspire the user to move to the more generic context of holidays or travel or, more specifically, travel in the area of the health farm.
As defined in the dependent claim 8, the recognition may be done by a separate recognition server. In the context of the Internet, such a server could be a separate station on the net, or be integrated with existing stations, such as a search engine, or a service provider, like an electronic book store. Particularly, recognition servers which operate for many users need to be able to support a vocabulary suited for most users. The use of several, specific large vocabulary models makes such a system better capable of performing this task with a high recognition accuracy.
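The background testing described above might be sketched as follows. The sketch keeps only the most recent utterances in a fixed-size window, scores every model on that window, and proposes replacing the weakest active model when a not-used model scores higher. The class `ModelTester`, the helper `keyword_confidence` and the keyword-based confidence measure are hypothetical stand-ins; a real recognizer would supply its own scores or confidence measures.

```python
from collections import deque
from statistics import mean
from typing import Callable, Dict, Optional, Set, Tuple

# A model is represented by a function returning a confidence in [0, 1].
ConfidenceFn = Callable[[str], float]


def keyword_confidence(vocabulary: Set[str]) -> ConfidenceFn:
    """Hypothetical confidence: fraction of utterance words in the vocabulary."""
    def confidence(utterance: str) -> float:
        words = utterance.lower().split()
        return sum(w in vocabulary for w in words) / len(words) if words else 0.0
    return confidence


class ModelTester:
    def __init__(self, window: int = 3) -> None:
        # Older utterances are dropped automatically once the window is full.
        self.recent = deque(maxlen=window)

    def add_utterance(self, utterance: str) -> None:
        self.recent.append(utterance)

    def average_confidence(self, model: ConfidenceFn) -> float:
        return mean([model(u) for u in self.recent]) if self.recent else 0.0

    def propose_swap(self, active: Dict[str, ConfidenceFn],
                     inactive: Dict[str, ConfidenceFn]) -> Optional[Tuple[str, str]]:
        """Return (model to deactivate, model to activate), or None."""
        if not active or not inactive or not self.recent:
            return None
        worst = min(active, key=lambda n: self.average_confidence(active[n]))
        best = max(inactive, key=lambda n: self.average_confidence(inactive[n]))
        if self.average_confidence(inactive[best]) > self.average_confidence(active[worst]):
            return worst, best
        return None


tester = ModelTester(window=2)
for utterance in ["doctor clinic appointment", "book a flight", "hotel near beach"]:
    tester.add_utterance(utterance)

active_models = {"health": keyword_confidence({"doctor", "clinic", "appointment"})}
unused_models = {"travel": keyword_confidence({"flight", "hotel", "beach"})}
proposal = tester.propose_swap(active_models, unused_models)
print(proposal)
```

Because the window holds only the two most recent utterances, the earlier health-related input no longer counts, and the sketch proposes switching from the health model to the travel model.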
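The hierarchical arrangement can be sketched as a mapping from each model to the more generic models directly above it. The concrete hierarchy, the thresholds `good` and `poor`, and the function names below are assumptions for illustration only; in particular, placing the health farm model under both medical centers and travel is one possible reading of a specific model shared by several generic models.

```python
from typing import Dict, List

# Hypothetical hierarchy: model name -> more generic models directly above it.
PARENTS: Dict[str, List[str]] = {
    "health": [],
    "travel": [],
    "medical centers": ["health"],
    "health farms": ["medical centers", "travel"],
}


def children_of(model: str, parents: Dict[str, List[str]]) -> List[str]:
    """More specific models directly below the given model."""
    return sorted(child for child, above in parents.items() if model in above)


def next_candidates(model: str, score: float, parents: Dict[str, List[str]],
                    good: float = 0.8, poor: float = 0.5) -> List[str]:
    """Choose which models to try next based on the current model's score."""
    if score >= good:
        # Good results: try more specific models, if any exist.
        return children_of(model, parents) or [model]
    if score < poor:
        # Poor results: fall back to the more generic models above.
        return list(parents[model]) or [model]
    return [model]


print(next_candidates("health", 0.9, PARENTS))
print(next_candidates("health farms", 0.3, PARENTS))
```

With this hierarchy, poor results for the health farm model lead the sketch to try both the medical centers model and the travel model, which corresponds to the transition from health to travel described above.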