(1) Field of the Invention
The present invention concerns the field of speech interfaces.
More precisely, the invention relates to the optimization of language models and/or of phonetic units in terminals using speech recognition.
(2) Description of Related Art
Information or control systems are making ever increasing use of a speech interface to make interaction with the user faster and/or more intuitive. Since these systems are becoming ever more complex, the requirements in terms of speech recognition are ever more considerable, both as regards the extent of recognition (very large vocabulary) and the speed of recognition (real time).
Speech recognition processes based on language models (which give the probability that a given word of the vocabulary of the application follows another word or group of words in the chronological order of the sentence) and on phonetic units are known in the state of the art. These techniques are described in particular in the work by Frederick Jelinek, “Statistical Methods for Speech Recognition”, published by MIT Press in 1997.
These techniques rely on language models and phonetic units which are produced from representative speech samples (emanating for example from a population of users of a terminal who are made to utter commands).
In practice, the language models must take account of the speaking style ordinarily employed by a user of the system, and in particular of his “defects”: hesitations, false starts, change of mind, etc.
The quality of the language model used greatly influences the reliability of the speech recognition. This quality is most often measured by an index referred to as the perplexity of the language model, which schematically represents the number of choices which the system must make for each decoded word. The lower the perplexity, the better the quality.
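By way of illustration only, the perplexity may be computed as 2 raised to the negative average base-2 logarithm of the probabilities that the model assigns to each word of a test sequence. The following is a minimal sketch under that usual definition; the function name `perplexity` and the sample probabilities are hypothetical and do not come from any particular recognition system.

```python
import math


def perplexity(word_probs):
    """Perplexity of a sequence, given the probability the model gave each word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    # Average base-2 log-probability per word (a negative number).
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-avg_log2)


# Hypothetical model that hesitates uniformly between 8 words at every step.
uniform = perplexity([1 / 8] * 5)

# Hypothetical sharper model that gives more probability to the right words.
sharper = perplexity([0.5, 0.25, 0.5, 0.5])
```

A model that hesitates uniformly between 8 words at each step has a perplexity of 8, which matches the schematic reading of perplexity as a number of choices per decoded word; the sharper model obtains a lower value.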
The language model is necessary to translate the speech signal into a textual string of words, a step often used by dialogue systems. It is then necessary to construct a comprehension logic which makes it possible to comprehend the query so as to reply to it.
There are two standard methods for producing large vocabulary language models:
The first, the so-called N-gram statistical method, most often employing a bigram or trigram, consists in assuming that the probability of occurrence of a word in the sentence depends solely on the N-1 words which precede it, independently of the rest of its context in the sentence.
If one takes the example of the trigram for a vocabulary of 1000 words, it would be necessary to define 1000^3 (that is, 10^9) probabilities to define the language model, which is impractical. The words are therefore grouped into sets which are either defined explicitly by the model designer, or deduced by self-organizing methods.
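The order of magnitude involved can be checked with a short calculation. The sketch below counts the probabilities of a word-level trigram and compares it with a class-based trigram, assuming (purely for illustration) that the 1000 words are grouped into 50 sets and that the model stores one trigram probability per triple of sets plus one probability per word given its set. The figure of 50 sets and the function name are hypothetical.

```python
def ngram_parameter_count(vocab_size, n):
    """Number of probabilities in a full table over n-word sequences."""
    return vocab_size ** n


VOCAB_SIZE = 1000
NUM_CLASSES = 50  # hypothetical number of word sets, chosen for illustration

# Word-level trigram: one probability per triple of words.
word_trigram = ngram_parameter_count(VOCAB_SIZE, 3)

# Class-level trigram plus one "word given its set" probability per word.
class_trigram = ngram_parameter_count(NUM_CLASSES, 3) + VOCAB_SIZE
```

Under these assumptions the table shrinks from 1,000,000,000 entries to 126,000, which is why grouping words into sets makes the model tractable.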
This language model is therefore constructed from a text corpus automatically.
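As an illustration of automatic construction from a text corpus, the sketch below estimates a bigram model by relative frequency (maximum likelihood, without smoothing). The three-sentence corpus, the sentence boundary markers `<s>` and `</s>`, and the function name `train_bigram` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Estimate P(word | previous word) by relative frequency in the corpus."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Boundary markers let the model learn how sentences start and end.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            pair_counts[prev][cur] += 1
    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


# Hypothetical corpus of commands uttered by users of a terminal.
corpus = ["switch to channel five", "switch to channel two", "switch off"]
model = train_bigram(corpus)
```

With this corpus, `model["switch"]` gives "to" a probability of 2/3 and "off" a probability of 1/3. A word pair never seen in the corpus has no entry at all, which is why practical systems add smoothing.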
This type of language model is used mainly for speech dictation systems whose ultimate functionality is to translate the speech signal into a text, without any comprehension phase being necessary.
The second method consists in describing the syntax by means of a probabilistic grammar, typically a context-free grammar defined by virtue of a set of rules described in the so-called Backus Naur Form or BNF, or an extension of this form to contextual grammars. The rules describing grammars are most often handwritten. This type of language model is suitable for command and control applications, in which the recognition phase is followed by a phase of controlling an appliance or of searching for information in a database.
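A handwritten probabilistic grammar of this kind may be pictured as follows. The sketch encodes a small BNF-like rule set as a Python dictionary, attaches a probability to each alternative, and enumerates every sentence of the language with its probability. The rules, the probabilities and the function name `expand` are hypothetical, and the enumeration assumes a non-recursive grammar.

```python
# Each non-terminal maps to a list of (probability, right-hand side) pairs.
# In BNF notation: <command> ::= <action> the <device>
#                              | please <action> the <device>
GRAMMAR = {
    "<command>": [
        (0.6, ["<action>", "the", "<device>"]),
        (0.4, ["please", "<action>", "the", "<device>"]),
    ],
    "<action>": [(0.5, ["start"]), (0.5, ["stop"])],
    "<device>": [(0.7, ["recorder"]), (0.3, ["television"])],
}


def expand(symbols):
    """Yield (words, probability) for every derivation of a symbol sequence."""
    if not symbols:
        yield [], 1.0
        return
    head, rest = symbols[0], symbols[1:]
    for rest_words, rest_p in expand(rest):
        if head not in GRAMMAR:
            # Terminal symbol: it is a word of the sentence.
            yield [head] + rest_words, rest_p
        else:
            # Non-terminal: try each alternative, weighted by its probability.
            for rule_p, rhs in GRAMMAR[head]:
                for head_words, head_p in expand(rhs):
                    yield head_words + rest_words, rule_p * head_p * rest_p


language = {" ".join(words): p for words, p in expand(["<command>"])}
```

The resulting `language` holds the eight sentences that this grammar can produce, for example "start the recorder" with probability 0.6 × 0.5 × 0.7 = 0.21. Any sentence absent from it cannot be recognized by a system relying on this grammar alone.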
The language model of an application describes the set of expressions (for example sentences) that the application will be required to recognize. A drawback of the prior art is that, if the language model is of poor quality, the recognition system, even if it performs extremely well at the acoustico-phonetic decoding level, will have mediocre performance for certain expressions.
The stochastic type language models do not have, properly speaking, a clear definition of the expressions which are in the language model, and of those which are outside. Certain expressions simply have a higher a priori probability of occurrence than others.
The language models of probabilistic grammar type show a clear difference between expressions belonging to the language model and expressions external to it. In these models, there are therefore expressions which can never be recognized, regardless of the quality of the phonetic models used. These are generally expressions having no meaning, or carrying a meaning outside the field of application of the system developed.
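The contrast between the two types of model can be sketched as follows. A grammar is represented here, for simplicity, by the explicit set of sentences it accepts, while the stochastic model is a bigram with add-one smoothing, so that any sequence of known words keeps a non-zero probability. Both the three-sentence data set and the choice of add-one smoothing are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: the grammar accepts exactly these sentences, and the
# stochastic model is trained on the same sentences.
TRAINING = ["start the recorder", "stop the recorder", "start the television"]
GRAMMAR_LANGUAGE = set(TRAINING)


def grammar_accepts(sentence):
    """Grammar type model: a sentence is either inside the language or not."""
    return sentence in GRAMMAR_LANGUAGE


# Words that can be predicted, including the end-of-sentence marker.
vocab = {word for s in TRAINING for word in s.split()} | {"</s>"}
bigrams = Counter()
contexts = Counter()
for s in TRAINING:
    words = ["<s>"] + s.split() + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1


def smoothed_probability(sentence):
    """Stochastic model: add-one smoothed bigram probability of a sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return p


inside = "start the recorder"
outside = "the recorder stop"  # known words, but not a sentence of the grammar
```

Here `grammar_accepts(outside)` is false, so the grammar can never yield that expression, whereas `smoothed_probability(outside)` is small but strictly positive: the stochastic model merely ranks it as less likely than `inside`.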
It turns out that the language models of probabilistic grammar type and their derivatives are more effective for command and control applications. These grammars are often written by hand, and one of the main difficulties in developing dialogue systems is to provide a language model of good quality.
In particular, as far as the models of grammar type are concerned, it is not possible to define a language exhaustively, especially if the language is likely to be used by a large population (as in the case, for example, of a remote control for mass-market appliances). It is not possible to take account of all the possible expressions and turns of phrase (from formal language to slang), and/or of errors of grammar, etc.