The present invention concerns the field of automatic speech recognition.
A speech recognition system includes two main functional units: a parametrization unit and a recognition unit. A learning unit is often added to these, serving to construct the dictionary of references used by the recognition unit.
The parametrization unit calculates relevant parameters on the basis of speech signals picked up by a microphone. These calculations are carried out according to a parametric representation chosen in order to differentiate vocal forms in the best possible way, separating the semantic information contained in the speech from the aesthetic information peculiar to diction. Cepstral representations constitute an important class of such representations (see EP-A-0 621 582).
The recognition unit makes the association between an observed segment of speech, represented by the parameters calculated by the parametrization unit, and a reference for which another set of parameters is stored in a dictionary of references. The sets of parameters stored in the dictionary in association with the different references can define deterministic models (they are for example composed directly of vectors coming from the parametrization unit). Most often, however, in order to take into account the variability of speech production and of the acoustic environment, sets of parameters which characterise stochastic models are used instead. Hidden Markov models (HMM) constitute an important class of such models. These stochastic models make it possible, by searching for the maximum likelihood, to identify the model which best accounts for the observed sequence of parameters, and to select the reference associated with this model (see L. R. RABINER: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition"; Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pages 257-285).
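By way of illustration only, the maximum likelihood search can be sketched for small discrete HMMs. The sketch assumes discrete observation symbols and models given as (initial probabilities, transition matrix, emission matrix); the function names forward_likelihood and best_reference are hypothetical, and a practical system would work on cepstral vectors with continuous densities and log-domain arithmetic.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical model container: (initial probabilities, transitions, emissions).
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_likelihood(obs: Sequence[int], model: Model) -> float:
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    start, trans, emit = model
    n = len(start)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    # Induction: propagate through the transitions, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
    # Termination: sum over all final states.
    return sum(alpha)


def best_reference(obs: Sequence[int], models: Dict[str, Model]) -> Tuple[str, Dict[str, float]]:
    """Score every reference model and return the one with maximum likelihood."""
    scores = {name: forward_likelihood(obs, m) for name, m in models.items()}
    return max(scores, key=scores.get), scores


# Two toy two-state, left-to-right models (illustrative values, not real data).
TRANS = [[0.5, 0.5], [0.0, 1.0]]
MODELS: Dict[str, Model] = {
    "yes": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.9, 0.1]]),
}

if __name__ == "__main__":
    reference, scores = best_reference([0, 0, 1, 1], MODELS)
    print(reference, scores)
```

In this toy setting the observed sequence is scored against each stored model, and the reference whose model yields the greatest likelihood is proposed as the optimum.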
In general, the recognition of a word or a speech segment is not limited to searching for the maximum likelihood. One or more other likelihood criteria are examined to determine whether the optimum model, which presents the maximum likelihood, should in fact be selected. One such criterion is, for example, that the maximised likelihood exceeds a certain threshold. If the criterion is satisfied, the optimum model is selected and the recognition unit provides a result.
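The threshold criterion mentioned above can be sketched as follows. This is a minimal illustration which assumes that the scores are log-likelihoods held in a dictionary keyed by reference; the function name select_if_plausible and the threshold value are hypothetical.

```python
from typing import Dict, Optional


def select_if_plausible(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the optimum reference only if its score exceeds the threshold.

    scores maps each reference to a (log-)likelihood; None signals that the
    likelihood criterion is not satisfied and no result is provided.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


if __name__ == "__main__":
    # Illustrative log-likelihoods, not measured data.
    print(select_if_plausible({"yes": -12.0, "no": -40.0}, threshold=-20.0))
    print(select_if_plausible({"yes": -35.0, "no": -40.0}, threshold=-20.0))
```

Returning None rather than the optimum reference corresponds to the ambiguous case, in which the recognition unit must fall back on another strategy.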
Otherwise, several solutions can be used. A first solution is to ask the speaker to confirm that the speech segment uttered corresponds to the reference associated with the optimum model, or to one of the references associated with the n models for which the likelihoods are greatest (see EP-A-0 651 372). The user then has to carry out special manipulations in order to validate his choice, which is not ergonomic, especially for applications in hands-free mode.
Another solution is to ask the speaker to repeat what he has just said. If the likelihood criterion is satisfied by the optimum model proposed as a result of the recognition test carried out on this repetition, the recognition terminates. Otherwise, another repetition is requested, and so on. This second solution is not well suited to noisy environments or to environments disturbed by multiple speakers: the noise which disturbed the first pronunciation and caused the likelihood criterion to fail will often disturb the repetition as well, causing the criterion to fail again, so that the user finds himself forced to repeat the same word several times without success. If an attempt is made to overcome this disadvantage by adopting a less severe likelihood criterion, the system tends to make numerous false starts in noisy environments.
EP-A-0 573 301 describes a method in which, in order to establish the ranking on which the recognition is based, the a priori probability of pronunciation of a word associated with a reference is replaced, after the repetition by the speaker, by the conditional probability of pronunciation of this word given that the same word has been said twice. The conditional probability is calculated with the aid of an expansion in accordance with Bayes' theorem. This method thus seeks to refine the absolute values of the recognition scores of the different entries in the dictionary of references.
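The replacement of an a priori probability by a conditional probability can be illustrated with a generic application of Bayes' theorem. This sketch is not the expansion given in EP-A-0 573 301: it assumes, hypothetically, that for each word w one knows P(w) and the probability P(R | w) that a speaker who said w is asked to say it again (event R), and it computes P(w | R) = P(R | w) P(w) / sum over v of P(R | v) P(v). The function name prior_given_repetition is likewise hypothetical.

```python
from typing import Dict


def prior_given_repetition(prior: Dict[str, float],
                           repeat_prob: Dict[str, float]) -> Dict[str, float]:
    """Return P(w | R) for every word w, by Bayes' theorem.

    prior[w]       -- a priori probability P(w) of the word being pronounced
    repeat_prob[w] -- assumed probability P(R | w) that w is said twice
    """
    joint = {w: repeat_prob[w] * prior[w] for w in prior}
    evidence = sum(joint.values())  # P(R), the normalising constant
    if evidence == 0.0:
        raise ValueError("repetition has zero probability under the given model")
    return {w: p / evidence for w, p in joint.items()}


if __name__ == "__main__":
    # Illustrative numbers only: word "b" is assumed harder to recognise,
    # so it is repeated more often than word "a".
    print(prior_given_repetition({"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.6}))
```

Under these assumed values, a word that is frequently repeated receives a larger conditional probability after a repetition than its a priori probability, which changes the absolute scores on which the ranking is based.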
An object of the present invention is to propose an effective solution for recognising speech in ambiguous cases.