The present invention concerns the field of automatic speech recognition. More particularly, it concerns recognition systems which rely on a learning method.
Such a system includes three main functional units: a parametrization unit, a learning unit and a recognition unit.
The parametrization unit calculates relevant parameters on the basis of speech signals picked up by a microphone. These calculations are carried out according to a parametric representation chosen in order to differentiate vocal forms in the best possible way, separating the semantic information contained in the speech from the aesthetic information peculiar to diction. Cepstral representations constitute an important class of such representations (see EP-A-0 621 582).
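By way of illustration only, the following minimal sketch computes a real cepstrum for a single frame of samples, that is, the inverse Fourier transform of the logarithm of the magnitude spectrum. It is not the representation of EP-A-0 621 582; the function names `dft` and `real_cepstrum` and the `floor` parameter are assumptions introduced for this example, and a practical system would use a fast transform on windowed frames.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short frame."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Return the real cepstrum of one frame of real-valued samples.

    `floor` (hypothetical parameter) prevents taking the logarithm of zero.
    """
    n = len(frame)
    # Logarithm of the magnitude spectrum.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    # Inverse DFT of the log magnitude; only the real part is kept, since the
    # log magnitude spectrum of a real frame is symmetric.
    return [
        sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]
```

In this sketch, multiplying the frame by a constant gain only shifts the first coefficient, which gives one simple indication of how a cepstral representation can separate signal level from spectral shape.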
The recognition unit makes the association between an observed segment of speech, represented by the parameters calculated by the parametrization unit, and a reference for which another set of parameters is stored in a dictionary of references. The sets of parameters stored in the dictionary in association with the different references can define deterministic models (they are for example composed directly of vectors coming from the parametrization unit). Most often, however, in order to take into account the variability of speech production and of the acoustic environment, sets of parameters which characterise stochastic models are used instead. Hidden Markov models (HMM) constitute an important class of such models. These stochastic models make it possible, by a maximum likelihood search, to identify the model which best accounts for the observed sequence of parameters, and to select the reference associated with this model (see L. R. RABINER: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition"; Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pages 257-285).
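The maximum likelihood selection described above can be sketched as follows. As a deliberate simplification, each reference is modelled here by a single Gaussian with diagonal covariance rather than by a hidden Markov model; the names `log_likelihood` and `recognise` and the layout of the dictionary (reference mapped to a pair of mean and variance vectors) are assumptions made for this example.

```python
import math


def log_likelihood(frames, mean, var):
    """Log-likelihood of a sequence of parameter vectors under one
    diagonal-covariance Gaussian (frames are treated as independent)."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def recognise(frames, dictionary):
    """Return the reference whose model gives the observed frames the
    highest likelihood. `dictionary` maps reference -> (mean, var)."""
    return max(
        dictionary,
        key=lambda ref: log_likelihood(frames, *dictionary[ref]),
    )


# Hypothetical two-word dictionary with two parameters per frame.
dictionary = {
    "yes": ([0.0, 0.0], [1.0, 1.0]),
    "no": ([5.0, 5.0], [1.0, 1.0]),
}
observed = [[0.1, -0.2], [0.3, 0.0]]
best = recognise(observed, dictionary)
```

With an HMM the likelihood computation would additionally account for the temporal structure of the sequence, but the selection step, taking the reference of the most likely model, has the same form.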
The learning unit is used to determine the parameters which will be stored in the dictionary and used in the recognition phase. In general, the system asks the user to pronounce several times each word or segment of speech which is to be associated with a reference. On the basis of these different observations, the learning unit estimates the model parameters which are to be stored in the dictionary. In the case where the dictionary contains stochastic models, this estimation generally amounts to carrying out calculations of the mean and of variance.
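The mean and variance calculation mentioned above may be illustrated by the following sketch, which assumes, purely for simplicity, that each pronunciation has been reduced to one parameter vector of fixed length. The function name `estimate_model` and the `var_floor` parameter are hypothetical; an HMM-based system would estimate such statistics for each state of the model.

```python
def estimate_model(observations, var_floor=1e-3):
    """Estimate per-dimension mean and variance from several pronunciations.

    `observations` is a list of equal-length parameter vectors, one per
    pronunciation. `var_floor` (hypothetical) keeps variances strictly
    positive when the pronunciations are few or nearly identical.
    """
    count = len(observations)
    dims = len(observations[0])
    mean = [sum(obs[d] for obs in observations) / count for d in range(dims)]
    # Maximum likelihood (population) variance, floored at var_floor.
    var = [
        max(sum((obs[d] - mean[d]) ** 2 for obs in observations) / count, var_floor)
        for d in range(dims)
    ]
    return mean, var


# Two hypothetical pronunciations of the same word.
mean, var = estimate_model([[1.0, 2.0], [3.0, 4.0]])
```

Because the estimates are computed from only a few pronunciations, a single observation made in poor conditions shifts both the mean and the variance noticeably.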
The learning process is a very important phase which greatly influences recognition performance. Incorrect or inadequate learning cannot be completely compensated for by the good performance of a recognition algorithm.
A careful user endeavours to carry out the learning in a silent environment, to keep his diction constant and to avoid extraneous noise (mouth noises, respiration, other external noises, etc.). But many users, who have not been made aware of these problems, run the risk, after learning carried out in poor conditions, of obtaining performance which does not meet expectations, and of rejecting the system.
To make this learning phase more robust, it is possible to increase the number of pronunciations required to create a reference model. Thus variations in pronunciation can be taken into account since the estimations of the parameters then rest on more complete statistics. The disadvantage of this solution is that it is not ergonomic, the user being required to pronounce each word too many times.
Another solution consists in making the parameters used to represent the vocal forms more robust. However, this solution does not resolve problems such as the presence of an intrusive word (spoken by another person or by the user himself) during the learning phase.
European Patent Application 0 762 709 describes a learning process in which a recognition test is carried out on the first pronunciation of the new word by the user. If another word in the dictionary of references is recognised during this test, the user is warned that the word which he has just pronounced is too similar to another word in the dictionary. If the test does not lead to the recognition of another word in the dictionary, the user is invited to repeat the new word. The processing carried out on the repetitions does not involve any recognition test. A rejection model ("garbage model") is simply used to "explain" portions of speech which are not part of the new word model previously formed. In other words, the model which is being worked out and the rejection model are used to bring about appropriate fragmentation in order to filter sound which may possibly be emitted by a hesitant or awkward user. With this fragmentation, the model which is in the learning phase is updated, then examined to check whether the update has taken place in good conditions. Unlike the test carried out on the first pronunciation of the word, this verification of the "good" update does not include any recognition test on the basis of the entire dictionary, including the words learned previously.
An object of the present invention is to make possible the realisation of good quality learning on the basis of a relatively low number of pronunciations of the words to be memorised.