The invention relates to the recognition of speech in an audio signal, for example an audio signal uttered by a speaker.
More particularly, the invention relates to an automatic voice recognition method and system based on the use of acoustic models of voice signals, wherein speech is modeled in the form of one or more successions of voice units each corresponding to one or more phonemes.
A particularly interesting application of such a method and system is the automatic recognition of speech for voice dictation or for telephone-based interactive voice services.
Various types of modeling can be used in the context of speech recognition. In this respect, reference can be made to the article by Lawrence R. Rabiner entitled “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, volume 77, No. 2, February 1989. This article describes the use of hidden Markov models to model voice sequences. According to such a modeling, a voice unit, for example a phoneme or a word, is represented in the form of a sequence of states, each associated with a probability density that models the spectral shape expected to be observed in that state, as obtained from an acoustic analysis. One possible implementation variant of the Markov models consists in associating the probability densities with the transitions between states. This modeling is then used to recognize a spoken speech segment by comparing it with the models available to the voice recognition system, which are associated with known units and are obtained by a prior learning process.
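By way of illustration only, the comparison of a speech segment with the available models can be sketched as follows. This is a minimal sketch under simplifying assumptions that are not taken from the cited article: each observation is a single number rather than a spectral vector, each state carries a one-dimensional Gaussian density, and the names `forward_likelihood`, `model_a` and `model_b`, as well as all numerical values, are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian (assumed emission model)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(observations, initial, transitions, means, variances):
    """Likelihood of the observations under one model (forward algorithm)."""
    n_states = len(initial)
    # Initialisation: start probability times density of the first observation.
    alpha = [
        initial[s] * gaussian_pdf(observations[0], means[s], variances[s])
        for s in range(n_states)
    ]
    # Recursion: propagate through the transitions, then weight by the density.
    for x in observations[1:]:
        alpha = [
            sum(alpha[p] * transitions[p][s] for p in range(n_states))
            * gaussian_pdf(x, means[s], variances[s])
            for s in range(n_states)
        ]
    # Termination: sum over all final states.
    return sum(alpha)


# Hypothetical left-to-right topology shared by two three-state models.
initial = [1.0, 0.0, 0.0]
transitions = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
# Hypothetical state densities; in practice these result from prior learning.
model_a = {"means": [0.0, 1.0, 2.0], "variances": [0.25, 0.25, 0.25]}
model_b = {"means": [2.0, 1.0, 0.0], "variances": [0.25, 0.25, 0.25]}

observations = [0.1, 0.0, 1.1, 0.9, 2.0, 2.1]
scores = {
    name: forward_likelihood(
        observations, initial, transitions, model["means"], model["variances"]
    )
    for name, model in (("a", model_a), ("b", model_b))
}
# The recognized unit is the one whose model gives the highest likelihood.
recognized = max(scores, key=scores.get)
```

In this sketch the observations rise from about 0 to about 2, which matches the state sequence of the hypothetical `model_a`, so that model is expected to obtain the higher likelihood. A practical system would typically work with log-probabilities to avoid numerical underflow on long segments.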
The modeling of a voice unit is, however, strongly linked to the context in which that unit occurs. In practice, a phoneme can be pronounced in different ways depending on the phonemes that surround it.
Thus, for example, the French words “étroit” and “zéro”, which can be represented phonetically as follows:
“ei t r w a”;
and
“z ei r au”,
contain a phoneme “r”, the sound of which differs because of the sound of the phonemes that surround it.
In order to take account of the influence of the context in which a phoneme is situated, the voice units are normally modeled in the form of triphones, each of which models a phoneme according to the voice unit that precedes it and the voice unit that follows it. Thus, the words “étroit” and “zéro” can be transcribed by means of the following triphones:
étroit: &[ei]t ei[t]r t[r]w r[w]a w[a]&
zéro: &[z]ei z[ei]r ei[r]au r[au]&
According to this representation, the “&” sign is used to mark the limits of a word. For example, the triphone ei[t]r denotes a unit modeling the phoneme “t” when the latter appears after the phoneme “ei” and before the phoneme “r”.
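The triphone transcription described above can be sketched as follows. This is an illustrative sketch only; the function name `to_triphones` and its `boundary` parameter are hypothetical and do not appear in the description.

```python
def to_triphones(phonemes, boundary="&"):
    """Turn a phoneme sequence into left[center]right triphone labels."""
    # Pad with the word-boundary sign so the first and last phonemes
    # also have a left and a right context.
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}[{padded[i]}]{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


etroit = to_triphones(["ei", "t", "r", "w", "a"])
zero = to_triphones(["z", "ei", "r", "au"])
print(" ".join(etroit))  # &[ei]t ei[t]r t[r]w r[w]a w[a]&
print(" ".join(zero))    # &[z]ei z[ei]r ei[r]au r[au]&
```

The two words yield the distinct units t[r]w and ei[r]au for the same phoneme “r”, which is how the difference in context is captured.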
Another approach taking account of the context of a phoneme can consist in using voice models with voice units that correspond to a set of phonemes, for example a syllable. According to this approach, the words “étroit” and “zéro” can be represented, using a voice unit corresponding to a syllable, as follows:
étroit: ei t|r|w|a
zéro: z|ei r|au
As can be seen, such approaches require the availability of a large number of models to recognize words or sentences.
The number of units that take contextual influences into account depends greatly on the length of the context considered. If the context is limited to the unit that precedes and the unit that follows the unit concerned, the possible number of contextual units is equal to the number of context-free units raised to the third power. In the case of phonemes (36 in French), this gives 36³, that is, 46,656. In the case of syllables, the result is N×N×N, with N being on the order of several thousand. In this case, the number of possible voice units becomes prohibitive, and implementing a reliable voice recognition method then requires very substantial memory and computing resources.
Furthermore, there is not enough learning data available to estimate such a large number of parameters correctly.
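The growth in the number of contextual units can be checked with a short calculation. This is a sketch only: the function name `contextual_unit_count` is hypothetical, and the figure of 3,000 syllables is an assumed value chosen merely to stand for “several thousand”.

```python
def contextual_unit_count(n_units):
    """Possible units when each unit is qualified by one left and one right neighbor."""
    return n_units ** 3


phoneme_units = contextual_unit_count(36)
# Assumption: N = 3000 is an illustrative stand-in for "several thousand" syllables.
syllable_units = contextual_unit_count(3000)

print(phoneme_units)   # 46656
print(syllable_units)  # 27000000000
```

Even with this modest assumed value of N, the number of contextual syllable units is several hundred thousand times larger than the number of triphones, and each unit would require its own model parameters.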
The object of the invention is to overcome the above-mentioned drawbacks and to provide a speech recognition method and system that make it possible to considerably limit the number of parameters needed to model long voice units, namely, voice units corresponding to a syllable or to a series of phonemes.