The present invention relates to a speech recognition method for recognizing input speech using phoneme and language models, as well as to a speech recognition system adopting that method.
Today, speech recognition functions and devices are finding their way into small data apparatuses such as portable speech translators and personal digital assistants (PDAs), as well as into car navigation systems and many other appliances and systems.
A conventional speech recognition method typically involves storing phoneme and language models beforehand and recognizing input speech based on the stored models, as described illustratively in "Automatically Translated Telephone" (pp. 10-29, from Ohm-sha in Japan in 1994, edited by Advanced Telecommunications Research Institute International). A language model is made up of pronunciations of different words and syntax rules, whereas each phoneme model includes spectral characteristics of each of a plurality of speech recognition units. The speech recognition unit is typically a phoneme or a phoneme element that is smaller than a phoneme. The background art of this field will be described below with phonemes regarded as speech recognition units. Spectral characteristics stored with respect to each phoneme may sometimes be referred to as a phoneme model of the phoneme in question.
The language model determines a plurality of allowable phoneme strings. At the time of speech recognition, a plurality of phoneme model strings are generated corresponding to each of the allowable phoneme strings. The phoneme model strings are each collated with the input speech so that the phoneme model string of the best match is selected. In collating each phoneme model string with the input speech, the input speech is divided into segments called frames. The frames are each collated successively with a plurality of phoneme models constituting each phoneme model string so as to compute evaluation values representing similarities between the phoneme model in question and the input speech. This collating process is repeated with different phoneme model strings, and then with different frames. The evaluation values obtained by collating the phoneme models of each phoneme model string with a given frame of the input speech are also used in the collation of the next frame.
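The frame-by-frame collation described above can be illustrated with a minimal sketch. It is not the implementation of the cited reference. It assumes that each phoneme model is reduced to a single mean vector, that similarity is the negative squared distance between a frame and that vector, and that a model may only repeat or hand over to the next model in its string. The names `frame_log_likelihood`, `collate_string` and `recognize`, and the two transition weights, are hypothetical.

```python
import math

NEG_INF = float("-inf")


def frame_log_likelihood(frame, model_mean):
    """Toy similarity (assumption): negative squared distance to the model's mean vector."""
    return -sum((x - m) ** 2 for x, m in zip(frame, model_mean))


def collate_string(frames, model_string,
                   log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Collate one phoneme model string with every frame, in frame order.

    scores[i] is the evaluation value of model i after the current frame;
    it is retained and reused when the next frame is collated.
    """
    n = len(model_string)
    scores = [NEG_INF] * n
    # The first frame can only be matched against the first model of the string.
    scores[0] = frame_log_likelihood(frames[0], model_string[0])
    for frame in frames[1:]:
        new_scores = [NEG_INF] * n
        for i, mean in enumerate(model_string):
            stay = scores[i] + log_stay
            advance = scores[i - 1] + log_next if i > 0 else NEG_INF
            best_prev = max(stay, advance)
            if best_prev > NEG_INF:
                new_scores[i] = best_prev + frame_log_likelihood(frame, mean)
        scores = new_scores
    # Evaluation value of the last model after the last frame.
    return scores[-1]


def recognize(frames, candidates):
    """Return the name of the phoneme model string that best matches the frames."""
    return max(candidates, key=lambda name: collate_string(frames, candidates[name]))
```

In this sketch every model of every string is evaluated for every frame, and one evaluation value per model is kept between frames, which is the cost in time and memory that the following paragraphs discuss.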
As outlined above, the conventional speech recognition method is time-consuming because it involves collating all frames of the input speech with all phoneme models in all phoneme model strings. Furthermore, the evaluation values acquired by collating the phoneme models in each phoneme model string with a given frame of the input speech must be retained in memory for collation of the next frame. As a result, the amount of memory required grows with the total number of phoneme model strings.
The so-called beam search method has been proposed as a way to reduce such prolonged processing time. When each frame of the input speech is collated, this method limits the phoneme models to those expected to become final candidates for speech recognition. More specifically, checks are made on all phoneme model strings to see, based on the evaluation values computed in a given frame for all phoneme model strings, whether each of the phoneme models should be carried forward for collation in the next frame. There are a number of schemes to determine which phoneme models to carry forward: (1) a fixed number of phoneme models starting from the model of the highest evaluation value are carried forward; (2) an evaluation value threshold is computed so that only the phoneme models with their evaluation values higher than the threshold are carried forward; or (3) the above two schemes are used in combination.
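The three carry-forward schemes can be sketched as follows. This is an illustration only. The text does not say how the threshold of scheme (2) is computed, so the sketch assumes a common choice, namely the best evaluation value minus a fixed beam width. The function names and the dictionary layout (phoneme model identifier mapped to evaluation value) are hypothetical.

```python
def prune_top_n(scores, n):
    """Scheme (1): keep the n models with the highest evaluation values (requires a sort)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


def prune_threshold(scores, beam_width):
    """Scheme (2): keep the models whose evaluation value is higher than a threshold.

    Assumption: the threshold is the best evaluation value minus beam_width.
    """
    if not scores:
        return set()
    threshold = max(scores.values()) - beam_width
    return {model for model, value in scores.items() if value > threshold}


def prune_combined(scores, n, beam_width):
    """Scheme (3): apply the threshold first, then cap the survivors at n models."""
    survivors = prune_threshold(scores, beam_width)
    return prune_top_n({model: scores[model] for model in survivors}, n)
```

For example, with the evaluation values `{"a": -1.0, "b": -2.5, "c": -4.0, "d": -9.0}`, `prune_top_n(scores, 2)` keeps `a` and `b`, while `prune_threshold(scores, 3.5)` keeps `a`, `b` and `c`.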
One disadvantage of the conventional beam search method is that it takes time to select phoneme models. That is, scheme (1) above of carrying forward a fixed number of phoneme models starting from the model of the highest evaluation value must sort the evaluation values of all phoneme models. Sorting generally takes time. According to scheme (2) above whereby only the phoneme models with their evaluation values higher than a threshold are carried forward, it also takes time to compute the threshold value.
It is therefore an object of the present invention to provide a speech recognition method suitable for minimizing computing time and for reducing the required memory capacity, and a speech recognition system adopting that method.
In carrying out the invention and according to one aspect thereof, there is provided a speech recognition method for collating a portion of speech (e.g., a frame) with part of a plurality of speech recognition units (e.g., phonemes or phoneme elements) representing speech candidates. Depending on the result of the collation with the current speech portion, the method dynamically selects the part of the speech recognition units that is to be collated with the next speech portion. Because only the necessary parts of the speech recognition units are collated, the processing time and memory area needed for collation are significantly reduced.
The inventive speech recognition method comprises the steps of:
(a) collating one of the plurality of speech candidates successively with an ordered plurality of speech parts obtained by dividing the target speech; and
(b) performing the step (a) on another plurality of speech candidates;
wherein the step (a) includes the steps of:
(a1) determining a plurality of likelihoods representing similarities between one of the ordered plurality of speech parts on the one hand, and a portion of speech recognition units constituting part of an ordered plurality of speech recognition units representing one of the plurality of speech candidates on the other hand;
(a2) determining a plurality of evaluation values representing similarities between the portion of speech recognition units and the target speech, based on the plurality of likelihoods determined in the step (a1) and on a plurality of transition probabilities corresponding to different combinations of the portion of speech recognition units; and
(a3) determining, based on the determined plurality of evaluation values, a new portion of speech recognition units for use with the next speech part in the ordered plurality of speech parts;
wherein the new portion of speech recognition units is used when the step (a) is carried out on the next speech part in the ordered plurality of speech parts.
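Steps (a1) to (a3) can be sketched for a single speech candidate as follows. This is a minimal illustration, not the claimed implementation. The text above does not state the rule by which step (a3) chooses the new portion, so the sketch assumes a hypothetical one: a window of fixed size (at least two units) moves forward by one unit when the evaluation value of its last unit exceeds that of its first. The likelihood function, the two transition probabilities and all names are likewise assumptions.

```python
import math

NEG_INF = float("-inf")


def log_likelihood(part, unit_mean):
    """Toy likelihood (assumption): negative squared distance to the unit's mean vector."""
    return -sum((x - m) ** 2 for x, m in zip(part, unit_mean))


def collate_candidate(parts, units, window=2,
                      log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Collate one candidate with the ordered speech parts, evaluating only a
    moving portion (window) of its speech recognition units for each part.

    Returns the final evaluation values (unit index -> value) and the window
    start used for each speech part.
    """
    start = 0
    evals = {}    # evaluation values kept from the previous speech part
    history = []
    for t, part in enumerate(parts):
        active = range(start, min(start + window, len(units)))
        history.append(start)
        # (a1) likelihoods between this speech part and the active units only
        like = {i: log_likelihood(part, units[i]) for i in active}
        # (a2) evaluation values from the likelihoods and transition probabilities
        new_evals = {}
        for i in active:
            if t == 0:
                prev = 0.0 if i == 0 else NEG_INF
            else:
                stay = evals.get(i, NEG_INF) + log_stay
                advance = evals.get(i - 1, NEG_INF) + log_next
                prev = max(stay, advance)
            new_evals[i] = prev + like[i]
        evals = new_evals
        # (a3) choose the portion for the next speech part (hypothetical rule)
        first, last = active[0], active[-1]
        if last + 1 < len(units) and evals[last] > evals[first]:
            start += 1
    return evals, history
```

With three units and a window of two, at most two units are evaluated per speech part, and only the evaluation values of the active units are kept for the next part.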