1. Field of the Invention
The present invention relates to a method and apparatus for speech recognition which performs recognition of the speech of an unspecified speaker by referring to a word dictionary in which the phonemes of words are stored.
2. Description of the Related Art
Recently, techniques using phonemes or syllables as a unit have been investigated in speech recognition apparatus. Such techniques depend on the following considerations.
In a speech recognition apparatus targeting large vocabularies, a large memory capacity is required to store the standard patterns for every word. In addition, much labor is required to register these words and it becomes difficult to append new words. In contrast, the method using phonemes and the like as a basic unit for recognition eliminates these problems, since the words written in Roman characters (romaji) or in Japanese syllables (kana) can be stored in a dictionary.
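The storage advantage described above can be illustrated with a minimal sketch. It assumes a dictionary keyed by the romaji spelling of each word, with the phoneme sequence as the value. The names `word_dictionary`, `register_word` and `phoneme_inventory`, and the two sample words, are hypothetical and are not taken from any particular apparatus.

```python
# Hypothetical phoneme-based word dictionary; all names here are illustrative.
word_dictionary = {}


def register_word(romaji, phonemes):
    """Store a word as its phoneme sequence; no recorded word template is needed."""
    word_dictionary[romaji] = tuple(phonemes)


def phoneme_inventory(dictionary):
    """Return the sorted set of phoneme units that the stored words rely on."""
    return sorted({p for sequence in dictionary.values() for p in sequence})


# Appending a new word only requires its written form.
register_word("aka", ["a", "k", "a"])
register_word("yama", ["y", "a", "m", "a"])
```

In such a scheme the acoustic standard patterns would be held once per phoneme rather than once per word, so the vocabulary can grow without recording new speech.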
However, speech recognition is not easy, since there are variations in the spectrum of spoken phonemes, effects of combinations of intonations, and phonemes that are difficult to recognize, such as plosives. Furthermore, when the speech of unspecified speakers is to be recognized, individual differences also affect the result, making recognition even more difficult.
Therefore, the following techniques have been investigated to deal with these problems:
(1) learning vowels;
(2) the statistical discrimination method;
(3) the hidden Markov model; and
(4) the multi-template method.
However, since the way in which phonemes appear in Japanese differs from one group of phonemes to another, speech is difficult to recognize with a single uniform method.
For example, vowels are characterized by the relative positions of the formants, whereas semivowels, plosives and so on are each characterized by a characteristic change in the spectrum over time. Furthermore, although certain changes in the spectrum are characteristic of both semivowels and plosives, they differ in that the spectrum changes relatively slowly for semivowels, whereas it changes rapidly within a short time for plosives.
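The contrast between slow and rapid spectral change can be made concrete with a rough sketch. It assumes that each analysis frame is a short list of band energies and uses the largest frame-to-frame Euclidean distance as the measure of change. The function `peak_spectral_change` and both synthetic frame sequences are hypothetical illustrations, not measured speech data.

```python
import math


def peak_spectral_change(frames):
    """Largest Euclidean distance between consecutive spectral frames."""
    steps = [math.dist(a, b) for a, b in zip(frames, frames[1:])]
    return max(steps, default=0.0)


# Synthetic two-band spectra over 11 frames (illustrative values only).
# Semivowel-like: energy moves gradually from band 0 to band 1.
slow_glide = [[1.0 - 0.1 * i, 0.1 * i] for i in range(11)]
# Plosive-like: the same overall movement occurs in a single frame step.
abrupt_burst = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 6

glide_change = peak_spectral_change(slow_glide)
burst_change = peak_spectral_change(abrupt_burst)
```

Both sequences cover the same total spectral movement, so their average change per frame is equal. Only a measure that is sensitive to how the change is distributed in time, such as the peak used here, separates them.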
In recognizing these differently characterized phonemes, the conventional apparatuses are defective in that a high recognition rate cannot be obtained, because all the phonemes are recognized uniformly using only one of the above-described methods. For example, detection of the characteristics of segments aimed at the recognition of continuous speech (Kosaka, et al., Japanese Acoustics Society, Voice Section, S85-53, December 1985) can be cited as a method belonging to the aforementioned method (2). Because this method is devised to follow changes of the spectrum over time, its recognition rates for plosives and the like are high. However, it is not suitable for recognizing phonemes such as semivowels, whose spectrum changes slowly and whose time structure varies, because it does not cope with variation in the way the spectrum changes over time.
In addition, there are systems which, after rough segmentation, perform phoneme recognition on each roughly classified group using a different method for each group, such as a system devised at MIT Lincoln Institute (ref. Yasunaga Niimi, Speech Recognition, pp. 81-87, Kyoritsu Publishing, October 1979). However, such a system has the defect that a heavy burden is placed on the segmentation, and the recognition rate depends greatly on the quality of the segmentation.