Speech recognition methods are disclosed in Transaction of The Institute of Electronics and Communication Engineers of Japan. Vol. J63-D No. 12 pp. 1002–1009, December, 1980 and Japanese Patent Application Non-examined Publication No. H10-282986. In these speech recognition methods, speakers are classified in advance by characteristics such as age, and patterns are trained for each class.
Speaker adaptation methods have also been widely studied, for example in Wakita, H. “Normalization of Vowels by Vocal-Tract Length and Its Application to Vowel Identification,” IEEE (Institute of Electrical and Electronics Engineers) Trans. ASSP 25 (2): pp. 183–192 (1977). This speaker adaptation method warps the spectral frequency of a speaker's speech sound by using a single pattern.
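As a rough illustration of warping a spectrum along the frequency axis with a single pattern, the following minimal sketch assumes a linear warp controlled by one factor. It is not the procedure of the cited reference; the function name warp_spectrum, the parameter alpha, and the sample values are hypothetical and chosen only for illustration.

```python
def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum.

    Output bin k is read from input position k * alpha using linear
    interpolation; positions beyond the last bin are clamped to it.
    alpha is a hypothetical single warping factor (assumption).
    """
    n = len(spectrum)
    warped = []
    for k in range(n):
        pos = min(k * alpha, n - 1)   # source position on the input axis
        lo = int(pos)                 # lower neighbouring bin
        hi = min(lo + 1, n - 1)       # upper neighbouring bin (clamped)
        frac = pos - lo               # interpolation weight
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped


# Illustrative values only, not measured data.
spectrum = [0.0, 1.0, 4.0, 9.0, 16.0]
unchanged = warp_spectrum(spectrum, 1.0)   # alpha = 1 leaves the spectrum as is
stretched = warp_spectrum(spectrum, 0.5)   # alpha < 1 stretches features upward
print(unchanged)
print(stretched)
```

Because one factor is applied to the whole frequency axis, a warp of this kind can compensate for a global difference between speakers but cannot represent finer speaker-specific detail.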
Maximum a posteriori estimation (MAP estimation) and the like are known as speaker adaptation methods capable of assimilating detailed characteristics of a speaker. The MAP estimation is disclosed in Technical Report of IEICE (The Institute of Electronics, Information and Communication Engineers) Vol. 93 No. 427 pp. 39–46 (SP93-133, 1993).
This method, however, has a problem that when the training utterances accumulated beforehand as samples for the adaptation are extremely few, for example when only one utterance is spoken, the adaptation cannot improve speech recognition.
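The effect of sparse adaptation data can be illustrated with the standard MAP estimate of a Gaussian mean under a conjugate prior, in which the adapted mean is (tau × prior mean + sum of samples) / (tau + number of samples). The following is a minimal sketch under simplifying assumptions: a single scalar feature, a hypothetical prior weight tau, and made-up sample values. The formulation in the cited report may differ.

```python
def map_adapt_mean(prior_mean, samples, tau):
    """MAP estimate of a Gaussian mean with a conjugate prior.

    prior_mean: mean of the speaker-independent model
    samples:    adaptation observations from the new speaker
    tau:        prior weight (hypothetical value chosen for illustration)
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Illustrative values only, not measured data.
prior_mean = 0.0      # speaker-independent mean
speaker_value = 2.0   # value that every observation of the new speaker takes
tau = 10.0

one = map_adapt_mean(prior_mean, [speaker_value], tau)        # 1 observation
many = map_adapt_mean(prior_mean, [speaker_value] * 90, tau)  # 90 observations
print(one, many)
```

With one observation the adapted mean stays close to the prior mean, whereas with many observations it approaches the speaker's own value, which is consistent with the difficulty described above when only one utterance is available.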
A method for raising the recognition rate of a speaker-independent word recognizer is disclosed in, for example, Japanese Patent Application Non-examined Publication No. H5-341798. In this speech recognition method, a speaker first speaks one of the names given to a speech recognition apparatus, and the apparatus selects a database suited to the speaker based on the speech sounds. After that, the speaker speaks a word to be recognized, and the word is processed by speech recognition using the selected database.
This method, however, has a problem that the apparatus must always first examine whether or not the utterance of the speaker is the name of the device, which takes processing time. Additionally, this conventional apparatus simply selects the database to be used for the next utterance based on a determination of whether or not the speaker is adapted, so that a large memory capacity is required for storing the databases.
In the prior art discussed above, detailed characteristics of a speaker can hardly be assimilated from a few utterances, namely one word or several words at most, which results in insufficient speech recognition performance.
It is an object of the present invention to improve speech recognition performance by assimilating detailed characteristics of a speaker based on a few utterances, even if the memory capacity for storing databases is small.