1. Field of the Invention
The present invention relates to speech processing apparatuses, speech processing methods, and recording media therefor. More particularly, the invention relates to a speech processing apparatus and a speech processing method for performing easy and highly precise adaptation of models used for speech recognition. The invention also relates to a recording medium for storing a program implementing the above-described method.
2. Description of the Related Art
One known speech recognition algorithm is the Hidden Markov Model (HMM) method, which recognizes input speech by using models. More specifically, in the HMM method, models (HMMs) defined by a transition probability (the probability of a transition from one state to another state) and an output probability (the probability of a certain symbol being output when the state transition occurs) are determined in advance by learning, and the input speech is then recognized by using the models.
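The recognition step described above can be illustrated with a minimal sketch of the forward algorithm for discrete HMMs. The sketch is not taken from the cited art: it attaches output probabilities to states rather than to transitions (a common simplification), and the function names, the word labels "yes" and "no", and all probability values are hypothetical.

```python
def forward_probability(initial, transition, emission, observations):
    """Return P(observations | model) for a discrete HMM (forward algorithm).

    initial[i]       -- probability of starting in state i
    transition[i][j] -- probability of a transition from state i to state j
    emission[j][s]   -- probability of symbol s being output in state j
                        (assumption: state-based output probabilities)
    """
    n_states = len(initial)
    # alpha[j]: probability of the symbols seen so far, ending in state j
    alpha = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognize(models, observations):
    """Return the label of the model that best explains the observations."""
    return max(
        models,
        key=lambda label: forward_probability(*models[label], observations),
    )


# Hypothetical two-state models for two words; the values are illustrative only.
models = {
    "yes": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
result = recognize(models, [0, 0, 1])
```

In this sketch each word has its own model, and the word whose model assigns the highest probability to the observed symbol sequence is taken as the recognition result.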
In speech recognition, on-line adaptation processing is known in which the models are sequentially adapted by using the input speech in order to improve the recognition accuracy. According to this on-line adaptation processing, the precision of the acoustic models is progressively enhanced, and the language models are progressively adapted to the task, as the amount of speech input by the speaker increases. Thus, this processing is an effective means for improving the recognition accuracy.
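As a rough illustration of sequential adaptation, the following sketch folds each new observation into a single model parameter as it arrives, so that the estimate is refined as more speech is input. It is an assumption made for illustration and not the procedure of any cited reference: the function adapt_mean, the prior weight, and the feature values are all hypothetical, and a practical acoustic model would adapt many parameters at once.

```python
def adapt_mean(mean, weight, observation):
    """Fold one new observation into a running mean estimate.

    weight is a hypothetical confidence value: the number of observations'
    worth of evidence that the current mean is assumed to carry.
    """
    new_mean = (weight * mean + observation) / (weight + 1.0)
    return new_mean, weight + 1.0


# Start from a pre-trained mean of 0.0 that is trusted like one observation.
mean, weight = 0.0, 1.0
for feature in [2.0, 2.0]:  # illustrative feature values from the speaker
    mean, weight = adapt_mean(mean, weight, feature)
```

Each utterance moves the estimate toward the speaker's data, and the growing weight makes later updates smaller, which is why the benefit depends on the amount of speech input.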
Methods for adapting the models fall broadly into two types: “supervised learning,” in which adaptation is performed by using a correct answer provided by a supervisor, and “unsupervised learning,” in which adaptation is performed by using supervisor data that may be a correct answer (i.e., it is not certain that the data is actually correct).
One conventional “unsupervised learning” method is disclosed in, for example, Japanese Unexamined Patent Application Publication No. 11-85184, in which a speech recognition apparatus adapts the models to input speech by using the speech recognition result as a supervisor. In a conventional “unsupervised learning” method such as this one, the user is not asked to confirm whether the speech recognition result is correct. Thus, this method places less burden on the user, but the reliability of the data used as a supervisor is not high enough, so the models may not be sufficiently adapted to the speaker.
One conventional “supervised learning” method is discussed in, for example, Q. Huo et al., “A study of on-line quasi-Bayes adaptation for CDHMM-based speech recognition,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1996, pp. 705-708. In one such method, a speech recognition apparatus requests the user to utter a certain amount of speech, and the models are adapted by using that speech. Alternatively, the speech recognition apparatus requests the user to check whether the speech recognition result is correct, and the models are adapted by using the results determined to be correct.
However, the above-described model adaptation method that requires a certain amount of speech in advance is not suitable for on-line adaptation, and the model adaptation method that requests the user to check the speech recognition result imposes a heavy burden on the user.
Another method for adapting models is disclosed in, for example, Japanese Unexamined Patent Application Publication No. 10-198395, in which language models, or data for creating language models, are prepared for individual tasks, such as specific fields or topics, and the language models for different tasks are combined off-line to create a high-precision task-adapted language model. In order to perform on-line adaptation by employing this method, however, it is necessary to infer the type of task to which the speech belongs, which makes it difficult to perform adaptation by using a speech recognition apparatus alone.
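One common way to combine language models prepared for different tasks is linear interpolation, sketched below for simple word-probability (unigram) models. This is an assumption made for illustration; the cited publication may combine its models differently. The function interpolate, the task names, the vocabulary, and the weights are all hypothetical.

```python
def interpolate(task_models, weights):
    """Combine per-task word probabilities as a weighted sum.

    task_models maps a task name to a {word: probability} table, and
    weights maps the same task names to mixing weights that sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    # The combined vocabulary is the union of the per-task vocabularies.
    vocabulary = set().union(*task_models.values())
    return {
        word: sum(
            weights[task] * model.get(word, 0.0)
            for task, model in task_models.items()
        )
        for word in vocabulary
    }


# Hypothetical task-specific models; the probabilities are illustrative only.
task_models = {
    "weather": {"rain": 0.7, "score": 0.1, "today": 0.2},
    "sports": {"rain": 0.1, "score": 0.7, "today": 0.2},
}
combined = interpolate(task_models, {"weather": 0.75, "sports": 0.25})
```

The mixing weights must reflect the task of the speech being recognized, which is why an on-line use of this approach needs some way to infer that task.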