1. Field of the Invention
The present invention relates to a pattern recognition apparatus and a pattern recognition method for use in, for example, speech recognition and character recognition.
2. Description of the Related Art
The following description uses speech recognition as an example, but it applies equally to other types of recognition. Over the last 10 years, the performance of speech recognition has improved significantly. One of the biggest factors is that the method of training an acoustic model has shifted from maximum likelihood (ML) training to discriminative training. Discriminative training aims at improving the performance of a single system by referring to correct labels.
In contrast, approaches based on system integration (for example, recognizer output voting error reduction: ROVER) aim at improving the performance by using multiple systems.
Specifically, these approaches can select a better hypothesis from among the hypotheses of a base system and complementary systems based on a majority rule. As a result, even if the performance of the complementary systems is lower than that of the base system, higher performance can be obtained than when only the base system is used.
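The majority rule described above can be illustrated with a minimal sketch. It assumes that the word sequences output by the systems have already been aligned to equal length, and that ties are resolved in favor of the base system; an actual ROVER implementation first aligns the hypotheses and may also use confidence scores. The function name `majority_vote` and the example sentences are hypothetical and are not taken from any cited system.

```python
from collections import Counter


def majority_vote(aligned_hypotheses):
    """Pick, at each aligned position, the word output by the most systems.

    aligned_hypotheses: list of equal-length word lists. The first list is
    treated as the base system and wins ties (an assumption of this sketch).
    """
    if not aligned_hypotheses:
        return []
    length = len(aligned_hypotheses[0])
    if any(len(h) != length for h in aligned_hypotheses):
        raise ValueError("hypotheses must be pre-aligned to equal length")
    result = []
    for position in zip(*aligned_hypotheses):
        counts = Counter(position)
        best = max(counts.values())
        # Scan in system order so the earliest system wins a tie.
        result.append(next(w for w in position if counts[w] == best))
    return result


# Hypothetical outputs for the utterance "the cat sat on the mat".
base = ["the", "cat", "sat", "on", "a", "mat"]      # one error
comp1 = ["the", "cat", "sat", "in", "the", "mat"]   # one error
comp2 = ["a", "cat", "sad", "on", "the", "mat"]     # two errors
combined = majority_vote([base, comp1, comp2])
```

In this example every system contains at least one error, and `comp2` is worse than the base system, yet the voted result contains none, because the errors fall at different positions.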
Meanwhile, there is a known technology in which, when there are multiple models, training data to be used to train a certain specific model is efficiently selected for the purpose of reinforcing that model (see, for example, Japanese Patent Application Laid-open No. 2012-108429). This technology is related to the present invention in that utterances having a low recognition rate are selected by using recognition results for the multiple models including the specific model, and the specific model is updated and trained by using the selected utterances with the corresponding correct labels. However, this technology is focused on selecting the training data, and it also differs from the present invention in the configuration of the training system.
There is also a known technology in which weights are determined for speech feature statistics of correct labels and speech feature statistics of error hypotheses. These weights are used to compensate the speech feature statistics of the correct labels and the error hypotheses, which can then be used to compute additional speech feature statistics for each discriminative criterion (e.g., minimum classification error, maximum mutual information, or minimum phone error), to thereby update an acoustic model (see, for example, Japanese Patent Application Laid-open No. 2010-164780). This technology is partially related to the present invention in that a single acoustic model is updated, but it does not describe the use of multiple models.
There is also a known technology in which multiple models are constructed so as to be optimized for each environment (see, for example, Japanese Patent Application Laid-open No. 2010-204175). Unlike the present invention, this technology does not construct a combination of systems so as to improve performance, and it also differs in the configuration of the training system.
Further, there is a known technology in which a statistical model is constructed for each of N training data sets and the statistical model that gives the highest recognition rate is selected (see, for example, Japanese Patent Application Laid-open No. 2010-152751). Unlike the present invention, this technology does not construct multiple systems simultaneously.
In system integration, it is effective to integrate hypotheses having different tendencies, and in order to construct a complementary system having a different output tendency, different features and model training methods are used. However, when the hypothesis of the complementary system exhibits a tendency similar to that of the base system or includes too many errors, the system integration does not always improve performance.
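One simple way to quantify whether a complementary system has a tendency similar to that of the base system is to measure how many of the base system's errors it shares. The following is a minimal sketch under the assumption that hypotheses and the reference are aligned word for word; the function names `error_positions` and `error_overlap` and the example sentences are hypothetical illustrations, not part of any cited technology.

```python
def error_positions(hypothesis, reference):
    """Return the set of aligned positions where the hypothesis is wrong."""
    return {i for i, (h, r) in enumerate(zip(hypothesis, reference)) if h != r}


def error_overlap(base_hyp, comp_hyp, reference):
    """Fraction of the base system's errors that the complementary system shares.

    A value near 1.0 means the two systems fail in the same places, so
    voting has little to correct; a value near 0.0 means the errors differ.
    """
    base_errors = error_positions(base_hyp, reference)
    if not base_errors:
        return 0.0
    comp_errors = error_positions(comp_hyp, reference)
    return len(base_errors & comp_errors) / len(base_errors)


reference = ["the", "cat", "sat", "on", "the", "mat"]
base = ["the", "cat", "sad", "on", "a", "mat"]
similar = ["the", "cat", "sad", "on", "a", "hat"]      # repeats the base errors
different = ["the", "bat", "sat", "on", "the", "mat"]  # errs elsewhere

overlap_similar = error_overlap(base, similar, reference)
overlap_different = error_overlap(base, different, reference)
```

Here `similar` shares both errors of the base system and adds one of its own, so integrating it cannot repair the base output, whereas `different` makes its single error at a position where the base system is correct.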
In order to address this problem, the conventional practice has often been to create a number of systems and then determine the best few combinations of the multiple system outputs based on their performance on a development set. With such trial-and-error attempts, the systems are overtuned to a specific task, and robustness against unknown data is reduced. Therefore, it is desired that the complementary system be constructed based on theoretical training criteria.
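The conventional trial-and-error procedure can be sketched as an exhaustive search over subsets of complementary systems, scored on a development set. This is a minimal illustration that assumes word-aligned hypotheses, a single development utterance, and a simple majority vote with ties going to the base system; the names `vote`, `count_errors`, `best_combination`, and the system labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def vote(hypotheses):
    """Majority vote per aligned position; the earliest system wins ties."""
    output = []
    for position in zip(*hypotheses):
        counts = Counter(position)
        best = max(counts.values())
        output.append(next(w for w in position if counts[w] == best))
    return output


def count_errors(hypothesis, reference):
    """Number of aligned positions where the hypothesis is wrong."""
    return sum(h != r for h, r in zip(hypothesis, reference))


def best_combination(base, complementary, reference, size):
    """Try every subset of `size` complementary systems together with the
    base system and return the subset with the fewest development errors."""
    best_names, best_errors = None, None
    for names in combinations(sorted(complementary), size):
        hypotheses = [base] + [complementary[n] for n in names]
        errors = count_errors(vote(hypotheses), reference)
        if best_errors is None or errors < best_errors:
            best_names, best_errors = names, errors
    return best_names, best_errors


reference = ["the", "cat", "sat", "on", "the", "mat"]
base = ["the", "cat", "sad", "on", "a", "mat"]
complementary = {
    "sys_a": ["the", "cat", "sad", "on", "a", "hat"],
    "sys_b": ["the", "bat", "sat", "on", "the", "mat"],
    "sys_c": ["the", "cat", "sat", "in", "the", "mat"],
}
chosen, dev_errors = best_combination(base, complementary, reference, size=2)
```

The search selects whichever subset happens to score best on the development data, and the number of subsets grows combinatorially with the number of candidate systems. Nothing in the procedure guarantees that the chosen combination remains the best on unseen data, which is the overtuning concern noted above.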