Methods for recognizing words spoken in noisy environments have been devised heretofore; typical known methods include the PMC (Parallel Model Combination) method, the SS/NSS (Spectral Subtraction/Nonlinear Spectral Subtraction) method, and the SFE (Stochastic Feature Extraction) method.
In each of the above-mentioned methods, a feature quantity is extracted from voice data in which a spoken voice exists together with environment noise; it is then judged which of the acoustic models corresponding to a plurality of previously registered words best matches the feature quantity, and the word corresponding to the best-matching acoustic model is output as the recognition result.
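The matching step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: each registered word is represented here by a single diagonal-covariance Gaussian standing in for its acoustic model, and the names `log_gaussian`, `recognize`, and `models`, as well as the numerical values, are hypothetical and do not appear in the methods cited above.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def recognize(feature, models):
    """Return the registered word whose model scores the feature highest."""
    return max(models, key=lambda word: log_gaussian(feature, *models[word]))


# Hypothetical registered words: word -> (mean vector, variance vector).
# A single Gaussian is an assumed stand-in for a full acoustic model.
models = {
    "yes": ([1.0, 0.0], [0.5, 0.5]),
    "no": ([-1.0, 2.0], [0.5, 0.5]),
}

# Feature quantity extracted from the voice data (made-up values).
result = recognize([0.8, 0.1], models)
```

In this sketch the recognition result is simply the word with the highest score; a practical recognizer would score a sequence of feature quantities against each model rather than a single vector.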
The features of the above-mentioned methods are as follows. In the PMC method, the correct recognition rate is excellent because environment noise information is directly incorporated into the acoustic model, but the calculation cost is high (heavy computation is required, so the scale of the device becomes large and the processing time becomes long). In the SS/NSS method, the environment noise is eliminated at the stage of extracting the feature quantity of the voice data. Hence, the calculation cost is lower than that of the PMC method, and this method is widely used at present. In the SS/NSS method, the feature quantity of the voice data is extracted as a vector. In the SFE method, the environment noise is eliminated at the stage of extracting the feature quantity of the mixed signal, in the same way as in the SS/NSS method; however, the feature quantity is extracted as a probability distribution.
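The noise elimination performed at the feature extraction stage in a spectral subtraction approach may be sketched as follows. This is a minimal sketch of one commonly used form of power spectral subtraction, not necessarily the exact processing of the SS/NSS method referred to above. The function names, the over-subtraction factor `alpha`, and the spectral floor `beta` are assumptions introduced for illustration.

```python
def estimate_noise(noise_frames):
    """Average the power spectra of frames assumed to contain only noise."""
    count = len(noise_frames)
    return [sum(bins) / count for bins in zip(*noise_frames)]


def spectral_subtract(noisy_power, noise_power, alpha=1.0, beta=0.01):
    """Subtract the scaled noise estimate from each frequency bin.

    alpha and beta are assumed tuning parameters: alpha scales the noise
    estimate, and beta keeps each bin at or above a small fraction of the
    noisy power so that the result never becomes negative.
    """
    return [
        max(p - alpha * n, beta * p)
        for p, n in zip(noisy_power, noise_power)
    ]


# Two hypothetical noise-only frames with two frequency bins each.
noise = estimate_noise([[1.0, 2.0], [1.0, 2.0]])

# Power spectrum of one frame of voice mixed with noise (made-up values).
cleaned = spectral_subtract([4.0, 1.0], noise)
```

The cleaned spectrum would then be converted into a feature vector in the usual way; the point of the sketch is only that the noise is removed before recognition, so the acoustic models themselves are left unchanged.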
In the SFE method, however, the environment noise is not directly reflected at the speech recognition stage; that is, information on the environment noise is not directly incorporated into the silence acoustic model. Consequently, there has been a problem in that the correct recognition rate is insufficient.
In addition, because the information on the environment noise is not directly incorporated into the silence acoustic model, there has also been a problem in that the correct recognition rate decreases as the time from the start of speech recognition to the start of the speech becomes longer.