As is well known, speech recognition technology achieves high recognition performance, with a word recognition rate (accuracy) of 95% or more for vocabularies of tens of thousands of words, only when speech recognition is performed in a relatively quiet environment.
However, because various noises are present in the actual environments where speech recognition technology is used, the accuracy decreases rapidly in practice. For practical use, speech recognition technology needs to maintain high accuracy in any noise environment.
To improve the recognition performance of a speech recognizer in noise environments, it is necessary to evaluate the recognition performance in the noise environments where the speech recognizer is actually used, analyze the factors that lower the recognition performance, and, based on the result of the analysis, improve the recognition method to account for noise and develop suitable noise reduction or removal technology.
Accurately evaluating the performance of the speech recognizer in various noise environments is therefore essential to improving that performance.
According to a conventional method for evaluating the performance of a speech recognizer, a person utters speech into a microphone, the uttered speech data is collected to build a speech DB (database) for evaluation, and the speech recognizer is run off-line on that DB to evaluate the speech recognition performance. That is, in the conventional method, a person directly utters some or all of the words registered in the speech recognizer in the noise environments where the speech recognizer is actually used, utterance files for evaluation are generated by recording the uttered words, and a final evaluation set is constructed in which a correct answer text is provided for each utterance file.
The evaluation set is expressed by the following Equation 1.

$$T = \{(t_1, y_1), (t_2, y_2), \ldots, (t_N, y_N)\} \qquad \text{[Equation 1]}$$

where $t_i$ and $y_i$ are the ith utterance file for evaluation and the correct answer text thereof (for example, a word, a word sequence, or a sentence), respectively.
The conventional method passes the ith utterance file $t_i$ through the speech recognizer to obtain an output text $o_i$ as the recognition result, compares the output text $o_i$ with the correct answer text $y_i$ for every i, and calculates the accuracy from these comparisons, thereby evaluating the performance of the speech recognizer.
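The conventional evaluation loop described above can be sketched as follows. This is only an illustrative sketch, not the method of any particular recognizer: the function name `evaluate_accuracy`, the stand-in `fake_recognizer`, and the file names are hypothetical, and the comparison is assumed to be an exact match of the whole text after trimming whitespace, since the source does not specify how the output text and the correct answer text are compared.

```python
from typing import Callable, List, Tuple

# Evaluation set T = {(t_i, y_i)}: each entry pairs an utterance file
# with its correct answer text.
EvaluationSet = List[Tuple[str, str]]


def evaluate_accuracy(evaluation_set: EvaluationSet,
                      recognize: Callable[[str], str]) -> float:
    """Return the fraction of utterances whose output text matches the answer."""
    if not evaluation_set:
        raise ValueError("evaluation set is empty")
    correct = 0
    for utterance_file, answer_text in evaluation_set:
        output_text = recognize(utterance_file)  # o_i for utterance t_i
        # Assumption: exact match after trimming surrounding whitespace.
        if output_text.strip() == answer_text.strip():
            correct += 1
    return correct / len(evaluation_set)


def fake_recognizer(utterance_file: str) -> str:
    """Hypothetical stand-in for a real recognizer: returns canned outputs."""
    canned = {
        "utt_001.wav": "call home",
        "utt_002.wav": "play music",
        "utt_003.wav": "stop",
    }
    return canned.get(utterance_file, "")


# Hypothetical evaluation set with N = 4 entries.
evaluation_set = [
    ("utt_001.wav", "call home"),
    ("utt_002.wav", "play music"),
    ("utt_003.wav", "start"),      # recognizer outputs "stop": counted as an error
    ("utt_004.wav", "navigate"),   # recognizer outputs nothing: counted as an error
]

accuracy = evaluate_accuracy(evaluation_set, fake_recognizer)
print(f"accuracy = {accuracy:.2f}")
```

In this sketch two of the four utterances are recognized correctly, so the computed accuracy is 0.50. A real evaluation would replace `fake_recognizer` with a call into the recognizer under test and would read the evaluation set from the recorded speech DB.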
However, in the conventional method, the uttered speech DB for evaluation needs to be rebuilt every time the speech recognizer is used in a different noise environment, for example, inside a moving car, in an exhibit hall, or the like. To this end, a number of people need to utter the speech directly whenever speech signals are to be collected for evaluation.
Moreover, when a person utters the speech directly, the volume of the uttered speech signal cannot be accurately controlled. In addition, since noise characteristics change considerably over time even within a specific noise environment, for example, an exhibit hall, it is impossible to collect speech signals for evaluation under all of these noise conditions.