In general, a speech recognition system is trained on the speech of native speakers and is therefore optimized for such speech. As a result, the speech recognition system recognizes the speech of a native speaker well, but performs poorly in recognizing the speech of a non-native speaker.
FIG. 1 shows an example of recognition performance for a Korean speaker's English pronunciation in an English speech recognition system. That is, in an English speech recognition system trained on English speech data from native English speakers, a word error rate of 4.21% occurs when English speech uttered by a native speaker is recognized. In contrast, a word error rate of 39.22% occurs when English speech uttered by a Korean speaker is recognized, which shows that the recognition performance of the speech recognition system is considerably degraded for the non-native speaker.
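The text does not define how the word error rate is computed. The following minimal sketch assumes the conventional definition, i.e., the number of word substitutions, deletions, and insertions needed to turn the recognized word sequence into the reference, divided by the number of reference words. The function name and the sample sentences are hypothetical and are not taken from the experiments of FIG. 1.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent (assumed conventional definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, a word error rate computed this way can exceed 100% for a sufficiently poor hypothesis.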
In order to enhance the recognition performance for a non-native speaker, there is a method of training a recognition system on speech data from non-native speakers. However, databases of non-native speakers' speech for training a speech recognition system are not yet sufficient. Currently, as an English speech database of Korean speakers for training a speech recognition system, “English Pronunciation by Korean” is provided by the Speech Information Technology & Industry Promotion Center.
A continuous speech recognition system roughly includes two modules (i.e., a feature vector extraction module and a speech recognition module) and three models (an acoustic model, a pronunciation model, and a language model), as shown in FIG. 2. In other words, when speech is input to the continuous speech recognition system, a feature vector is extracted from the input speech by the feature vector extraction module. Generally, 12 Mel Frequency Cepstral Coefficients (MFCCs), log energy, and their first and second order derivatives are used as the feature vector of a speech recognition system. The speech recognition module then recognizes the extracted feature vector by using the acoustic model, the pronunciation model, the language model, etc. Therefore, studies for performance enhancement of a speech recognition system for non-native speakers are classified into approaches from an acoustic model point of view, a pronunciation model point of view, and a language model point of view. The present invention is proposed in consideration of the acoustic model point of view.
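The feature vector described above has 13 static components per frame (12 MFCCs plus log energy), and appending the first and second order derivatives gives 39 components. The following sketch shows one common way to assemble such a vector. It assumes a regression-based derivative over a window of two frames on each side with edge frames replicated; the text does not specify the derivative formula, so that choice, the function names, and the synthetic input are assumptions made only for illustration.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of each component (assumed formula).

    frames: list of per-frame lists of floats. Frames beyond either end of
    the utterance are replaced by the nearest edge frame.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        result.append(row)
    return result


def build_feature_vectors(static):
    """Append first and second order derivatives to the static features."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


if __name__ == "__main__":
    # Synthetic stand-in for 12 MFCCs + log energy over 10 frames.
    static = [[float(t)] * 13 for t in range(10)]
    features = build_feature_vectors(static)
    print(len(features), len(features[0]))  # frames, components per frame
```

The static coefficients themselves would come from an MFCC front end, which is outside the scope of this sketch.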
From the acoustic model point of view, the acoustic model of a speech recognition system is adapted in order to enhance the recognition performance for a non-native speaker. Approaches of this kind are roughly classified into an acoustic model retraining method, which retrains the acoustic model on the speech of non-native speakers, and an acoustic model adaptation method, which adapts the acoustic model by using the speech of non-native speakers. The acoustic model retraining method requires a great amount of speech from non-native speakers, and also greatly degrades the recognition performance for native speakers while enhancing the recognition performance for non-native speakers.
FIG. 3 shows an example of recognition performance when an acoustic model is retrained on a Korean speaker's English speech. That is, referring to FIG. 3, when input speech is recognized by the acoustic model retrained on the Korean speaker's English speech, the word error rate for a Korean speaker's English speech is 26.87%, which is a relative reduction of about 31.49% from the 39.22% of FIG. 1, but the word error rate for a native speaker's English speech is 42.07%, which is a relative increase of about 899.29% from the 4.21% of FIG. 1. Consequently, the average word error rate for the Korean speaker's English speech and the native speaker's English speech is 34.47%, which is a relative increase of about 58.71%. For this reason, the acoustic model adaptation method using the speech of a non-native speaker is widely used instead of the acoustic model retraining method. Representative acoustic model adaptation methods include a maximum likelihood linear regression (MLLR) scheme and a maximum a posteriori (MAP) scheme.
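The relative figures quoted for FIG. 3 can be checked against the word error rates of FIG. 1 with a short calculation. The sketch below assumes that "relative" means the change divided by the word error rate before retraining; the function and variable names are illustrative only.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent; negative values are reductions."""
    return 100.0 * (after - before) / before


# Word error rates in percent, as quoted for FIG. 1 and FIG. 3.
native_trained = {"native": 4.21, "korean": 39.22}
retrained = {"native": 42.07, "korean": 26.87}

if __name__ == "__main__":
    for speaker in ("korean", "native"):
        change = relative_change(native_trained[speaker], retrained[speaker])
        print(f"{speaker}: {change:+.2f}%")
    # korean: -31.49%
    # native: +899.29%
```

The average figures depend on how the averaged baseline is rounded, so they are not reproduced here.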
FIG. 4 shows an example of the average recognition performance for a Korean speaker's English speech and a native speaker's English speech when the MLLR and MAP adaptation schemes are applied to an acoustic model trained on a native speaker's English speech. In this case, since the recognition performance for a native speaker's English speech is similar regardless of whether or not an acoustic model adaptation scheme is applied, only the average recognition performance will be considered. When the MAP scheme is applied, the word error rate is 9.80%, which is a relative reduction of about 54.88%. When the MLLR scheme is applied, the word error rate is 12.81%, which is a relative reduction of about 41.02%. When the MLLR and MAP schemes are combined, the word error rate is 10.26%, which is a relative reduction of about 52.72%. Accordingly, it can be understood that when an acoustic model adaptation scheme, such as the MAP scheme or the MLLR scheme, is applied, the average recognition performance for a Korean speaker's English speech and a native speaker's English speech is enhanced.
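For orientation, the two adaptation schemes compared above differ in how they move the Gaussian mean vectors of the acoustic model trained on native speech. The sketch below uses the commonly cited textbook forms for a single Gaussian mean: MAP interpolates between the prior mean and the adaptation data, and MLLR applies a shared affine transform. It is not the configuration used for FIG. 4, which the text does not describe. The prior weight `tau`, the per-frame occupancies, and the transform `A`, `b` are hypothetical inputs, and estimating the MLLR transform from data is omitted.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP update of one Gaussian mean (assumed textbook form).

    prior_mean:  mean from the model trained on native speech
    frames:      adaptation feature vectors (non-native speech)
    occupancies: per-frame posterior probability of this Gaussian
    tau:         weight given to the prior mean (hypothetical value)
    """
    total_occupancy = sum(occupancies)
    weighted_sum = [
        sum(g * x[k] for g, x in zip(occupancies, frames))
        for k in range(len(prior_mean))
    ]
    return [
        (tau * m + s) / (tau + total_occupancy)
        for m, s in zip(prior_mean, weighted_sum)
    ]


def mllr_adapt_mean(mean, A, b):
    """Apply an already estimated MLLR mean transform: A * mean + b."""
    return [
        sum(a * m for a, m in zip(row, mean)) + offset
        for row, offset in zip(A, b)
    ]


if __name__ == "__main__":
    frames = [[2.0, 2.0]] * 10
    print(map_adapt_mean([0.0, 0.0], frames, [1.0] * 10, tau=10.0))
    print(mllr_adapt_mean([1.0, 2.0], [[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0]))
```

With little adaptation data the MAP estimate stays close to the prior mean, whereas an MLLR transform shared by many Gaussians also moves means for which no adaptation frames were observed.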