1. Field of the Invention
The present invention relates to an apparatus, method and program for estimating the performance of a voice recognition apparatus.
2. Description of the Related Art
To develop a voice recognition apparatus, it is necessary to estimate whether the apparatus can exhibit the expected performance. The performance of the voice recognition apparatus is estimated by inputting thereto voice data recorded under various conditions, and analyzing the recognition results of the voice data. For instance, analyzing the recognition rate of the apparatus reveals the conditions under which the apparatus exhibits good performance, and the performance under such conditions can also be acquired as a numerical value, such as a recognition rate (see, for example, Proc. of the Spring Meeting of the Acoustical Society of Japan, published March 2003, pp. 159-160, “An Evaluation Method of ASR Performance by HMM-based Speech Synthesis” by R. Terashima et al.). Furthermore, if the voice recognition apparatus does not exhibit the expected performance, the causes can be identified in detail by analyzing the features of the erroneous recognition results, and the findings can be utilized to improve the performance.
Various items can be utilized for estimating the performance of voice recognition apparatuses. The following three items are typical ones: (1) variations in the vocabulary sets that can be detected by the voice recognition apparatus; (2) variations due to speakers (the sex of a speaker, the speed of speech, voice tone, intonation, accent, etc.); and (3) variations due to the environment (noise, the characteristics of a microphone, the characteristics of a voice transmission system, etc.). For each estimation item, a number of voices obtained under various conditions are input to the voice recognition apparatus to determine whether the apparatus exhibits good performance for each variation of each item. If the voice recognition apparatus exhibits good performance for every variation, i.e., if its performance differs only within a small range across the various conditions, it is determined to be an ideal voice recognition apparatus.
To estimate the performance of the voice recognition apparatus, the apparatus is analyzed from various points of view concerning the above-mentioned estimation items. The following two points of view are typical ones: (1) The general performance of the voice recognition apparatus concerning various items is checked (see, for example, Lecture Article Papers of the Acoustical Society of Japan, published Autumn 1999, pp. 169-170, “Large-Scale Japanese Voice Database in Light of Wide Range of Districts and Ages”, written by Matsui, Naito, et al.). To estimate the basic performance of the voice recognition apparatus, it is necessary to estimate the performance of the apparatus concerning all items. (2) The performance of the voice recognition apparatus under a particular condition is checked. To estimate the performance of a voice recognition apparatus for a particular purpose, some items can be fixed in the estimation because they do not vary for that purpose, and the performance of the apparatus is checked concerning variations in the remaining items only. Specifically, to estimate the performance of a voice recognition apparatus “whose detectable vocabulary sets are fixed” and “which is dedicated to men only”, the vocabulary sets and the sex of the speakers are fixed, and the performance of the apparatus is estimated concerning variations in the other items. In general, the to-be-estimated items depend upon the purpose of each voice recognition apparatus.
The following methods can be used to estimate the performance of voice recognition apparatuses from the above-mentioned points of view.
(1) To check the general performance of a voice recognition apparatus concerning various items, it is necessary to prepare a large number of sets of voice data for estimation that sufficiently cover the variations of all items. After the variations contained in the estimation voice data sets are checked for each estimation item, the recognition performance of the apparatus for each variation is determined by, for instance, a statistical method from the recognition results of the apparatus concerning the checked variations. As a result, the performance concerning all variations can be determined.
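As an illustration of method (1), the following minimal sketch groups recognition results by the variation of one estimation item and computes a recognition rate for each variation, together with the spread between the best and worst variation. It is only a sketch under stated assumptions: the record fields (`sex`, `noise`, `reference`, `recognized`), the function names, and the sample data are hypothetical, and the plain recognition rate stands in for whichever statistical method is actually used.

```python
from collections import defaultdict


def recognition_rate_by_variation(results, item):
    """Return {variation: recognition rate} for one estimation item.

    Each result is a dict holding the item values of the utterance plus the
    reference text and the recognized text (hypothetical field names).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        variation = r[item]
        total[variation] += 1
        if r["recognized"] == r["reference"]:
            correct[variation] += 1
    return {v: correct[v] / total[v] for v in total}


def performance_spread(rates):
    """Difference between the best and the worst recognition rate."""
    return max(rates.values()) - min(rates.values())


# Hypothetical recognition results, for illustration only.
results = [
    {"sex": "female", "noise": "quiet", "reference": "yes", "recognized": "yes"},
    {"sex": "female", "noise": "car", "reference": "no", "recognized": "no"},
    {"sex": "male", "noise": "quiet", "reference": "yes", "recognized": "yes"},
    {"sex": "male", "noise": "car", "reference": "stop", "recognized": "top"},
]

rates_by_sex = recognition_rate_by_variation(results, "sex")
rates_by_noise = recognition_rate_by_variation(results, "noise")
print(rates_by_sex)
print(performance_spread(rates_by_noise))
```

A small spread across the variations of every item corresponds to the case in which performance differs only within a small range under the various conditions.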
(2) To check the performance of the voice recognition apparatus under a particular condition, it is necessary to collect or newly record estimation voice data sets that cover the variations in each of the estimation items to be considered under the particular condition. In particular, when the design of the voice recognition apparatus concerning detectable vocabulary sets is changed, voice data corresponding to the changed vocabulary sets must be newly recorded. The performance of the voice recognition apparatus under a particular condition can be determined concerning each estimation item by checking the variations in the estimation data sets for each item, and determining the recognition performance of the apparatus concerning each variation by, for example, a statistical method.
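For method (2), whether existing estimation voice data already cover the required variations under the particular condition can be checked mechanically. The sketch below is an assumption-laden illustration, not a prescribed procedure: the metadata fields, the `missing_variations` helper, and the sample data set are all hypothetical.

```python
def missing_variations(dataset, fixed, required):
    """List the required variations not covered under the fixed condition.

    dataset  -- list of dicts describing each utterance (hypothetical fields)
    fixed    -- items fixed by the purpose of the apparatus, e.g. sex
    required -- {item: variations that the estimation must cover}
    """
    # Keep only the utterances that satisfy the particular condition.
    matching = [
        u for u in dataset
        if all(u.get(item) == value for item, value in fixed.items())
    ]
    missing = {}
    for item, wanted in required.items():
        present = {u.get(item) for u in matching}
        gap = sorted(set(wanted) - present)
        if gap:
            missing[item] = gap
    return missing


# Hypothetical metadata of already recorded utterances.
dataset = [
    {"sex": "male", "vocabulary": "digits", "speed": "slow", "noise": "quiet"},
    {"sex": "male", "vocabulary": "digits", "speed": "fast", "noise": "quiet"},
    {"sex": "female", "vocabulary": "digits", "speed": "normal", "noise": "car"},
]
fixed = {"sex": "male", "vocabulary": "digits"}
required = {"speed": ["slow", "normal", "fast"], "noise": ["quiet", "car"]}

gaps = missing_variations(dataset, fixed, required)
print(gaps)
```

Every entry reported as missing corresponds to voice data that would have to be collected or newly recorded before the estimation can be carried out.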
As stated above, to estimate the performance of a voice recognition apparatus, it is necessary to prepare sets of voice data for estimation corresponding to an estimation point of view. However, considerable time and expense are required to record a large amount of voice data used as estimation voice data sets.
In case (1), where the general performance of a voice recognition apparatus concerning various estimation items is determined, it is expensive to prepare a large number of sets of estimation voice data that cover the variations in all estimation items. Even if such data sets could be prepared, new or additional recording of estimation voice data will be needed when it becomes necessary to estimate items that were not expected at the time of the above-mentioned preparation, or when the number of variations in a certain item needs to be increased. In such cases, further time and expense are entailed.
On the other hand, in case (2), where the performance of a voice recognition apparatus under a particular condition is checked, if no estimation voice data sets exist that cover the variations in an estimation item to be considered under the particular condition, it is necessary to newly record voice data. This entails considerable time and expense. Thus, to estimate the performance of a voice recognition apparatus, considerable time and expense are involved in preparing voice data for estimation.
It is possible to artificially change already existing estimation voice data sets to cover various estimation items. For instance, concerning estimation items related to the environment (such as noise and microphone characteristics), variations can be added relatively easily by superimposing noise on the estimation voice data or by combining the microphone characteristics with the estimation voice data. However, it is very difficult to artificially change the speed of speaking or the tone of already existing voice data. It is almost impossible to modify already existing voice data to change the sex of the speaker of the voice data or the contents of the voice data. Therefore, the above-mentioned problem cannot be solved by the method of artificially changing already existing estimation voice data sets.
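The environment-related modifications mentioned above, which are the easy case, can be sketched as follows. This is a minimal illustration under stated assumptions: signals are plain lists of samples, noise is added at a chosen signal-to-noise ratio, and the microphone characteristics are modeled as a linear filter given by an impulse response. The function names and the sample values are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Superimpose noise on speech at the requested signal-to-noise ratio (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Scale the noise so that speech_power / scaled_noise_power matches snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def apply_impulse_response(signal, impulse_response):
    """Filter the signal with an impulse response (direct-form convolution).

    Assumes the microphone can be modeled as a linear filter; the output is
    truncated to the length of the input signal.
    """
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for k, h in enumerate(impulse_response):
            if i - k >= 0:
                out[i] += h * signal[i - k]
    return out


# Hypothetical toy signals, for illustration only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=0)
filtered = apply_impulse_response(speech, [0.0, 1.0])  # one-sample delay
print(noisy)
print(filtered)
```

No comparably simple signal operation changes the sex of the speaker or the spoken contents, which is why this kind of modification covers only the environment-related items.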