1. Technical Field
The present invention relates to the field of speech recognition, and in particular, to reducing the available speech elements within a speech grammar during a dialog.
2. Description of the Related Art
In speech recognition systems such as ViaVoice®, speech recognition can be performed by receiving a user spoken utterance through an input device such as a microphone or a headset. The received user spoken utterance can be analyzed and converted into speech elements. The analyzed speech elements can then be compared with speech elements accumulated in a database. Thus, characters and words that correspond to the entered speech elements can be extracted. Notably, the speech elements accumulated in the database need not be individually or independently stored, but rather can be stored in relation to a grammar which follows particular kinds of rules. For example, in the case of recognizing a four-digit number as shown in FIG. 9(a), four digits of <num1> are defined as <digits> wherein a predetermination has been made that Arabic numbers from 0 to 9 can be entered. Under this grammatical definition, a speech elements expression table can be defined as shown in FIG. 9(b). Specifically, “0” can correspond to the four speech elements of “ree”, “ree:”, “rei”, and “zero”. Similarly, “1” can correspond to “ichi”, “2” can correspond to three speech elements, “3” to one speech element, “4” to four speech elements, etc. FIG. 9(c) shows an example where the grammar of FIG. 9(a) has been applied to the speech elements expression of FIG. 9(b). The grammar and the speech elements expression of FIG. 9(c) can be used as practical base forms.
If received speech corresponding to <digits> is “zeroichinii:san”, the speech can be analyzed into speech elements wherein “zero”, “ichi”, “nii:” and “san” can be obtained. In that case, the numbers “0”, “1”, “2”, and “3” corresponding to each speech element can be obtained from the speech elements correspondence table. Each number can be applied to the grammatical definition such that the four characters “0123” can be obtained as a recognition result for <digits>.
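The lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the recognizer's actual implementation: it operates on a romanized string rather than on acoustic input, and the names READINGS and recognize_digits are hypothetical. The table holds only readings mentioned in this description and is not a reproduction of FIG. 9(b); in particular, only two readings are listed for “2”, and the reading “roku” for “6” is an assumption that does not appear in the text.

```python
# Illustrative digit-to-reading table (assumption: NOT a copy of FIG. 9(b)).
READINGS = {
    "0": ["ree", "ree:", "rei", "zero"],
    "1": ["ichi"],
    "2": ["ni", "nii:"],
    "3": ["san"],
    "4": ["shi", "yon"],
    "5": ["go", "goo:"],
    "6": ["roku"],  # assumption: this reading is not given in the text
    "7": ["nana", "shichi"],
    "8": ["hachi"],
    "9": ["ku", "kyuu:"],
}


def recognize_digits(utterance, length=4):
    """Return every digit string of `length` digits whose readings,
    concatenated in order, spell out `utterance` exactly."""
    results = []

    def extend(pos, digits):
        # The grammar is satisfied once `length` digits have been matched
        # and the whole utterance has been consumed.
        if len(digits) == length:
            if pos == len(utterance):
                results.append("".join(digits))
            return
        # Try every reading of every digit at the current position.
        for digit, readings in READINGS.items():
            for reading in readings:
                if utterance.startswith(reading, pos):
                    extend(pos + len(reading), digits + [digit])

    extend(0, [])
    return results


print(recognize_digits("zeroichinii:san"))  # prints ['0123']
```

The search tries “ni” before “nii:” at the third position and abandons it because no reading matches the remainder, which is a small instance of how short, overlapping readings add candidates that must be ruled out.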
In speech recognition systems such as ViaVoice®, a method for improving recognition accuracy called enrollment can be adopted. Enrollment can detect individual differences of received speech and study acoustic characteristics that fit each individual. When the reading of numbers in the Japanese language is considered, however, speech recognition accuracy of such numbers is not always high.
Several possible factors can be identified, each of which can decrease speech recognition system accuracy. One factor can be that the Japanese words for numbers such as “ichi”, “ni” and “san” are generally short and have less sound prolixity. There can be little difference among the speech elements of short words. Thus, misrecognition of speech elements can easily occur during speech recognition. Other Japanese words for numbers, such as “ni”, “shi”, “go” and “ku”, consist of a single syllable. The decreased sound prolixity for these words can be even more conspicuous.
Another factor can be that some Japanese words for numbers can be represented by a plurality of readings, speech elements, or pronunciations. For example, readings such as “zero”, “rei” and “maru” can correspond to the number “0”; “shi” and “yon” to “4”; “nana” and “shichi” to “7”; and “kyuu:” and “ku” to “9”. When a plurality of readings correspond to a single number, the number of speech element candidates to be recognized is increased. This can cause a higher probability of erroneous speech recognition.
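The growth in candidates can be illustrated with a short calculation. The sketch below counts only the alternative readings listed in the preceding paragraph and assumes, for illustration only, a single reading for every other digit; the names EXTRA_READINGS, per_position, and sequences are hypothetical.

```python
# Number of readings per digit, taken from the examples in the paragraph above.
# Assumption: digits not listed here are counted with one reading each.
EXTRA_READINGS = {"0": 3, "4": 2, "7": 2, "9": 2}

# Candidate readings the recognizer must consider at each digit position.
per_position = sum(EXTRA_READINGS.get(str(d), 1) for d in range(10))

# Candidate reading sequences for a four-digit number.
sequences = per_position ** 4

print(per_position)  # prints 15 (versus 10 with one reading per digit)
print(sequences)     # prints 50625 (versus 10000 with one reading per digit)
```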
Another factor can be that similar speech elements exist in different numbers. For example, the speech elements of “shichi” (7), “ichi” (1) and “hachi” (8) are similar to one another, as are the speech elements “shi” (4) and “shichi” (7). Additionally, the speech elements of “ni” (2) and “shi” (4) are similar, as well as those of “san” (3) and “yon” (4). Discrimination between such similar speech elements can be difficult due to the similarity of sound. As a result, erroneous recognition can become more probable. The problem can become more conspicuous where speech recognition is performed over a telephone line and the like where the available channel bandwidth is limited. For example, discriminating speech having the vowel “i” which requires recognition of a low frequency component can become more difficult with a limited bandwidth.
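As a rough illustration of the similarity described above, the sketch below computes the edit distance between the romanized readings of the pairs named in the preceding paragraph. This is only a crude proxy: it compares spellings, not acoustic features, and is not the measure used by any particular recognizer. The function name edit_distance and the list PAIRS are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Pairs of readings named as similar in the paragraph above.
PAIRS = [("shichi", "ichi"), ("shichi", "hachi"), ("ichi", "hachi"),
         ("shi", "shichi"), ("ni", "shi"), ("san", "yon")]

for first, second in PAIRS:
    print(first, second, edit_distance(first, second))
```

A small distance relative to the length of the words suggests that few cues separate the two readings, which is consistent with the difficulty of discrimination noted above.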
Other factors can include the pronunciation of one-syllable words with a long vowel, wherein the long vowel is not necessarily pronounced in every situation. In that case, discrimination of such syllables can be difficult. Pronunciations such as “ni”, “nii:”, “nii:nii:” and “go”, “goo:”, “goo:goo:” are examples. In particular, the character “5”, which is usually pronounced “goo:”, can be pronounced with a short vowel, as in “shichigosan” in the case of “753” and “sangoppa” in the case of “358”. “Goo:” further can be pronounced as “go”, or as “go” with a very short vowel and a plosive, which further can complicate the problem.
Speech recognition of numbers via telephone and the like is commonly used in various business applications. Examples can include entering member numbers, goods selection numbers, etc. Consequently, there can be significant benefits to improving speech recognition of numbers, especially with regard to the development of business applications.
It should be appreciated that enrollment can improve speech recognition accuracy to a certain extent by matching acoustic characteristics to individuals. Further improvement of speech recognition accuracy, however, can be limited in the case where received speech elements are similar to each other and the speech elements do not have prolixity as described above.