[Tsutsumi, Katoh, Kosaka and Kohda, “Lecture Speech Recognition Using Pronunciation Variant Modeling”, The transactions of the Institute of Electronics, Information and Communication Engineers J89-D (2), 305-313, 2006], and [Akita and Kawahara, “Generalized Statistical Modeling of Pronunciation Variations for Spontaneous Speech Recognition”, The transactions of the Institute of Electronics, Information and Communication Engineers J88-D2 (9), 1780-1789, 2005] describe an example of a pronunciation variation rule extraction apparatus. As shown in FIG. 1, this pronunciation variation rule extraction apparatus 200 is configured to include a base form pronunciation storage unit 201, a surface form pronunciation storage unit 202, a difference extraction unit 203 and a pronunciation variation counter unit 204.
The pronunciation variation rule extraction apparatus 200 having such a configuration operates as follows. That is, the difference extraction unit 203 reads transcription texts from the base form pronunciation storage unit 201 and the surface form pronunciation storage unit 202 and extracts the differences between them, namely, the portions that differ.
Here, the base form pronunciation storage unit 201 and the surface form pronunciation storage unit 202 store transcription texts obtained by transcribing the pronunciation content of a long duration of speech data. More specifically, the base form pronunciation storage unit 201 stores the following transcription text, for example.
“Sono youna shujutsu wo hobo mainichi okonai mashi ta (in Hiragana)”
The surface form pronunciation storage unit 202 stores, in a format corresponding to the transcription text stored in the base form pronunciation storage unit 201, the following transcription text, for example.
“Sono youna shijitsu wo hobo mainchi okonai mashi ta (in Hiragana)”
The base form pronunciation storage unit 201 stores, as the transcription text, the base form pronunciation of the underlying speech data, namely, the pronunciation that would be observed if the content were pronounced properly. The surface form pronunciation storage unit 202, on the other hand, stores a transcription text in which a human listener has transcribed the speech data strictly as it is actually heard. In the above example, the surface form pronunciations [“shijitsu (in Hiragana)”] and [“mainchi (in Hiragana)”] are stored in correspondence with the base form pronunciations [“shujutsu (in Hiragana)” (surgery)] and [“mainichi (in Hiragana)” (every day)], respectively.
The difference extraction unit 203 compares the base form transcription text with the surface form transcription text and extracts pairs of letter strings for the portions that differ. In the above example, a pair of [“shujutsu (in Hiragana)”] and [“shijitsu (in Hiragana)”] and a pair of [“mainichi (in Hiragana)”] and [“mainchi (in Hiragana)”] are extracted. Hereafter, these pairs are referred to as pronunciation variation examples. A pronunciation variation example in which the base form pronunciation and the surface form pronunciation are the same, namely, one with no deformation, is specifically referred to as an identical pronunciation variation.
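The comparison performed by the difference extraction unit 203 can be illustrated with a minimal sketch. It assumes that both transcription texts are romanized and already split into words by spaces, and it uses Python's standard difflib module as a stand-in for the alignment step, which the cited documents do not specify. The function name extract_variation_examples is hypothetical.

```python
from difflib import SequenceMatcher


def extract_variation_examples(base_text, surface_text):
    """Align two word sequences and return (base, surface) pairs.

    Identical pairs are returned as well, so that a later counting
    step can include the identical pronunciation variations.
    """
    base = base_text.split()
    surface = surface_text.split()
    examples = []
    # autojunk=False keeps difflib from discarding frequent words.
    matcher = SequenceMatcher(a=base, b=surface, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words form identical pronunciation variations.
            examples.extend(zip(base[i1:i2], surface[j1:j2]))
        else:
            # A differing span is kept as one letter string pair.
            examples.append((" ".join(base[i1:i2]), " ".join(surface[j1:j2])))
    return examples


base = "Sono youna shujutsu wo hobo mainichi okonai mashi ta"
surface = "Sono youna shijitsu wo hobo mainchi okonai mashi ta"
examples = extract_variation_examples(base, surface)
changed = [(b, s) for b, s in examples if b != s]
print(changed)  # [('shujutsu', 'shijitsu'), ('mainichi', 'mainchi')]
```

A word-level alignment is only one possible choice here; an alignment over phoneme series would follow the same pattern with a different tokenization.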
The pronunciation variation counter unit 204 receives the pronunciation variation examples from the difference extraction unit 203, classifies them by base form and surface form, and counts the number of observations of each pair, including the identical pronunciation variations. The counts are then normalized and converted into probability values. For example, suppose that [“mainichi (in Hiragana)” (identical pronunciation variation)], [“mainchi (in Hiragana)”], [“maichi (in Hiragana)”], and [“man-ichi (in Hiragana)”] occur as surface form pronunciations corresponding to the base form pronunciation [“mainichi (in Hiragana)”], and that they are observed 966 times, 112 times, 13 times and 2 times, respectively. Since the number of observations of the base form pronunciation [“mainichi (in Hiragana)”] is 966+112+13+2=1093, the converted probability values are as follows:
“mainichi (in Hiragana)”→“mainichi (in Hiragana)”: 0.884 (966/1093);
“mainichi (in Hiragana)”→“mainchi (in Hiragana)”: 0.102 (112/1093);
“mainichi (in Hiragana)”→“maichi (in Hiragana)”: 0.012 (13/1093); and
“mainichi (in Hiragana)”→“man-ichi (in Hiragana)”: 0.002 (2/1093).
These results can be interpreted as a probability rule with regard to appearance tendencies of the surface form pronunciations corresponding to the base form pronunciation [“mainichi (in Hiragana)”]. The pronunciation variation counter unit 204 outputs the above results as a pronunciation variation rule.
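The counting and normalization carried out by the pronunciation variation counter unit 204 can be sketched as follows, assuming the pronunciation variation examples arrive as a list of (base form, surface form) pairs. The function name build_variation_rule is hypothetical; the observation counts are those of the example above.

```python
from collections import Counter, defaultdict


def build_variation_rule(examples):
    """Convert (base, surface) observations into P(surface | base)."""
    counts = Counter(examples)
    # Total number of observations of each base form pronunciation.
    totals = defaultdict(int)
    for (base, _surface), n in counts.items():
        totals[base] += n
    # Normalize each count by the total of its base form.
    return {(b, s): n / totals[b] for (b, s), n in counts.items()}


observed = (
    [("mainichi", "mainichi")] * 966
    + [("mainichi", "mainchi")] * 112
    + [("mainichi", "maichi")] * 13
    + [("mainichi", "man-ichi")] * 2
)
rule = build_variation_rule(observed)
for (b, s), p in rule.items():
    print(f"{b} -> {s}: {p:.3f}")
# mainichi -> mainichi: 0.884
# mainichi -> mainchi: 0.102
# mainichi -> maichi: 0.012
# mainichi -> man-ichi: 0.002
```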
Although the base form pronunciation and the surface form pronunciation are handled in units of words in the above example, it should be noted that they can be handled in another unit, for example, a series of phonemes (the minimum units constituting speech, such as vowels and consonants) having a predetermined length. Also, when the probability values are calculated, an appropriate smoothing operation may be carried out, for example, ignoring minor pronunciation variations whose number of observations is smaller than a predetermined value.
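One way to read the smoothing operation mentioned above is sketched below. Both the threshold of 5 and the decision to renormalize over the remaining variations are assumptions made for illustration, since the cited documents give no specific value or procedure here. The function name prune_and_normalize is hypothetical.

```python
def prune_and_normalize(counts, min_count):
    """Drop rare variations, then renormalize per base form.

    counts maps (base, surface) pairs to observation counts.
    """
    # Ignore variations observed fewer than min_count times.
    kept = {pair: n for pair, n in counts.items() if n >= min_count}
    totals = {}
    for (base, _surface), n in kept.items():
        totals[base] = totals.get(base, 0) + n
    return {(b, s): n / totals[b] for (b, s), n in kept.items()}


counts = {
    ("mainichi", "mainichi"): 966,
    ("mainichi", "mainchi"): 112,
    ("mainichi", "maichi"): 13,
    ("mainichi", "man-ichi"): 2,
}
# min_count=5 is an illustrative threshold, not a value from the source.
pruned_rule = prune_and_normalize(counts, min_count=5)
print(sorted(pruned_rule))
```

With this threshold the variation observed only twice is discarded and the remaining three probabilities are computed over 1091 observations.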
[Ogata and Ariki, “Study of Spontaneous Speech Recognition in Which Pronunciation Deformation and Acoustic Error Trend Are Considered”, Lecture Paper Collection of 2003 Spring Meeting of Acoustical Society of Japan, pp. 9-10, March 2003] and [Ogata, Goto and Asanao, “Study of Dynamic Pronunciation Modeling Method for Spontaneous Speech Recognition”, Lecture Paper Collection of 2004 Spring meeting of Acoustical Society of Japan, pp. 203-204, March 2004] describe another example of a pronunciation variation rule extraction apparatus. As shown in FIG. 2, this pronunciation variation rule extraction apparatus 300 is configured to include a speech data storage unit 301, a base form pronunciation storage unit 302, a syllable dictionary storage unit 303, an acoustic model storage unit 304, a speech recognition unit 305, a difference extraction unit 306 and a pronunciation variation counter unit 307.
The pronunciation variation rule extraction apparatus 300 having such a configuration is operated as follows. That is, the speech recognition unit 305 uses a dictionary stored in the syllable dictionary storage unit 303 and acoustic models stored in the acoustic model storage unit 304 to perform a known continuous syllable recognition process on speech data stored in the speech data storage unit 301, and then outputs a syllable series as the recognition result.
Here, in the case of Japanese, the dictionary stored in the syllable dictionary storage unit 303 is a list of syllables, such as “a, i, u, e, o, ka, ki, ku, ke, ko, (in Hiragana)”, in which each syllable is provided with a pointer to the acoustic model so that the acoustic features of the syllable can be referred to. For another language as well, the dictionary can be configured by defining an appropriate unit for that language. Also, the acoustic model stored in the acoustic model storage unit 304 is a model in which the acoustic features of a predetermined recognition unit, namely, a syllable, a phoneme or the like, are described in accordance with a method such as a known hidden Markov model.
The difference extraction unit 306 receives the recognition result from the speech recognition unit 305 and the transcription text from the base form pronunciation storage unit 302, and extracts the differences between them, namely, the portions that differ. Here, the transcription text stored in the base form pronunciation storage unit 302 is similar to the transcription text stored in the base form pronunciation storage unit 201 in FIG. 1 and is correlated with the speech data stored in the speech data storage unit 301. Namely, what is stored as the transcription text is the pronunciation that would be observed if the content of the speech data in the speech data storage unit 301 were pronounced properly. The pronunciation variation counter unit 307, through an operation similar to that of the pronunciation variation counter unit 204 in FIG. 1, receives pronunciation variation examples from the difference extraction unit 306 and outputs a pronunciation variation rule.
[Onishi, “Extraction of Phonation Deformation and Expansion of Recognition Dictionary in Consideration of Speaker Oriented Property of Recognition Error”, Lecture Paper Collection of 2007 spring meeting of Acoustical Society of Japan, pp. 65-66, March 2007] describes still another example of a pronunciation variation rule extraction apparatus. As shown in FIG. 3, this pronunciation variation rule extraction apparatus 400 is configured to include a speech data storage unit 401, a base form pronunciation storage unit 402, a word language model/dictionary storage unit 403, an acoustic model storage unit 404, a speech recognition unit 405, a difference extraction unit 406 and a pronunciation variation counter unit 407.
The pronunciation variation rule extraction apparatus 400 having such a configuration operates as follows. That is, the speech recognition unit 405 uses a language model and a dictionary stored in the word language model/dictionary storage unit 403 and acoustic models stored in the acoustic model storage unit 404 to perform a known continuous word recognition process on speech data stored in the speech data storage unit 401, and then outputs a word series as the recognition result.
Here, a dictionary and a language model of the kind installed in a typical large vocabulary speech recognition system can be used as the dictionary and the language model stored in the word language model/dictionary storage unit 403. The dictionary includes several tens of thousands of words, each of which is provided with its pronunciation and a pointer to an acoustic model for referring to acoustic features. The language model is based on a known n-gram model, that is, a model which, given a sequence of n−1 words, defines the probability of each word appearing as the next word.
The acoustic model stored in the acoustic model storage unit 404 is, like the acoustic model stored in the acoustic model storage unit 304 in FIG. 2, a model in which the acoustic features of a predetermined recognition unit, namely, a syllable, a phoneme or the like, are described in accordance with a method such as a known hidden Markov model.
The difference extraction unit 406, through an operation similar to that of the difference extraction unit 306 in FIG. 2, receives the recognition result from the speech recognition unit 405 and the transcription text from the base form pronunciation storage unit 402, and extracts the differences between them, namely, the portions that differ. Here, the transcription text stored in the base form pronunciation storage unit 402 is similar to that of the base form pronunciation storage unit 302 in FIG. 2, and is required to be correlated with the speech data stored in the speech data storage unit 401. The pronunciation variation counter unit 407 receives, through an operation similar to that of the pronunciation variation counter unit 204 in FIG. 1 or the pronunciation variation counter unit 307 in FIG. 2, pronunciation variation examples from the difference extraction unit 406 and outputs a pronunciation variation rule.
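The n-gram model described above can be illustrated with a minimal maximum-likelihood sketch. The two-sentence corpus is a made-up toy example assembled from words in the earlier transcription text, the sentence boundary markers are an added convention, and practical large vocabulary systems additionally apply smoothing that is omitted here. The function name train_ngram is hypothetical.

```python
from collections import Counter, defaultdict


def train_ngram(sentences, n=2):
    """Estimate P(next word | previous n-1 words) by relative frequency."""
    context_counts = defaultdict(Counter)
    for words in sentences:
        # Pad with start markers so the first word has a full context.
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            context_counts[context][padded[i]] += 1
    return {
        ctx: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for ctx, nxt in context_counts.items()
    }


# Toy corpus for illustration only; the second sentence is invented.
corpus = [
    "sono youna shujutsu wo hobo mainichi okonai mashi ta".split(),
    "shujutsu wo okonai mashi ta".split(),
]
model = train_ngram(corpus, n=2)
print(model[("wo",)])  # {'hobo': 0.5, 'okonai': 0.5}
```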
A first problem with the pronunciation variation rule extraction apparatuses 200, 300 and 400 described in those five documents is that a large amount of effort is required to obtain the pronunciation variation rule and the pronunciation variation examples on which the rule is based. The reason is that base form pronunciations and the corresponding surface form pronunciations must be prepared in large quantities. In order to acquire a highly reliable pronunciation variation rule, in the case of the pronunciation variation rule extraction apparatus 200 in FIG. 1, the base form pronunciations to be stored in the base form pronunciation storage unit 201 and the surface form pronunciations to be stored in the surface form pronunciation storage unit 202 must be prepared in advance by transcribing a large amount of speech data. However, preparing the base form pronunciations and the surface form pronunciations, in particular the latter, requires a long time and a large effort, because an expert experienced in listening to speech must listen carefully to the speech and transcribe, as a letter string, a surface form pronunciation that is ambiguous and difficult to judge.
A second problem is the difficulty of obtaining a pronunciation variation rule that generalizes well. This is because it is difficult to obtain accurate pronunciation variation examples from speech data of spontaneous speech. For example, in the pronunciation variation rule extraction apparatus 200 in FIG. 1, the surface form pronunciations are transcribed by experts. In general, many experts share the work in order to obtain a large quantity of transcriptions. However, since the pronunciation of speech is inherently ambiguous, the transcriptions strongly reflect the subjectivity of each expert, and discrepancies arise among the transcription results. In the pronunciation variation rule extraction apparatus 300 in FIG. 2, the speech recognition unit can obtain the surface form pronunciations automatically on the basis of a unified standard. However, at the current level of speech recognition technology, it is very difficult to accurately carry out the continuous syllable recognition process, which determines a sequence of syllables without linguistic background knowledge. For example, when continuous syllable recognition is performed on an utterance of [“Hiroshima (in Hiragana)”], a result far from the actual pronunciation variation, such as [“kerusema (in Hiragana)”] or [“karurika (in Hiragana)”], is often obtained. That is, even if continuous syllable recognition is applied, only letter strings that are random and of little use are obtained.
In the pronunciation variation rule extraction apparatus 400 in FIG. 3 as well, although background knowledge such as the word dictionary and the language model is available, the problem of inaccurate speech recognition remains, as in the pronunciation variation rule extraction apparatus 300 in FIG. 2. Moreover, in the pronunciation variation rule extraction apparatus 400 in FIG. 3, since the word dictionary and the language model act as linguistic constraints in the speech recognition process, the obtained pronunciation variation examples are influenced by the word dictionary and the language model. Thus, in general, pronunciation variation examples that differ from the actual pronunciation variation phenomenon are obtained. For example, the phenomenon in which [“sentakuki (in Hiragana)” (washing machine)] changes to [“sentakki (in Hiragana)”] or [“shokupan (in Hiragana)” (pullman loaf)] changes to [“shoppan (in Hiragana)”] is commonly found. However, in the pronunciation variation rule extraction apparatus 400 in FIG. 3, the speech recognition result is obtained only as a combination of words included in the word dictionary. Thus, there is no guarantee that a recognition result corresponding to the pronunciation [“sentakki (in Hiragana)”] is obtained.