1. Field of the Invention
The present invention relates generally to a speech processing apparatus, a speech processing method, a program, and a recording medium, and, in particular, to a speech processing apparatus, a speech processing method, a program, and a recording medium for preventing an erroneous unknown word from being acquired.
2. Description of the Related Art
To acquire an unknown word in a continuous speech recognition system having a function to acquire a new word, such as a name, that is, an unknown word not contained in a dictionary, the system needs to estimate the duration of the unknown word in the utterance and assign a pronunciation (reading) to the unknown word.
To estimate the duration of the unknown word in the utterance, the system performs speech recognition based on units shorter than a word (sub-words), such as a phoneme, a syllable, or another phonological unit. A sequence of syllables is assigned to the utterance, that is, readings in Japanese Kana are assigned to the utterance, so as to acquire a score for each syllable. By appropriately penalizing the scores, a score for an out-of-vocabulary (OOV) word is then estimated. If the score for the OOV word in a certain duration is higher than the score of a word contained in a dictionary, the utterance in that duration is recognized as an unknown word. The pronunciation of an unknown word is represented by a sub-word sequence (e.g., a syllable sequence) in the duration of the unknown word (refer to, for example, “Proceedings of International Conference on Spoken Language Processing (ICSLP) 2000” by Issam Bazzi and James R. Glass, October 2000, pp. 433-436, “Comparison of Continuous Speech Recognition Systems with Unknown Word Processing for Speech Disfluencies” by Atsuhiko KAI and Seiichi NAKAGAWA, Journal of the Institute of Electronics, Information and Communication Engineers of Japan, Vol. J80-D-II, pp. 2615-2625, October, 1997, and “Efficient Decoding Method for OOV word Recognition with Subword Models” by Hiroaki KOKUBO, Shigehiko ONISHI, Hirofumi YAMAMOTO, and Genichiro KIKUI, Journal of the Information Processing Society of Japan, Vol. 43, No. 7, pp. 2082-2090, July, 2002).
Unfortunately, when a speech recognition process is performed in units of syllables to estimate the duration of an unknown word, the boundaries between words do not necessarily match the boundaries between syllables.
Such a mismatch between word and syllable boundaries, that is, a mismatch between boundaries of a word sequence and a sub-word sequence is described next with reference to FIG. 1.
For example, as shown in FIG. 1, when the result of the word speech recognition is “word1”<OOV>“word2”, in terms of boundaries between <OOV> and the adjacent words, the boundaries between the words sometimes do not match the boundaries in the sub-word sequence (i.e., sub-word sequence Sy11 to Sy18). As used herein, <OOV> is a symbol representing an unknown word. “word1” and “word2” are words contained in a dictionary (i.e., known words).
In the example shown in FIG. 1, the earlier boundary of <OOV> temporally corresponds to the halfway point of Sy14, and the later boundary of <OOV> temporally corresponds to the halfway point of Sy17. Accordingly, the sub-words Sy14 and Sy17, which correspond to the mismatched boundaries, are sometimes included in <OOV> and are sometimes excluded from <OOV>. To acquire the pronunciation of <OOV>, it is desirable that the boundaries of the sub-words be determined.
A method for acquiring the pronunciation of <OOV> by determining the boundaries of a sub-word (i.e., the boundaries of duration of an unknown word) is known as the method for acquiring the pronunciation of <OOV> by use of sub-word sequences.
The method for acquiring the pronunciation of <OOV> by use of sub-word sequences is described next with reference to FIG. 2.
In the method for acquiring the pronunciation of <OOV> by use of sub-word sequences, if 50% or more of the duration of a syllable containing either boundary of <OOV> is contained in <OOV>, the syllable is considered to be part of <OOV>.
For example, as shown in FIG. 2, part of a normally recognized word sequence is “word1”, <OOV>, and “word2”. Part of a sub-word sequence from a phonetic typewriter is syllable i, syllable j, and syllable k. In this case, since L1 > L2, where L1 is the time duration of syllable i corresponding to word1 and L2 is the time duration of syllable i corresponding to <OOV>, it is determined that syllable i is not included in <OOV>. On the other hand, when considering a duration L3+L4 of syllable k containing the temporally later boundary of <OOV>, since L3 > L4, where L3 is the time duration of syllable k corresponding to <OOV> and L4 is the time duration of syllable k corresponding to word2, it is determined that syllable k is included in <OOV>.
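The 50% rule described above can be illustrated with a minimal Python sketch. It is not taken from the cited method's implementation; the function names, the (label, start, end) tuple layout, and the use of integer milliseconds for time stamps are assumptions made only for illustration.

```python
# Illustrative sketch of the 50% boundary rule (names and data layout are assumed).
# Times are integer milliseconds so that the 50% comparison is exact.

def overlap_ms(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals, never negative."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def syllable_in_oov(syl_start, syl_end, oov_start, oov_end):
    """True if 50% or more of the syllable's duration lies inside <OOV>."""
    duration = syl_end - syl_start
    if duration <= 0:
        raise ValueError("syllable must have positive duration")
    inside = overlap_ms(syl_start, syl_end, oov_start, oov_end)
    return 2 * inside >= duration

def oov_pronunciation(syllables, oov_start, oov_end):
    """Labels of the syllables assigned to <OOV>, in temporal order."""
    return [label for label, start, end in syllables
            if syllable_in_oov(start, end, oov_start, oov_end)]

# Hypothetical timing that mirrors FIG. 2: <OOV> spans 1000 ms to 2000 ms.
syllables = [
    ("i", 800, 1100),   # L1 = 200 ms in word1, L2 = 100 ms in <OOV>
    ("j", 1100, 1800),  # entirely inside <OOV>
    ("k", 1800, 2100),  # L3 = 200 ms in <OOV>, L4 = 100 ms in word2
]
result = oov_pronunciation(syllables, 1000, 2000)
```

With this hypothetical timing, syllable i is excluded because L1 > L2, and syllable k is included because L3 > L4, so the pronunciation assigned to <OOV> consists of syllables j and k.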
FIG. 3 shows an experimental result of the method for acquiring the pronunciation of <OOV> by use of sub-word sequences shown in FIG. 2.
An experiment using the method for acquiring the pronunciation of <OOV> by use of sub-word sequences shown in FIG. 2 was performed on 752 types of utterances by 12 people (6 male and 6 female) in a travel application, including utterances for hotel check-in and ordering at a restaurant. The conditions of feature parameters, an acoustic model, and a language model were set as shown in FIG. 4. The feature parameters were set to 16-bit and 16-kHz speech sampling, a 10-msec frame period, a 25-msec frame length, 12th-order Mel Frequency Cepstrum Coefficients (MFCC), and first-order regression coefficients of the 0th- to 12th-order MFCC (25 dimensions). The acoustic model was a 16-mixture and 1000 tied-state Hidden Markov Model (HMM). The language model was a sub-word trigram with a cut-off of 5 for trigrams and 5 for bigrams. In this experiment, 314 types of syllables and syllable chains were used as sub-words. The language model used was a phoneme trigram trained with a corpus from six years of NIKKEI Shimbun (Nihon Keizai Shimbun) articles.
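As a rough check on the feature parameters listed above, the following Python sketch converts the frame settings into sample counts and tallies the feature dimensions. The variable names and the `num_frames` helper are hypothetical, and the frame-count formula assumes that only complete frames are kept, which the source does not state.

```python
# Sketch: derive frame sizes and feature dimensionality from the stated settings.
SAMPLE_RATE_HZ = 16_000   # 16-kHz sampling
FRAME_LENGTH_MS = 25      # 25-msec frame length
FRAME_PERIOD_MS = 10      # 10-msec frame period (shift)

frame_length_samples = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000
frame_shift_samples = SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000

# 12 MFCCs (1st to 12th order) plus 13 regression coefficients (0th to 12th order).
num_static = 12
num_regression = 13
feature_dims = num_static + num_regression

def num_frames(num_samples):
    """Number of complete frames in a signal (assumes no padding)."""
    if num_samples < frame_length_samples:
        return 0
    return 1 + (num_samples - frame_length_samples) // frame_shift_samples

frames_in_one_second = num_frames(SAMPLE_RATE_HZ)
```

Under these assumptions, each frame covers 400 samples, consecutive frames are shifted by 160 samples, and the 12 static coefficients plus 13 regression coefficients account for the 25 dimensions stated above.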
FIG. 3 shows recognition accuracy, substitution error, deletion error, and insertion error of sub-word sequences in percent when acquiring the pronunciation of <OOV> using the method for acquiring the pronunciation of <OOV> by use of sub-word sequences shown in FIG. 2. As used herein, the term “substitution error” refers to an error wherein a correct syllable is substituted by another syllable, the term “deletion error” refers to an error wherein a syllable to be recognized is not recognized at all, and the term “insertion error” refers to an error wherein a syllable not to be recognized appears in the recognition result. The recognition accuracy Acc is determined by the total number of syllables N, the number of correct answers N_C, and the number of insertion errors N_I according to the following equation: Acc=(N_C−N_I)/N.
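The accuracy equation above can be written as a short Python function. This is only a sketch of the stated formula; the function name and the example counts are hypothetical and are not data from the experiment.

```python
# Sketch of Acc = (N_C - N_I) / N with hypothetical counts.

def recognition_accuracy(n_total, n_correct, n_insert):
    """Recognition accuracy from the total number of syllables N,
    the number of correct answers N_C, and the number of insertion errors N_I."""
    if n_total <= 0:
        raise ValueError("total number of syllables must be positive")
    return (n_correct - n_insert) / n_total

# Hypothetical example: 10 reference syllables, 7 correct, 1 inserted.
acc = recognition_accuracy(10, 7, 1)
```

Because insertion errors are subtracted from the number of correct answers, the accuracy can become negative when many spurious syllables are inserted.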
As shown in FIG. 3, in the method for acquiring the pronunciation of <OOV> by use of sub-word sequences shown in FIG. 2, the recognition accuracy was 40.2%. The substitution error rate, deletion error rate, and insertion error rate were 22.4%, 33.3%, and 4.1%, respectively.
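The reported figures are consistent with the accuracy equation under one common assumption that the source does not state explicitly: every syllable to be recognized is either correct, substituted, or deleted, so that N = N_C + N_S + N_D. The symbols N_S and N_D (the numbers of substitution and deletion errors) are introduced here only for this check.

```latex
\[
\mathrm{Acc} = \frac{N_C - N_I}{N}
             = 1 - \frac{N_S}{N} - \frac{N_D}{N} - \frac{N_I}{N}
             = 100\% - 22.4\% - 33.3\% - 4.1\% = 40.2\%
\]
```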