In a speech recognition system capable of acquiring new vocabulary, unknown parts of the speech must be estimated and pronunciations must be assigned to those unknown parts.
To estimate the unknown parts, the speech is recognized in units shorter than a word (sub-words), such as phonemes or syllables. A sequence of syllables is assigned to the utterance; that is, readings in Japanese Kana are assigned. Concurrently, a score for each syllable is computed. A score for an out-of-vocabulary (OOV) word is then estimated by appropriately penalizing these scores. In word recognition, since words other than normal word candidates may be unknown words, the above-described scores are used for the words other than the normal word candidates. Thus, if the utterance contains an unknown word and the score for the unknown word lies between that of an incorrect word and that of a correct word, the unknown part is recognized as an unknown word. Subsequently, in order to assign a pronunciation to the unknown part, the above-described sub-word sequence (for example, the syllable sequence from a syllable typewriter) is referenced based on the time information of the unknown part. This allows the syllable sequence assigned to the unknown word to be estimated (for example, refer to “Proceedings of International Conference Spoken Language Processing (ICSLP) 2000” by Issam Bazzi and James R. Glass, October 2000, pp. 433-436 (hereinafter referred to as “Non-Patent Document 1”), “Comparison of Continuous Speech Recognition Systems with Unknown Word Processing for Speech Disfluencies” by Atsuhiko KAI and Seiichi NAKAGAWA, Journal of the Institute of Electronics, Information and Communication Engineers of Japan, Vol. J80-D-II, pp. 2615-2625, October, 1997 (hereinafter referred to as “Non-Patent Document 2”), and “Efficient Decoding Method for OOV word Recognition with Subword Models” by Hiroaki KOKUBO, Shigehiko ONISHI, Hirofumi YAMAMOTO, and Genichiro KIKUI, Journal of the Information Processing Society of Japan, Vol. 43, No. 7, pp. 2082-2090, July, 2002 (hereinafter referred to as “Non-Patent Document 3”)).
Unfortunately, in the case of a syllable search, although a score for a syllable can be acquired, the boundary between words does not necessarily match the boundary between syllables. Such a mismatch between word and syllable boundaries will now be described with reference to FIG. 1.
The times corresponding to the boundaries between words acquired by word sequence search do not necessarily match the times corresponding to boundaries between sub-words acquired by sub-word sequence search. For example, as shown in FIG. 1, when the result of the word recognition is word1<OOV>word2, in terms of boundaries between <OOV> and the adjacent words, the boundaries between the words sometimes do not match the boundaries in the sub-word sequence (i.e. sub-word sequence Sy11 to Sy18). Herein, <OOV> is a symbol representing an unknown word. In FIG. 1, the boundaries before and after <OOV> temporally correspond to halfway points of Sy14 and Sy17, respectively. Accordingly, the sub-words Sy14 and Sy17, which correspond to the mismatched boundaries, are sometimes included in <OOV> and are sometimes excluded from <OOV>. To acquire the pronunciation of <OOV>, the boundaries of the sub-words must be determined.
A method for determining boundaries between sub-words by using sub-word sequences is known. This method, which acquires the pronunciation of <OOV> from sub-word sequences, will now be described with reference to FIG. 2.
In this method, normal speech recognition and recognition by a syllable typewriter are performed first. Then, if a syllable from the syllable typewriter contains the time at either end of <OOV> and 50% or more of its duration is contained in <OOV>, the syllable becomes part of <OOV>.
For example, as shown in FIG. 2, part of a recognized word sequence is “word 1”, <OOV>, and “word 2”. Part of a sub-word sequence from a syllable typewriter is syllable i, syllable j, syllable k. In this case, since L1>L2, where L1 is the time duration of the syllable i corresponding to the word 1 and L2 is the time duration of the syllable i corresponding to <OOV>, it is determined that the syllable i is not included in <OOV>. On the other hand, since L3>L4, where L3 is the time duration of the syllable k corresponding to <OOV> and L4 is the time duration of the syllable k corresponding to the word 2, it is determined that the syllable k is included in <OOV>.
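The 50% rule described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the cited work: the `Segment` type, the function names, and the time values in the usage example are all hypothetical, and times are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A labeled time span (hypothetical type, times in seconds)."""
    label: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def oov_syllables(syllables, oov_start, oov_end):
    """Return labels of syllables assigned to <OOV> under the 50% rule.

    A syllable is kept when at least half of its duration lies inside
    the <OOV> interval. Syllables entirely inside <OOV> overlap 100%
    and are therefore always kept.
    """
    selected = []
    for syl in syllables:
        duration = syl.end - syl.start
        if duration <= 0:
            continue  # skip degenerate segments
        inside = overlap(syl.start, syl.end, oov_start, oov_end)
        if inside / duration >= 0.5:
            selected.append(syl.label)
    return selected


# Hypothetical timings mirroring FIG. 2: syllable i straddles the
# word 1 / <OOV> boundary, syllable k straddles <OOV> / word 2.
syllables = [
    Segment("i", 0.0, 0.4),
    Segment("j", 0.4, 0.8),
    Segment("k", 0.8, 1.2),
]
result = oov_syllables(syllables, oov_start=0.3, oov_end=1.05)
print(result)
```

With these assumed timings, syllable i has only a quarter of its duration inside <OOV> (the L1>L2 case) and is excluded, while syllable k has more than half of its duration inside <OOV> (the L3>L4 case) and is included.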
FIGS. 3 and 4 show the conditions and results of an experiment using the method for acquiring the pronunciation of <OOV> by sub-word sequences.
The experiment applied the method shown in FIG. 2 to 752 types of utterances by 12 people (6 male, 6 female) in a travel application, including utterances for hotel check-in and ordering at a restaurant. The conditions of the feature parameters, the acoustic model, and the language model were set as shown in FIG. 3. The feature parameters were set to 16-bit, 16-kHz sampling, a 10-msec frame period, a 25-msec frame length, 12th-order Mel Frequency Cepstrum Coefficients (MFCC), and first-order regression coefficients of 0th- to 12th-order MFCC (25 dimensions). The acoustic model was a 16-mixture, 1000-tied-state Hidden Markov Model (HMM). The language model was a sub-word trigram with a cut-off of 5 for trigrams and 5 for bigrams. In this experiment, 314 types of syllables and syllable chains were used as sub-words. The language model used was a phoneme trigram trained with a corpus from six years of Nihon Keizai Shimbun articles.
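As a quick check of the feature parameters above, the frame sizes in samples and the feature dimensionality can be worked out as below. The constant names are hypothetical, and the split of the 25 dimensions into 12 static MFCCs plus 13 regression coefficients is one reading that is consistent with the stated total, not something the text spells out.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling
FRAME_PERIOD_MS = 10      # shift between successive frames
FRAME_LENGTH_MS = 25      # analysis window length

# Convert the frame period and frame length from milliseconds to samples.
hop_samples = SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000
window_samples = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000

# Assumed breakdown of the 25 dimensions:
static_dims = 12          # MFCC orders 1..12
delta_dims = 13           # first-order regression of orders 0..12
total_dims = static_dims + delta_dims

print(hop_samples, window_samples, total_dims)  # 160 400 25
```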
FIG. 4 shows the performance in terms of recognition accuracy, substitution error, deletion error, and insertion error of sub-word sequences in percent when the method for acquiring the pronunciation of <OOV> by sub-word sequences shown in FIG. 2 is applied to the sub-word sequences. As used herein, the term “substitution error” refers to an error wherein a correct syllable is substituted by another syllable, the term “deletion error” refers to an error wherein a syllable to be recognized is not recognized at all, and the term “insertion error” refers to an error wherein a syllable not to be recognized appears in the recognition result. The recognition accuracy Acc is determined by the total number of syllables N, the number of correct answers N_C, and the number of insertion errors N_I according to the following equation (1):
Acc = (N_C − N_I) / N  (1)
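Equation (1) can be evaluated directly from the three counts. The sketch below is illustrative only; the function name and the counts in the usage example are hypothetical and are not taken from the experiment.

```python
def recognition_accuracy(n_total, n_correct, n_insertions):
    """Recognition accuracy Acc = (N_C - N_I) / N from equation (1).

    n_total      -- total number of syllables N in the reference
    n_correct    -- number of correctly recognized syllables N_C
    n_insertions -- number of insertion errors N_I
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_correct - n_insertions) / n_total


# Made-up counts: 100 reference syllables, 60 correct, 5 insertions.
acc = recognition_accuracy(n_total=100, n_correct=60, n_insertions=5)
print(f"{acc:.1%}")  # 55.0%
```

Because insertion errors are subtracted in the numerator, the accuracy falls below the plain fraction of correct syllables whenever spurious syllables appear, and it can become negative if insertions outnumber correct answers.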
With reference to FIG. 4, in the method for acquiring the pronunciation of <OOV> by sub-word sequences shown in FIG. 2, the recognition accuracy was 40.2%. The deletion error rate and insertion error rate were 33.3% and 4.1%, respectively.
However, in the method for acquiring the pronunciation of <OOV> by sub-word sequences shown in FIG. 2, continuous word recognition must be performed while taking the boundaries of syllables into account. In addition, as shown in FIG. 4, the recognition accuracy of 40.2% is low and the deletion error rate of 33.3% is high, so users may deem a robot incorporating this continuous speech recognition system to be unintelligent. Furthermore, as shown in FIG. 4, the error rates are unbalanced: the insertion error rate of 4.1% is far lower than the deletion error rate of 33.3%.