The present invention relates to speech recognition. In particular, the present invention relates to improving new-word pronunciation by combining speech and text-based phonetic descriptions to generate a pronunciation.
In speech recognition, human speech is converted into text. To perform this conversion, the speech recognition system identifies a most-likely sequence of acoustic units that could have produced the speech signal. To reduce the number of computations that must be performed, most systems limit this search to sequences of acoustic units that represent words in the language of interest.
The mapping between sequences of acoustic units and words is stored in at least one lexicon (sometimes referred to as a dictionary). Regardless of the size of the lexicon, some words in the speech signal will be outside of the lexicon. These out-of-vocabulary (OOV) words cannot be recognized by the speech recognition system because the system does not know they exist. For example, sometimes during dictation, a user will find that a dictated word is not recognized by the system. This can occur because the system has a different pronunciation defined for a particular word than the user's pronunciation; for example, the user may pronounce the word with a foreign accent. Sometimes, the word is not in the vocabulary at all. In that case, the recognition system is forced to recognize other words in place of the out-of-vocabulary word, resulting in recognition errors.
In a past speech recognition system, a user can add a word that was not recognized by the speech recognition system by providing the spelling of a word and an acoustic sample or pronunciation of the word with the user's voice.
The spelling of the word is converted into a set of phonetic descriptions using letter-to-sound rules. The input word is stored as the only entry of a Context Free Grammar (CFG). Each phonetic description is then scored by applying the acoustic sample to acoustic models of the phones in that phonetic description. The total score for each of the phonetic descriptions includes a language model score. In a CFG, the language model probability is equal to one over the number of branches at each node in the CFG. However, since the input word is the only entry in the CFG, there is only one branch from the start node (and the only other node in the CFG is the end node). As a result, any phonetic description from the letter-to-sound rules always has a language model probability of 1.
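The language model probability described above can be sketched as follows. This is an illustrative sketch only, assuming every branch out of a CFG node is equally likely; the function name `cfg_path_probability` and the branch counts are hypothetical and do not come from the prior system.

```python
from fractions import Fraction


def cfg_path_probability(branch_counts):
    """Probability of one path through a CFG, assuming each node's
    outgoing branches are equally likely (1 / number of branches).

    branch_counts lists the number of branches at each node on the path.
    """
    probability = Fraction(1)
    for count in branch_counts:
        if count < 1:
            raise ValueError("each node on the path needs at least one branch")
        probability *= Fraction(1, count)
    return probability


# A CFG whose only entry is the new word: one branch from the start node.
single_entry = cfg_path_probability([1])

# For contrast, a hypothetical CFG with three alternatives at the start node.
three_entries = cfg_path_probability([3])
```

With a single entry, the probability is 1 regardless of which phonetic description the letter-to-sound rules produce, so the language model contributes no penalty to that path.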
In a separate decoding path, the acoustic sample is converted into a phonetic description by identifying a sequence of syllable-like units that provide the best combined acoustic and language model score based on acoustic models for the phones in the syllable-like units and a syllable-like unit n-gram language model.
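As a rough illustration of the language model part of this path, the sketch below scores one sequence of syllable-like units with a bigram model. The order of the n-gram, the unit inventory, and every probability value here are assumptions made for the example; the source does not specify them.

```python
import math


def bigram_log_probability(units, bigram_probs, start="<s>", end="</s>"):
    """Log probability of a unit sequence under a bigram language model.

    bigram_probs maps (previous_unit, current_unit) to a probability.
    """
    padded = [start, *units, end]
    total = 0.0
    for previous, current in zip(padded, padded[1:]):
        total += math.log(bigram_probs[(previous, current)])
    return total


# Hypothetical syllable-like units (written as phone strings) for "xml".
toy_bigrams = {
    ("<s>", "eh k s"): 0.1,
    ("eh k s", "eh m"): 0.2,
    ("eh m", "eh l"): 0.25,
    ("eh l", "</s>"): 0.5,
}

log_p = bigram_log_probability(["eh k s", "eh m", "eh l"], toy_bigrams)
```

Because each factor is below 1, the product shrinks with every additional unit, which is why the language model probability on this path is typically far smaller than 1.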
The score for the phonetic sequence identified through the letter-to-sound CFG and the score for the most likely sequence of syllable-like units identified through the syllable-like unit n-gram decoding are then compared. The phonetic sequence with the highest score is then selected as the phonetic sequence for the word.
Thus, under this prior art system, the letter-to-sound decoding and the syllable-like unit decoding are performed in two separate parallel paths. This has been less than ideal for a number of reasons.
First, because the two paths do not use a common language model, the scores between the two paths cannot always be meaningfully compared. In particular, since the language model for the CFG always provides a probability of 1, the score for the letter-to-sound phonetic description will usually be higher than the score for the syllable-like unit description, which relies on an n-gram language model whose probability is usually significantly less than 1. (The language model probability for the syllable-like units is of the order of 10⁻⁴.)
Because of this, the prior art system tends to favor the phonetic sequence from the letter-to-sound rules even when the acoustic sample is better matched to the phonetic description from the syllable-like unit path.
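The bias can be seen in a small numerical sketch. It assumes the total score is the sum of an acoustic log-likelihood and the log of the language model probability; the acoustic values are hypothetical, and only the language model probabilities (1 for the CFG path, on the order of 10⁻⁴ for the syllable-like unit path) follow the description above.

```python
import math


def total_log_score(acoustic_log_likelihood, lm_probability):
    """Combined score in the log domain: acoustic term plus language model term."""
    return acoustic_log_likelihood + math.log(lm_probability)


# Hypothetical acoustic scores: the syllable-like path matches the sample better.
lts_acoustic = -105.0
syllable_acoustic = -100.0

lts_total = total_log_score(lts_acoustic, 1.0)             # CFG probability of 1
syllable_total = total_log_score(syllable_acoustic, 1e-4)  # n-gram probability

# Selection as in the two-path comparison: the highest total score wins.
selected = "letter-to-sound" if lts_total >= syllable_total else "syllable-like"
```

In this example the letter-to-sound description is selected even though its acoustic match is worse, because the syllable-like path alone carries a language model penalty of roughly 9.2 in the log domain.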
The second accuracy problem occurs with generating pronunciations for combination words such as “voicexml”. It is important to note that the CFG path and the n-gram syllable path are independent of each other in the prior art system. Thus, a combination word like “voicexml” can result in pronunciation errors because the selected pronunciation must be either the CFG pronunciation or the n-gram syllable pronunciation. However, letter-to-sound (LTS) rules used with a CFG engine tend to perform well on relatively predictable words, like “voice”, but poorly on unpredictable words like “xml”, where the correct pronunciation is almost unrelated to how the word is spelled.
In contrast, the n-gram syllable model generally performs reasonably well in generating a pronunciation for words like “xml” because it attempts to capture any sequence of sounds or syllables in the acoustic sample, regardless of the spelling. However, it does not perform as well as a CFG engine for a predictable word like “voice”.
For these reasons, pronunciation errors can result from combination words that combine, for example, a predictable word with an acronym such as “voicexml” if the phonetic descriptions from the two decoding systems are evaluated in two separate paths.
A speech recognition system for improving pronunciation of combination words such as “voicexml” would have significant utility.