During speech recognition, a speech signal is decoded to identify the text that the speech signal represents. In particular, decoding involves identifying a sequence of speech units from the frames of the speech signal. Speech units of various sizes have been used in the art, including words, syllables, and phones. In principle, larger units such as words lead to more reliable speech recognition than smaller units such as phones because the larger units place greater restrictions on the possible sequences of speech units that may be identified from the speech signal. For example, speech recognition performed at the word level will not produce words that are not found in the language. Speech recognition performed at the phone level, however, could produce a sequence of phones that does not represent any word in the language.
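By way of illustration only, the restriction that word-level units place on the output can be sketched as a check of whether a phone sequence can be split into words of a lexicon. The lexicon, the phone symbols, and the function name `segment_into_words` below are hypothetical and chosen for the example; they are not drawn from any particular recognizer.

```python
# Hypothetical toy lexicon: word -> phone sequence (symbols are illustrative).
LEXICON = {
    "cat": ("k", "ae", "t"),
    "at": ("ae", "t"),
    "tack": ("t", "ae", "k"),
}


def segment_into_words(phones, lexicon=LEXICON):
    """Return one split of `phones` into lexicon words, or None if none exists."""
    phones = tuple(phones)
    if not phones:
        return []  # nothing left to cover: a valid (empty) segmentation
    for word, pron in lexicon.items():
        n = len(pron)
        if phones[:n] == pron:
            rest = segment_into_words(phones[n:], lexicon)
            if rest is not None:
                return [word] + rest
    return None  # no lexicon word matches at this position


# A word-level decoder can only emit sequences that pass this check.
print(segment_into_words(["k", "ae", "t", "ae", "t"]))  # ['cat', 'at']
# A phone-level decoder may emit this sequence, which covers no lexicon word.
print(segment_into_words(["k", "t", "ae"]))  # None
```

In this sketch the second phone sequence is a possible phone-level output but can never be a word-level output, which is the restriction described above.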
Although larger units lead to better reliability, they are also adversely affected by speech signals that include words not present in the lexicon, known as out-of-vocabulary words. When an out-of-vocabulary word is in the speech signal, a word-based speech recognition system is forced to identify another word in place of the correct out-of-vocabulary word, resulting in a recognition error. Generally, if 1% of all words in a language are out-of-vocabulary, there will be a 2-3% increase in the word error rate. Phone-level speech recognition, on the other hand, is able to properly decode the phone sequences of words even if the words are not found in a lexicon.
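The forced substitution described above can be sketched as follows. This is a simplified illustration, not an actual decoder: it assumes the word-level system picks the lexicon word whose phone sequence is closest, by edit distance, to the phones that were spoken. The lexicon, the phone symbols, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical lexicon that does not contain the spoken word "cab".
LEXICON = {
    "cat": ("k", "ae", "t"),
    "cap": ("k", "ae", "p"),
    "bat": ("b", "ae", "t"),
}


def nearest_word(phones, lexicon=LEXICON):
    """Word-level stand-in: always returns some in-vocabulary word."""
    return min(lexicon, key=lambda w: edit_distance(phones, lexicon[w]))


spoken = ("k", "ae", "b")    # phones of the out-of-vocabulary word "cab"
print(nearest_word(spoken))  # an in-vocabulary word, so necessarily not "cab"
print(spoken)                # a phone-level output can keep the spoken phones
```

Whatever word the word-level stand-in returns, it cannot be the out-of-vocabulary word itself, so a recognition error is unavoidable, while the phone-level output is not limited in this way.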
Syllables provide a middle ground between the flexibility provided by phone-level speech recognition and the reliability provided by word-level recognition. One difficulty in adopting syllables as speech recognition units is that the set of syllables for some languages is quite large. For example, English has more than 20,000 syllables. Moreover, it is difficult to enumerate all of the legal syllables of a language from a specific corpus. Thus, syllables can suffer from the same out-of-vocabulary problem that affects word-based speech recognition.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.