This invention relates to multi-lingual speech recognition with context modeling.
Speech recognition systems have been developed to recognize words and longer utterances in a number of languages. Many current speech recognizers make use of phonetic sub-word units to represent words, and statistical parameters that are associated with those sub-word units are estimated from training speech. Speech recognizers that are tailored for particular languages typically make use of an underlying set of sub-word units that is most appropriate to that language. These sets of sub-word units often differ significantly from one language to another. For example, a set of phonetic units that would typically be used for English can be very different from a set of syllable-based units that may be used for Chinese. Not only are the units different, but the distinctions between units may also be based on features unique to the language. For example, Chinese units may differ according to their “tone” (a function of the pitch contour) while English units would not typically address differences in tone.
One problem that has been addressed is the transfer of statistical information obtained from data in one language to enable or improve speech recognition in another language. For example, it may be desirable to use training data from one language to configure a speech recognizer in another language. However, speech recognizers that are developed for different languages typically use different sets of sub-word units, each set being appropriate for its own language. One proposed solution to this problem is to train speech recognizers using a universal set of sub-word units. For example, multi-lingual speech recognizers have been developed in which words of all the supported languages are represented in the International Phonetic Alphabet (IPA), or one of a number of similar multi-language sets of sub-word units (e.g., WorldBet, SAMPA).
In both single-language and multi-language speech recognition, an important technique for improving the accuracy of speech recognizers is to use context-dependent sub-word units. Statistical parameters for a context-dependent sub-word unit depend on the adjacent units within a word and, at word boundaries, on the adjacent units in neighboring words (cross-word context). One approach to selecting context-dependent units is to use a decision tree to identify variants of a unit that depend on the adjacent context. In general, the decision tree uses the identities or characteristics of the adjacent units to select the appropriate context-dependent variant.
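By way of illustration only, the decision-tree selection described above may be sketched as follows. The sketch is a minimal example under stated assumptions rather than a description of any particular recognizer: the unit names, the phone classes `NASALS` and `VOWELS`, the variant labels `t_1` to `t_3`, the `SILENCE` boundary symbol, and the hand-built tree are all hypothetical. In practice such a tree would typically be estimated from training speech rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical phone classes and boundary symbol, for illustration only.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}
SILENCE = "sil"  # assumed context at the start and end of an utterance


@dataclass
class Leaf:
    variant: str  # label of the selected context-dependent variant


@dataclass
class Node:
    # A question about the (left, right) adjacent units.
    question: Callable[[str, str], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def select_variant(tree: Union[Node, Leaf], left: str, right: str) -> str:
    """Descend the tree by answering questions about the adjacent units."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node.variant


# Toy tree for the unit "t": first ask about the left context, then the right.
T_TREE = Node(
    question=lambda left, right: left in NASALS,
    yes=Leaf("t_1"),
    no=Node(
        question=lambda left, right: right in VOWELS,
        yes=Leaf("t_2"),
        no=Leaf("t_3"),
    ),
)


def variants_for_utterance(
    words: List[List[str]], trees: Dict[str, Union[Node, Leaf]]
) -> List[str]:
    """Select a variant for every unit, using cross-word context.

    Units are flattened across words, so a unit at a word boundary sees the
    adjacent unit of the neighboring word as its context.
    """
    units = [unit for word in words for unit in word]
    selected = []
    for i, unit in enumerate(units):
        left = units[i - 1] if i > 0 else SILENCE
        right = units[i + 1] if i + 1 < len(units) else SILENCE
        tree = trees.get(unit)
        # Units without a tree are left context-independent in this sketch.
        selected.append(select_variant(tree, left, right) if tree else unit)
    return selected
```

For example, `variants_for_utterance([["t"], ["aa"]], {"t": T_TREE})` selects the variant of "t" using the vowel that begins the following word, which is the cross-word case. The questions here test membership in a phone class, corresponding to the "characteristics" of adjacent units, but a question could equally test the identity of a single unit.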