Statistical language models, such as N-gram models, are commonly used in converting or translating one language into another. Such a model assigns a probability, Pr(W), to a sequence of words W by means of a probability distribution. Language models of this kind are typically trained on a large body of text (referred to as a corpus) and generally capture the frequency of occurrence of each word and/or each sequence of two or more words within the corpus. Conventionally, each occurrence of a particular word in the corpus is counted during training irrespective of the word's use and/or reading in that particular context. While most words, in the corpus and in general, are each associated with one meaning and possibly more than one correct pronunciation, certain words are written identically but have different meanings and pronunciations/readings (i.e., heteronyms). For example, the English word “desert” is a heteronym: in one context and usage/pronunciation it means “to abandon,” and in another context and usage/pronunciation it means “a dry, barren area of land.” Thus, because a conventional language model accounts for the frequency of the word “desert” without regard to the context of its use in the corpus, any difference between the frequency of use of the word in the first sense (“to abandon”) and in the second sense (“a dry, barren area of land”) is most likely overlooked.
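The conflation described above can be illustrated with a minimal sketch. The toy corpus, the reading tags ("verb", "noun"), and the function name unigram_probs are all assumptions made for illustration; they do not come from any particular system. The first model counts surface forms only, as a conventional unigram model would. The second keeps the reading tag on the token purely to show what information the first one discards.

```python
from collections import Counter

# Hypothetical toy corpus. Each token is (surface form, reading tag);
# the tag is None for words that are not heteronyms.
corpus = [
    ("soldiers", None), ("desert", "verb"), ("the", None), ("army", None),
    ("camels", None), ("cross", None), ("the", None), ("desert", "noun"),
    ("the", None), ("desert", "noun"), ("is", None), ("dry", None),
]

def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities: count / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# Conventional view: the reading is ignored, so both senses share one entry.
surface_only = unigram_probs([word for word, _ in corpus])

# Contrast view (illustrative assumption): the reading is kept in the token.
with_reading = unigram_probs(
    [word if tag is None else f"{word}_{tag}" for word, tag in corpus]
)

print(surface_only["desert"])       # one probability covering both senses
print(with_reading["desert_noun"])  # probability of the noun reading alone
print(with_reading["desert_verb"])  # probability of the verb reading alone
```

In the surface-only model the three occurrences of "desert" collapse into a single probability, so the fact that the noun reading is twice as frequent as the verb reading in this toy corpus cannot be recovered from the model.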
Pinyin is a standard method for transcribing Mandarin Chinese using the Roman alphabet. In a pinyin transliteration, the phonetic pronunciations/readings of Chinese characters are mapped to syllables composed of Roman letters. Pinyin is commonly used to input Chinese characters into a computer via a conversion system, and such a system often incorporates a statistical language model to improve conversion accuracy. Certain Chinese characters have multiple pronunciations/readings (i.e., they are heteronymous Chinese characters). A conventional language model, which does not distinguish between the different pronunciations/readings of a heteronym, can therefore sometimes produce undesirable Chinese conversion candidates for pinyin that is associated with heteronymous Chinese characters.
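A small sketch may help show how such undesirable candidates can arise. It assumes a toy candidate ranker with hypothetical counts; the names reading_counts, rank_conflated, and rank_by_reading are invented for this illustration and do not describe any actual conversion system. The character 长 is heteronymous (read "chang" or "zhang", tones omitted), while 张 is read only as "zhang".

```python
from collections import defaultdict

# Hypothetical corpus counts per (character, toneless pinyin reading).
reading_counts = {
    ("长", "chang"): 80,
    ("长", "zhang"): 10,
    ("张", "zhang"): 50,
}

# Candidate characters for each pinyin input, derived from the table above.
candidates = defaultdict(list)
for (char, pinyin), _ in reading_counts.items():
    candidates[pinyin].append(char)

# Conventional view: one count per character, with all readings merged.
char_counts = defaultdict(int)
for (char, _), n in reading_counts.items():
    char_counts[char] += n

def rank_conflated(pinyin):
    """Rank candidates by total character frequency, ignoring the reading."""
    return sorted(candidates[pinyin], key=lambda c: char_counts[c], reverse=True)

def rank_by_reading(pinyin):
    """Rank candidates by the frequency of the character with this reading."""
    return sorted(
        candidates[pinyin],
        key=lambda c: reading_counts[(c, pinyin)],
        reverse=True,
    )

print(rank_conflated("zhang"))   # 长 ranks first because its "chang" uses are counted too
print(rank_by_reading("zhang"))  # 张 ranks first once readings are kept apart
```

Under these assumed counts, the conflated ranking promotes 长 for the input "zhang" only because of its frequent "chang" reading, which is the kind of undesirable candidate the paragraph above refers to.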