This specification relates to language models.
Language models are used to model the probability that a string of tokens (e.g., words or characters) in a given vocabulary will appear in a language. For example, language models are used in input methods such as, but not limited to, input method editor (IME), automatic speech recognition (ASR), machine translation, handwriting recognition, and optical character recognition (OCR) applications. Modeling the probability of a string of tokens in the vocabulary is typically performed using the chain rule: the probability of each given token w in its string context, p(w|context), is calculated, where the context is the tokens in the string preceding w, and the probability of the string is the product of these conditional probabilities.
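The chain rule referred to above can be written out as follows. The symbols w_1, ..., w_m (the tokens of a string of length m) are notation assumed here for illustration and do not appear in the original text.

```latex
p(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1})
```

Each factor is the probability of one token given all tokens that precede it; for i = 1 the context is empty.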
In an n-gram language model, n consecutive tokens in text are formed into n-grams, and the probability of a current word $z_i$, for example, is conditioned on the n-1 preceding words, e.g., $p(z_i \mid \text{context}) = p(z_i \mid z_{i-n+1}, z_{i-n+2}, \ldots, z_{i-1})$. An n-gram has an order, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bi-gram) includes two tokens.
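As a rough illustration of the n-gram idea, the following Python sketch forms n-grams from a token list and estimates a conditional probability by relative frequency (count of the n-gram divided by the count of all n-grams that share its context). The function names, the toy corpus, and the choice of an unsmoothed relative-frequency estimate are assumptions made for this example; practical language models typically add smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as tuples) formed from n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_probability(tokens, context, word):
    """Estimate p(word | context) by relative frequency, without smoothing.

    The order n is len(context) + 1, so the context holds the n-1
    preceding tokens.
    """
    context = tuple(context)
    n = len(context) + 1
    counts = Counter(ngrams(tokens, n))
    # Count every n-gram whose first n-1 tokens equal the context.
    context_count = sum(c for gram, c in counts.items() if gram[:-1] == context)
    if context_count == 0:
        return 0.0  # context never observed in the toy corpus
    return counts[context + (word,)] / context_count

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
p = ngram_probability(corpus, ["the"], "cat")  # bigram estimate of p(cat | the)
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so the bigram estimate of p(cat | the) is 2/3.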
The probabilistic distribution of n-grams in text (e.g., words in a sentence) largely depends on context, where context can also be understood in a more general sense. For example, the probabilistic distribution of particular n-grams in text can depend on the topic to be expressed by the text, or the domain in which the text occurs. The probability of “basketball” occurring in a sports article is greater than the probability of “basketball” occurring in a financial article. In addition, different users may use (e.g., favor) different words to express the same idea. For example, users in Spain may use “football”, while users in the United States may use “soccer”. Therefore, the probabilistic distribution of n-grams in text can be both user-dependent and domain-dependent.
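The domain dependence described above can be sketched with unigram relative frequencies computed separately per domain. The two short texts below are invented toy data, and the function name is hypothetical; the sketch only illustrates that the same word can receive different probabilities under models built from different domains.

```python
from collections import Counter

def unigram_probability(text, word):
    """Relative frequency of `word` among the whitespace-separated tokens of `text`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)

# Invented example texts standing in for two domains.
sports_text = "the basketball team won the basketball game"
finance_text = "the bank raised the interest rate again"

p_sports = unigram_probability(sports_text, "basketball")
p_finance = unigram_probability(finance_text, "basketball")
```

With these toy texts, p_sports is 2/7 and p_finance is 0, mirroring the sports-versus-financial article example.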
Conventional input methods use general language models. For example, a single language model may be used for all users. As another example, the same language model may be used for all domains (e.g., domains of a computer system, geographical domains), or may be generated from training data covering all domains. Such a general language model may not be optimized for every input method use.