Language models can be implemented in automatic speech recognition (ASR) to predict the most probable current word (w) given one or more history words (h). Conventionally, statistical language models, such as n-gram language models, are applied in ASR. Statistical language models estimate conditional probabilities (e.g., the probability of the current word given the one or more history words, P(w|h)) from training data, such as corpora of text. To achieve high recognition accuracy, the history typically spans two to four words (e.g., 3-gram to 5-gram models). Because the amount of language training data used in modern ASR systems is very large, the number of n-grams in an n-gram language model can also be very large. Large numbers of n-grams pose memory and speed problems in run-time ASR systems. Techniques such as pruning and count cut-offs have been implemented to control the actual number of n-grams in an n-gram language model. However, pruning and cut-offs can reduce the accuracy of speech recognition.
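The estimation of P(w|h) from counts, and the count cut-off mentioned above, can be sketched as follows. This is a minimal illustration, not an implementation from the source: the function names (`ngram_counts`, `prob`, `apply_cutoff`), the maximum-likelihood estimate count(h, w) / count(h), and the toy corpus are all assumptions chosen for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams and their (n-1)-word histories in a token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hists = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return grams, hists

def prob(word, history, grams, hists):
    """Maximum-likelihood estimate P(w | h) = count(h, w) / count(h)."""
    h = tuple(history)
    if hists[h] == 0:
        return 0.0
    return grams[h + (word,)] / hists[h]

def apply_cutoff(grams, min_count=2):
    """Count cut-off: drop n-grams seen fewer than min_count times."""
    return Counter({g: c for g, c in grams.items() if c >= min_count})
```

For example, with bigrams (n = 2) over the tokens `"the cat sat on the mat the cat ran"`, `prob("cat", ["the"], ...)` yields 2/3, since "the" occurs three times and is followed by "cat" twice; a cut-off of 2 would then keep only the bigram ("the", "cat"), trading some modeling accuracy for a smaller model.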