The following relates to processing sequences and finds particular application in systems and methods for pruning a set of sequences, such as a set of n-grams and associated statistics for a language model.
Language modeling is widely used in Natural Language Processing (NLP) for scoring a sentence with respect to a language or domain. Both character-based and word-based language models have been used. Character-based language models can avoid the problem of out-of-vocabulary words that is faced when using word-based language models, and are also language independent, avoiding the need for language-specific stemmers and tokenizers. While such models may be based on recurrent neural networks (RNNs), n-gram models are often used, due to their simplicity, the small number of hyper-parameters to be tuned, and their speed. One problem with n-gram models is that the size of the language model can be unwieldy. This is particularly problematic when deploying language models on computers with less powerful hardware, such as smartphones. Thus, attempts are often made to reduce the size of the model, either by pruning some of the n-grams and their associated corpus statistics from the model (a lossy approach) or by finding more efficient data structures in which to store them (a lossless approach).
Existing pruning methods often prune n-grams that are considered uninformative or rarely used, for example, by removing n-grams that occur fewer than a predetermined number of times in the training data. More sophisticated methods assign a score to each n-gram, depending on the expected decrease in model performance when that n-gram is removed. Scoring functions used in such methods include probability pruning (Gao, Jianfeng, et al., “Improving language model size reduction using better pruning criteria,” Proc. 40th Annual Meeting on Association for Computational Linguistics, pp. 176-182, 2002, hereinafter “Gao”) and entropy pruning (Stolcke, “Entropy-based pruning of backoff language models,” arXiv preprint cs/0006025, 2000, hereinafter “Stolcke 2000”). In these methods, the score for each n-gram is assigned independently: the score for removing one n-gram does not take into account that another n-gram may also be removed. This can lead to a sub-optimal choice, since the n-grams in a set may each have little impact when removed individually, yet their collective removal can degrade the model.
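By way of illustration only, the simplest of the existing approaches described above, count-cutoff pruning, may be sketched as follows for a character-based n-gram model. The function names `count_char_ngrams` and `prune_by_count`, the example string, the threshold value, and the choice to always retain unigrams are assumptions made for this sketch and are not taken from the cited references. Note that each n-gram is kept or dropped without regard to which other n-grams are dropped, which is the independence limitation discussed above.

```python
from collections import Counter


def count_char_ngrams(text, max_order):
    """Count every character n-gram of order 1..max_order in `text`."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for start in range(len(text) - order + 1):
            counts[text[start:start + order]] += 1
    return counts


def prune_by_count(counts, min_count):
    """Drop n-grams seen fewer than `min_count` times.

    Unigrams are always kept (an assumption of this sketch) so that
    every observed character remains represented in the model.
    Each n-gram is judged independently of all the others.
    """
    return {
        gram: count
        for gram, count in counts.items()
        if len(gram) == 1 or count >= min_count
    }


# Hypothetical toy corpus and threshold, chosen only for illustration.
counts = count_char_ngrams("abracadabra", max_order=3)
pruned = prune_by_count(counts, min_count=2)
```

In this toy example, the trigram "abr" occurs twice and is retained, whereas "rac" occurs once and is removed; the decision for "rac" is made without considering that neighboring n-grams such as "aca" are removed at the same time.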
There remains a need for a system and method that provides an improvement in language model pruning.