1. Field of the Disclosure
The present disclosure relates to data structures for storing information for document collections and, in particular, relates to systems and methods for populating and accessing such data structures for word n-gram collections to provide for efficient memory use and enhanced speed of use.
2. Background Information
N-gram language models (LMs) (or n-gram models) for word-level n-grams (i.e., n-grams of words, or word n-grams) have attracted increasing interest as a resource for data-driven algorithms in many areas of computational linguistics. Along with their traditional use in speech recognition, n-gram models have been embraced for a variety of text-processing tasks, such as statistical machine translation, text tagging, spelling correction, named entity recognition, word sense disambiguation, lexical substitution, etc. An n-gram of words is a sequence of n words, where n is an integer. For instance, a single word (e.g., the) is a word 1-gram, a sequence of two words (e.g., the dog) is a word 2-gram, a sequence of three words (e.g., the dog barked) is a word 3-gram, etc. An n-gram model as referred to herein is a language model that includes a collection of specified word n-grams (or word n-gram collection) and that provides some statistical information about the word n-grams, e.g., the count for a given word n-gram, which is the number of times that word n-gram appears in the corpus (or corpora) from which the word n-gram collection was generated. So, for instance, a corpus of documents or textual information could be parsed (sequenced) to identify specific word n-grams therein to create a word n-gram collection. Appending statistical information to the word n-gram collection, e.g., the count or frequency of occurrence for each n-gram, can yield an n-gram model. N-gram models may conventionally be stored in a database (which may be referred to as an n-gram database), typically in the form of a table or index, such that a given record (or row of the table) has one field containing data representing the word n-gram and another field containing data representing the count or frequency for that word n-gram.
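By way of illustration only, the relationship between a corpus, a word n-gram collection, and the counts that turn the collection into an n-gram model can be sketched as follows. The function names word_ngrams and build_ngram_model, the lowercasing, and the whitespace tokenization are assumptions made for this sketch and are not taken from any particular toolkit; a practical system would use a more careful tokenizer.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_model(text, max_n):
    """Count all word n-grams of order 1..max_n in a whitespace-tokenized text.

    The returned Counter plays the role of the conventional table:
    key = word n-gram, value = its count in the corpus.
    """
    tokens = text.lower().split()  # assumed, simplistic tokenization
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts

# Toy corpus, hypothetical and chosen only to echo the examples in the text.
model = build_ngram_model("the dog barked and the dog ran", max_n=3)
print(model[("the",)])                  # count of the 1-gram "the"
print(model[("the", "dog")])            # count of the 2-gram "the dog"
print(model[("the", "dog", "barked")])  # count of the 3-gram "the dog barked"
```

Each key/value pair in the resulting mapping corresponds to one record of the table form described above, with the n-gram in one field and its count in the other.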
An n-gram model can include more than just word sequences; it can also include, for instance, information regarding symbols, punctuation, or other linguistic sequences with meaning but not conventionally characterized as words. For brevity, the terminology “n-gram” may be used herein to refer to word-level n-grams as well as other types of n-grams.
Larger and larger text corpora have become available, such as the Google Web1T corpus (Brants and Franz, 2006), for example, and the World Wide Web (“the Web”) itself can be used as a corpus. The present inventor has observed that such large corpora combined with increasingly large n-gram lengths (e.g., 5-grams or even higher order) will call for the development of very large word n-gram models, requiring n-gram databases with billions of n-grams and multi-gigabyte storage sizes.
Conventional approaches for storing and deploying very large n-gram models may use file-based indexing (e.g., flat files with additional inverted indexing for search), relational databases (e.g., with SQL-based querying capabilities), or specialized systems such as language model (LM) toolkits (e.g., with custom-engineered encoding approaches). The specialized systems typically utilize one of two prominent architectures: prefix trees (also called “prefix tries”), in which each symbol in a sequence is represented by a node in a tree structure, or randomized hashing, which uses random hash functions to map n-gram strings to their associated counts or probabilities.
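As a rough illustration of the prefix-tree architecture mentioned above, the following minimal sketch stores one word per node so that n-grams sharing a prefix share a path from the root. The class names TrieNode and NgramTrie and the method names add and lookup are hypothetical, and the sketch omits the compact encodings that actual LM toolkits apply; it is not intended to represent any specific conventional system.

```python
class TrieNode:
    """One node per word; count is nonzero only where a stored n-gram ends."""
    def __init__(self):
        self.children = {}  # maps next word -> TrieNode
        self.count = 0

class NgramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram, count=1):
        """Walk or create one node per word, then add to the final node's count."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
        node.count += count

    def lookup(self, ngram):
        """Return the stored count, or 0 if the n-gram was never added."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

# Hypothetical counts, for illustration only.
trie = NgramTrie()
trie.add(("the", "dog"), 2)
trie.add(("the", "dog", "barked"), 1)
trie.add(("the", "cat"), 1)
print(trie.lookup(("the", "dog")))
print(trie.lookup(("a", "dog")))
```

In this sketch the three stored n-grams share a single node for the word the, which is the source of the space savings a prefix tree offers over storing each n-gram string in full.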
The present inventor has recognized that, for wider adoption and use of n-gram models in both research and practical applications, there is a need for greater simplicity and greater efficiency in working with n-gram models containing very large collections of n-grams.