1. Field of the Invention
The invention generally relates to statistical language models and in particular to a compact representation of lexical information including linguistic frequency data.
2. Description of Related Art
Statistical language models are important prerequisites for many natural language processing tasks, such as syntactic parsing or coding of distorted natural language input. Tasks related to natural language processing include improved methods of optical character recognition that use statistical information on word co-occurrences to achieve reduced error rates via context-aware decoding. Other application areas are context-sensitive word sense disambiguation, expansion of acronyms, structural disambiguation, parsing with statistical models and text categorization or classification.
In the context of such applications, there is a need to represent considerable amounts of lexical information, often in the form of frequencies of joint occurrences of word pairs and n-tuples. Many natural language processing tasks require the estimation of linguistic probabilities, i.e., the probability that a word appears in a certain context. Typically, such a context also contains lexical information, so that one must estimate the probability that an unknown word, which is supposed to stand in a given relation to a given word, will turn out to be identical to a certain other word. Quite often, relevant probability models involve more than two words, such as the classic trigram models in speech recognition, where the probability of a word is conditioned on the two words preceding it.
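By way of illustration only, the conditioning of a word on its two predecessors may be sketched as a relative-frequency estimate over trigram counts. The function name trigram_mle and the toy word sequence are assumptions made for this sketch; practical models additionally apply smoothing, which is omitted here.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return a function estimating p(w3 | w1, w2) by relative frequency."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each word pair only where a third word follows it,
    # so that the estimates for a given history sum to one.
    histories = Counter(zip(tokens, tokens[1:-1]))

    def prob(w1, w2, w3):
        h = histories[(w1, w2)]
        # Unseen histories get probability 0.0 in this unsmoothed sketch.
        return trigrams[(w1, w2, w3)] / h if h else 0.0

    return prob

tokens = "the cat sat on the mat and the cat ran".split()
p = trigram_mle(tokens)
print(p("the", "cat", "sat"))  # 0.5: "the cat" is followed once by "sat", once by "ran"
```

Each such estimate requires two frequency lookups, which is why fast retrieval of the count for a given tuple matters.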
Research on language modeling has so far mostly focused on the task of speech recognition. In such a mainly interactive application, it seems justifiable to restrict the attention to some 10,000 of the most frequent words in the given application domain and to replace words that are too rare by some generic placeholder. However, as the attention shifts towards applications with larger vocabulary size, such as lexical language modeling for optical character recognition, stochastic parsing and semantic disambiguation for open domains, or modeling the word usage on the Internet, it becomes more important to be able to push the size limit of the vocabulary that can be incorporated in a stochastic model.
The use of such models requires the storage of very many possible tuples of values together with the frequencies with which these tuples appeared in the training data. To make the use of such models practical, the storage scheme should allow retrieval of the frequency associated with a given tuple in near-constant time.
Whereas a fairly straightforward encoding in standard data structures, such as hash tables, is sufficient for the research and optimization of the statistical models, it is clear that any inclusion of these models into real products will require very careful design of the data structures so that the final product can run on a standard personal computer equipped with a typical amount of main memory. Storage space can be saved at the cost of accuracy, for instance by ignoring tuples that appear only once or ignoring tuples that involve rare words. However, rare words or tuples do play an important role in the overall accuracy of such models, and their omission leads to a significant reduction in model quality.
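The straightforward hash-table encoding and the lossy pruning of tuples that appear only once may be sketched as follows. This is a minimal illustration under stated assumptions: the names count_tuples and prune_singletons and the toy token sequence are hypothetical, and a Python dictionary stands in for the hash table.

```python
from collections import Counter

def count_tuples(tokens, n=3):
    """Map each n-tuple of consecutive tokens to its frequency (hash-table baseline)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def prune_singletons(counts):
    """Save space at the cost of accuracy by dropping tuples seen only once."""
    return {t: c for t, c in counts.items() if c > 1}

tokens = "a b c a b c a b d".split()
counts = count_tuples(tokens)
pruned = prune_singletons(counts)

# Lookup is near constant time, but the singleton ("a", "b", "d") is lost.
print(counts[("a", "b", "d")])       # 1
print(pruned.get(("a", "b", "d")))   # None
```

After pruning, a tuple that was observed once can no longer be distinguished from one that was never observed, which is the loss of model quality referred to above.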
Several techniques have been developed that can be used for encoding linguistic frequency data.
One of these technologies is the Xerox Finite-State Tool described in Lauri Karttunen, Tamás Gaál and André Kempe, “Xerox Finite-State Tool”, Technical Report, Xerox Research Center Europe, Grenoble, France, June 1997. The Xerox Finite-State Tool is a general-purpose utility for computing with finite-state networks. It enables the user to create simple automata and transducers from text and binary files, regular expressions and other networks by a variety of operations. A user can display, examine and modify the structure and the content of the networks. The result can be saved as text or binary files. The Xerox Finite-State Tool provides two alternative encodings, developed at XRCE (Xerox Research Center Europe) and at PARC (Xerox Palo Alto Research Center), the latter being based on the compression algorithm described in U.S. Pat. No. 5,450,598.
Another technique is the CMU/Cambridge Statistical Language Modeling Toolkit described in Philip Clarkson and Ronald Rosenfeld, “Statistical Language Modeling Using the CMU-Cambridge Toolkit”, Proceedings ESCA Eurospeech, 1997. The toolkit was released in order to facilitate the construction and testing of bigram and trigram language models.
To obtain an exemplary, very large data set, 10,000,000 word trigram tokens from a technical domain were picked from the patent database of the European Patent Office. All non-alphanumeric characters were treated as token separators. In this example, N=3,244,038 different trigram types could be identified. The overall vocabulary size, i.e., the number of different strings, was 49,177.
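The derivation of such figures from raw text may be sketched as follows. The sketch assumes that only ASCII letters and digits count as alphanumeric and that no case folding is applied; neither detail is stated above. The names tokenize and corpus_stats and the sample sentence are hypothetical.

```python
import re

def tokenize(text):
    # Assumption: every run of characters outside 0-9, A-Z, a-z is a separator.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def corpus_stats(text):
    """Return (trigram tokens, distinct trigram types, vocabulary size)."""
    tokens = tokenize(text)
    trigram_tokens = list(zip(tokens, tokens[1:], tokens[2:]))
    return len(trigram_tokens), len(set(trigram_tokens)), len(set(tokens))

sample = "A method, and a device; a method and a device."
print(corpus_stats(sample))  # (8, 6, 5)
```

In the sample, eight trigram tokens collapse to six trigram types over a vocabulary of five strings; in the data set described above the corresponding figures are 10,000,000, 3,244,038 and 49,177.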
When encoded with the Xerox Finite-State Tool, this data could be compressed to a size of 4.627 bytes per entry in the file-based representation. However, when these data structures are loaded into memory, they are expanded by a factor greater than 8, which renders the representation difficult to use in a large-scale setting.
The compressed representation based on the PARC encoding is loaded into memory and used as is, which renders it in principle more attractive at run time. However, practical tests showed that the current implementation does not support data sets consisting of word trigrams taken from large corpora.
Finally, the CMU/Cambridge Statistical Language Modeling Toolkit stores trigram information that refers to the bigram information. Thus, the toolkit stores 996,766 bigrams and 3,244,035 trigrams (plus some additional information, such as smoothing parameters) in a binary representation of 26,130,843 bytes. This means that, although the documentation states that eight bytes are required per bigram and four bytes per trigram, about eight bytes are effectively required for storing each trigram, and there is no appropriate way to use the toolkit to store only the trigram counts.
Thus, the XRCE representation obtained by the Xerox Finite-State Tool expands to large data structures when loaded into main memory. The PARC representation does not support very large networks, i.e., networks with several million states and arcs. Finally, the CMU/Cambridge Statistical Language Modeling Toolkit effectively requires about eight bytes per trigram type. This means that the prior art technologies either do not support large-scale statistical language models, i.e., they do not scale up to the required amount of data, or they offer inferior compression.