The exemplary embodiment relates to extracting information from n-grams. It finds particular application in document reconstruction from n-gram statistics.
Organizations may see advantages in releasing part of the data they own, whether for the general good, for prestige, to harness the work of those to whom the data is released, or to open access to new resources for financial gain. Often, it is not feasible to release the complete data due to privacy concerns, legal constraints, or economic interests. In such cases, a compromise is to release some statistics computed over the data. For example, the statistics released may include n-gram counts for text documents. Here, an n-gram is a sequence of n consecutive words.
Examples of where such information may be used include the release of n-gram statistics for copyrighted material (for example, the Google Ngram Corpus) and the exchange of phrase tables for machine translation when the original parallel corpora are private or confidential. There has been considerable interest in reconstructing at least part of a document, given the counts of all its n-grams. A similar problem is solved routinely in DNA sequencing by mapping the n-grams into a graph (the de Bruijn graph). An n-dimensional de Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n edges, consisting of all possible length-n sequences of the given symbols, where the same symbol may appear multiple times in a sequence. An Eulerian tour (a path through the graph which visits every edge exactly once) can then be found in this graph. See, for example, Phillip E. C. Compeau, et al., “How to apply de Bruijn graphs to genome assembly,” Nature Biotechnology, 29(11):987-91, November 2011, hereinafter, Compeau, et al. However, the number of possible Eulerian tours can grow faster than exponentially with the number of nodes, and only one of these tours corresponds to the original document.
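By way of illustration only, the mapping of n-gram counts into a de Bruijn graph and the search for an Eulerian tour may be sketched as follows. This is a minimal sketch and not the method of the exemplary embodiment. It assumes that each n-gram is treated as an edge from its first n-1 words to its last n-1 words, that the starting node is known, and that a tour exists. The function names (ngram_counts, de_bruijn_edges, eulerian_path, spell) and the toy word sequence are hypothetical, and the tour is found with Hierholzer's algorithm, which is one standard choice.

```python
from collections import Counter, defaultdict


def ngram_counts(words, n):
    """Count every window of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def de_bruijn_edges(counts):
    """Turn each n-gram into an edge from its (n-1)-word prefix to its
    (n-1)-word suffix, repeated once per occurrence."""
    graph = defaultdict(list)
    for gram, count in sorted(counts.items()):
        graph[gram[:-1]].extend([gram[1:]] * count)
    return graph


def eulerian_path(graph, start):
    """Hierholzer's algorithm: return one path that uses every edge once.
    Assumes such a path exists from the given start node."""
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finalized
    return path[::-1]


def spell(path):
    """Convert a path of (n-1)-word nodes back into a word sequence."""
    return list(path[0]) + [node[-1] for node in path[1:]]


# Hypothetical toy document; n = 2, so nodes are single words.
words = "a b a c a".split()
counts = ngram_counts(words, 2)
graph = de_bruijn_edges(counts)
rebuilt = spell(eulerian_path(graph, start=tuple(words[:1])))
print(" ".join(rebuilt))
```

In this toy case both "a b a c a" and "a c a b a" have identical bigram counts, so the recovered sequence is consistent with the counts but need not be the original document, which illustrates the ambiguity among Eulerian tours noted above.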
It would be desirable to be able to reduce the de Bruijn graph to an irreducible form, from which larger blocks or sub-sequences of the document can more easily be identified.