1. Field of the Invention
This invention relates to inverted indexes used in indexing text corpora, and particularly to systems and methods for multi-dimensional aggregation.
2. Description of Background
An inverted index is constructed over a given corpus of documents and consists of two primary structures: (1) a dictionary of all the unique terms in the corpus and (2) for each term in the dictionary, a list of the documents that contain the term. Large-scale text indexing is an active research area, and many advances have been made over the years in the efficiency, performance, and scale of indexes. Yet the general functionality of an index has not changed drastically during that period.
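By way of illustration only, the two structures described above may be sketched as follows. The sketch assumes whitespace tokenization and lower-casing, and the function name build_inverted_index and the sample corpus are hypothetical; none of these details is taken from any particular indexing system.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Return a dict mapping each unique term to a sorted list of document ids.

    The dict keys play the role of the dictionary of unique terms; each value
    is the list of documents that contain the term.
    Assumption: terms are whitespace-separated, lower-cased tokens.
    """
    postings = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


# Hypothetical three-document corpus, keyed by document id.
corpus = {
    1: "IBM builds servers",
    2: "servers index documents",
    3: "IBM research on index structures",
}
index = build_inverted_index(corpus)
```

In this sketch, index["ibm"] holds the ids of the documents that contain the term "ibm". Production indexes typically compress these lists and store positional or frequency information as well, which the sketch omits.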
In general, inverted indexes are built to serve very simple Boolean queries, such as “Find all documents that contain the word ‘IBM’”. An index responds to such a query with a subset of the documents that contain the queried terms and, potentially, an estimate of how many other documents also contain them. Yet the data within an index can provide much more insight than a list of documents for the user to investigate manually. For example, inverted indexes can be used to aggregate unstructured information across multiple dimensions for large corpora; such an aggregation could provide a by-email-address count of all e-mail addresses found in the .edu domain. However, current unstructured indexing techniques do not handle aggregation operations well, and current aggregation techniques do not handle unstructured information well.
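The contrast between a simple Boolean query and an aggregation may be sketched as follows. This is a naive illustration under stated assumptions, not the method of the invention: the toy index, the function names boolean_query and count_by_edu_address, and the simplified address pattern are all hypothetical, and the count reported for each address is the number of documents containing it rather than the number of occurrences.

```python
import re

# Hypothetical toy index: each term maps to the sorted ids of documents containing it.
index = {
    "ibm": [1, 3],
    "alice@mit.edu": [1, 2, 3],
    "bob@stanford.edu": [2],
    "carol@example.com": [1],
}

# Simplified pattern for addresses in the .edu domain (an assumption; real
# e-mail address syntax is considerably more permissive).
EDU_ADDRESS = re.compile(r"^[^@\s]+@[^@\s]+\.edu$")


def boolean_query(index, term):
    """Simple Boolean query: return the documents that contain the term."""
    return index.get(term.lower(), [])


def count_by_edu_address(index):
    """Aggregation: scan the dictionary and count documents per .edu address."""
    return {
        term: len(doc_ids)
        for term, doc_ids in index.items()
        if EDU_ADDRESS.match(term)
    }


matching_docs = boolean_query(index, "IBM")
edu_counts = count_by_edu_address(index)
```

The Boolean query returns only a list of documents for manual inspection, whereas the aggregation summarizes the dictionary itself. The full scan of the dictionary in this sketch is the kind of cost that makes such operations awkward for an index designed around single-term lookups.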