1. Field of the Invention
The present invention relates to the field of information retrieval and more specifically to an apparatus for a secure and granular information retrieval index and its method of operation.
2. Discussion of Related Art
A typical full-text index inverts the words in a document to allow efficient lookup of terms from a query. Here, “inverts” refers to organizing word locations under each word, rather than keeping the ordinary sequential order in which words occur in prose. This is analogous to the back-of-book index in a paper book (i.e., the common human-built index found at the end of most conventional non-fiction books). Such a full-text index is typically referred to as an inverted index.
An inverted index consists of tokens (the individual terms in a document) along with their posting lists (the documents in which each token occurs and its word offsets within them). Example document 1, shown in FIG. 1A, contains the tokens “elizabeth” at offset 1, “david” at offset 2, etc.
To index both of the documents represented in FIGS. 1A and 1B, the inverted index would be as shown in FIG. 1C. The posting list in the FIG. 1C example is made up of tuples (pairs of related numbers), each containing a document number (an identifier kept in ascending order so that the list can be read sequentially) and an offset within that document. Posting lists are typically compressed to save space, and are processed sequentially.
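The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figures are not reproduced here, so the two sample documents below are hypothetical stand-ins whose filler words ("writes", "about", "good") were chosen solely to match the offsets quoted in the text, and the function name `build_inverted_index` is likewise an assumption. Compression of the posting lists is omitted.

```python
from collections import defaultdict

# Hypothetical stand-ins for the documents of FIGS. 1A and 1B.
# Only the positions of "elizabeth", "david", "eats" and "food" come from
# the text; the remaining words are placeholders.
DOCS = {
    1: "elizabeth david writes about food",
    2: "david eats good food",
}

def build_inverted_index(docs):
    """Map each token to a posting list of (doc_number, offset) tuples.

    Offsets are 1-based word positions. Documents are visited in ascending
    document-number order, so every posting list is already sorted.
    """
    index = defaultdict(list)
    for doc_number in sorted(docs):
        for offset, token in enumerate(docs[doc_number].split(), start=1):
            index[token].append((doc_number, offset))
    return dict(index)

index = build_inverted_index(DOCS)
print(index["david"])  # [(1, 2), (2, 1)]
print(index["food"])   # [(1, 5), (2, 4)]
```

Visiting documents in ascending order keeps each posting list sorted without a separate sort step, which is what makes sequential processing of the lists straightforward.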
This data structure is used during a full-text query in order to retrieve the documents and positions within documents that satisfy the query. For example, the query “food” would look up token “food”, retrieve the posting list, and return document 1, offset 5, and document 2, offset 4. The position information found in a posting list may be used to jump to an offset within a document, or to perform extended Boolean query functions, such as finding phrases, or words within a certain proximity to one another.
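As a minimal sketch of the single-term lookup just described, the snippet below hard-codes posting lists only for the tokens and offsets named in the text; it is not the complete table of FIG. 1C. The names `INDEX` and `lookup` are hypothetical.

```python
# Posting lists for the tokens named in the text: (doc_number, offset).
# This is a partial index, not the full table of FIG. 1C.
INDEX = {
    "david": [(1, 2), (2, 1)],
    "eats": [(2, 2)],
    "elizabeth": [(1, 1)],
    "food": [(1, 5), (2, 4)],
}

def lookup(index, token):
    """Return the posting list for a token, or an empty list if it is absent."""
    return index.get(token.lower(), [])

for doc_number, offset in lookup(INDEX, "food"):
    print(f"document {doc_number}, offset {offset}")
# prints:
# document 1, offset 5
# document 2, offset 4
```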
To perform a phrase query such as “elizabeth david”, the query process looks up tokens “elizabeth” (in document 1, offset 1) and “david” (in document 1, offset 2, and document 2, offset 1). Since both words occur together only in document 1, the query process looks there for adjacent occurrences, that is, “david” immediately following “elizabeth.” Since “elizabeth” occurs at offset 1 and “david” occurs at offset 2, this condition is satisfied, and a phrase match is returned.
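The adjacency test for a two-word phrase can be sketched as follows. The posting lists are the ones quoted in the text; the function name `phrase_match` is an assumption, and a practical implementation would typically merge the two sorted lists rather than build a set.

```python
# Posting lists quoted in the text: (doc_number, offset).
INDEX = {
    "elizabeth": [(1, 1)],
    "david": [(1, 2), (2, 1)],
}

def phrase_match(index, first, second):
    """Return (doc_number, offset) of each `first` directly followed by `second`."""
    second_positions = set(index.get(second, []))
    return [
        (doc_number, offset)
        for doc_number, offset in index.get(first, [])
        # A match requires the same document and the very next word offset.
        if (doc_number, offset + 1) in second_positions
    ]

print(phrase_match(INDEX, "elizabeth", "david"))  # [(1, 1)]
print(phrase_match(INDEX, "david", "elizabeth"))  # []
```

The second call shows that word order matters: “david” never immediately precedes “elizabeth”, so no match is returned.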
Full-text indexes have a number of security issues. A fundamental issue is that the majority of the contents of the documents in the indexed collection can be recreated from the information in the index alone. To do this, the inverted index is itself inverted. Since token values and positions are all indexed, there is little loss of information. All that is necessary is to understand the structure of the token array and posting list, which is typically a compression of a simple binary encoding.
Reading top-down through the example inverted index in FIG. 1C, after the first row one constructs doc 1: “______ david” and doc 2: “david”. After the second index row, one constructs doc 1: “______ david” and doc 2: “david eats”. After the third row, one constructs doc 1: “elizabeth david” and doc 2: “david eats”, and so on until the document collection is complete, minus minor formatting information.
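The reconstruction walk described above can be illustrated with a short sketch. It uses only the posting lists named in the text, so offsets whose tokens are not given there remain as blanks; with a complete index, no blanks would be left. The names `INDEX` and `reconstruct` are hypothetical.

```python
# Partial index built from the tokens named in the text: (doc_number, offset).
INDEX = {
    "david": [(1, 2), (2, 1)],
    "eats": [(2, 2)],
    "elizabeth": [(1, 1)],
    "food": [(1, 5), (2, 4)],
}

def reconstruct(index, blank="______"):
    """Invert the inverted index: place every token back at its offset."""
    slots = {}  # doc_number -> {offset: token}
    for token, postings in index.items():
        for doc_number, offset in postings:
            slots.setdefault(doc_number, {})[offset] = token
    documents = {}
    for doc_number, words in slots.items():
        last = max(words)
        # Offsets with no known token are shown as blanks.
        documents[doc_number] = " ".join(
            words.get(offset, blank) for offset in range(1, last + 1)
        )
    return documents

for doc_number, text in sorted(reconstruct(INDEX).items()):
    print(f"doc {doc_number}: {text}")
# prints:
# doc 1: elizabeth david ______ ______ food
# doc 2: david eats ______ food
```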