1. Field of the Invention
The present invention relates to an inverted index structure and a method of building the same and, more particularly, to a two-level n-gram inverted index structure (simply referred to as the n-gram/2L index) and to methods of building the index, processing queries and deriving the index, which reduce the size of the n-gram index and improve query performance by eliminating the redundancy of the position information that exists in the n-gram inverted index (simply referred to as the n-gram index).
2. Description of Related Art
Searching text data is a fundamental and important operation that is widely used in many areas, such as information retrieval and similar sequence matching for DNA and protein databases. DNA and protein sequences are regarded as text data over specific alphabets, e.g. A, C, G and T in DNA. A variety of index structures have been studied for efficiently processing search operations on text data, and the inverted index is the most practical and most widely used of these structures (Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999).
The inverted index is a term-oriented mechanism for quickly searching documents containing terms given as a query. Here, the document is a defined sequence of characters and the term is a subsequence of the document. The inverted index fundamentally consists of terms and posting lists (Falk Scholer, Hugh E. Williams, John Yiannis and Justin Zobel, “Compression of Inverted Indexes for Fast Query Evaluation”, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere, Finland, pp. 222-229, August 2002).
A posting list is associated with a specific term. It manages, as a list structure, the identifiers of the documents that contain the term together with the position information indicating where the term occurs in each document. Here, a document identifier and its position information are collectively referred to as a posting.
For each term t, there is a posting list that contains postings <d, [o1, . . . , of]>, wherein d denotes a document identifier, [o1, . . . , of] is the list of offsets o at which the term t occurs in the document d, and f represents the frequency of occurrence of the term t in the document (Falk Scholer, Hugh E. Williams, John Yiannis and Justin Zobel, “Compression of Inverted Indexes for Fast Query Evaluation”, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere, Finland, pp. 222-229, August 2002).
The postings in a posting list are usually stored in increasing order of the document identifiers d, and the offsets within each posting are stored in increasing order of the offsets o, in order to facilitate query processing. In addition, an index such as the B+-tree is created on the terms in order to quickly locate the posting list of a given term. FIG. 1 shows the structure of the inverted index.
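The posting-list organization described above may be illustrated by the following minimal sketch. It is an illustration only and not the structure of FIG. 1: the function name build_posting_lists, the sample terms and the sample occurrences are hypothetical, and a dictionary whose keys are inserted in sorted order merely stands in for a B+-tree on the terms.

```python
from collections import defaultdict

def build_posting_lists(occurrences):
    """Group (term, doc_id, offset) triples into posting lists.

    Each posting is a pair (d, [o1, ..., of]); postings are ordered by
    increasing document identifier and offsets by increasing offset.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for term, doc_id, offset in occurrences:
        grouped[term][doc_id].append(offset)

    index = {}
    # Inserting terms in sorted order stands in for a B+-tree on the terms
    # (an assumption made for brevity).
    for term in sorted(grouped):
        index[term] = [
            (doc_id, sorted(grouped[term][doc_id]))
            for doc_id in sorted(grouped[term])
        ]
    return index

# Hypothetical occurrences, given in arbitrary order.
occurrences = [("cat", 2, 7), ("cat", 0, 4), ("dog", 1, 0), ("cat", 0, 1)]
index = build_posting_lists(occurrences)
print(index["cat"])  # [(0, [1, 4]), (2, [7])]
```

In this sketch the frequency f of a term in a document is simply the length of the offset list of the corresponding posting.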
According to the kind of terms used, the inverted index is classified into a word-based inverted index, which uses a word as a term, and an n-gram inverted index (simply referred to as n-gram index), which uses an n-gram as a term (Ethan Miller, Dan Shen, Junli Liu and Charles Nicholas, Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System, Journal of Digital Information 1(5), pp. 1-25, January 2000).
The n-gram index is an inverted index built by extracting n-grams as indexing terms. When a document d is given as a sequence of characters c0, c1, . . . , cN−1, an n-gram is a string composed of n consecutive characters extracted from d.
Extracting all n-grams from the given document d in order to build an n-gram index can be done via a 1-sliding technique, i.e., sliding a window composed of n consecutive characters from c0 to cN−n and storing the characters located in the window. Accordingly, the ith n-gram extracted from d is the string of ci, ci+1, . . . , ci+n−1.
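The 1-sliding technique may be sketched as follows. The function name extract_ngrams and the sample string are hypothetical and serve only to illustrate the window positions 0 through N−n described above.

```python
def extract_ngrams(doc, n):
    """Return (offset, n-gram) pairs obtained by sliding a window of n
    characters one position at a time from c0 to c(N-n)."""
    return [(i, doc[i:i + n]) for i in range(len(doc) - n + 1)]

# Hypothetical document of N = 4 characters, n = 2.
print(extract_ngrams("ABCD", 2))  # [(0, 'AB'), (1, 'BC'), (2, 'CD')]
```

A document shorter than n characters yields no n-grams in this sketch.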
FIG. 2 is an example of an n-gram index created from a set of given documents, wherein n=2. FIG. 2A shows the set of documents and FIG. 2B shows the n-gram index created on the documents.
Processing a query using the n-gram index is carried out in two steps: (1) splitting a given query string into multiple n-grams and searching the posting lists of those n-grams; and (2) merging the posting lists (Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999).
For example, a query for searching documents containing a string of “BCDD” in the n-gram index in FIG. 2 includes the following two steps.
In the first step, the query “BCDD” is split into three 2-grams “BC”, “CD” and “DD” and the respective 2-grams are searched in the inverted index.
In the second step, the posting lists corresponding to the respective 2-grams are merge-joined on the document identifiers, and the documents in which the three 2-grams “BC”, “CD” and “DD” occur consecutively to constitute “BCDD” are found. Since the three 2-grams occur consecutively in the documents 0 and 2 in the scanned posting lists, the query result is the document identifiers 0 and 2.
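The two query-processing steps may be illustrated by the following minimal sketch. The documents used here are hypothetical and are not the documents of FIG. 2A; they were chosen only so that the query “BCDD” again yields the document identifiers 0 and 2. The function names are likewise hypothetical, and the merge-join is simplified to a per-document set intersection of candidate start offsets, which is an assumption made for brevity rather than the merge procedure of the cited reference.

```python
from collections import defaultdict

def build_ngram_index(docs, n):
    """Build posting lists {n-gram: [(doc_id, [offsets...]), ...]}."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        offsets = defaultdict(list)
        for i in range(len(doc) - n + 1):  # 1-sliding extraction
            offsets[doc[i:i + n]].append(i)
        for gram, offs in offsets.items():
            index[gram].append((doc_id, offs))
    return index

def search(index, query, n):
    """Return identifiers of documents containing the query string."""
    # Step 1: split the query into n-grams; the k-th n-gram must occur
    # at offset (start + k) for some common start offset.
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    candidates = None
    for k, gram in enumerate(grams):
        starts = {d: {o - k for o in offs} for d, offs in index.get(gram, [])}
        # Step 2: join on the document identifier and keep only start
        # offsets at which all n-grams seen so far occur consecutively.
        if candidates is None:
            candidates = starts
        else:
            candidates = {
                d: candidates[d] & starts[d] for d in candidates if d in starts
            }
    return sorted(d for d, s in (candidates or {}).items() if s)

# Hypothetical documents (not those of FIG. 2A).
docs = ["ABCDDAB", "BCACDADD", "BCDDBC"]
index = build_ngram_index(docs, 2)
print(search(index, "BCDD", 2))  # [0, 2]
```

Document 1 in this sketch contains all three 2-grams, but not at consecutive offsets, so it is excluded from the result. This shows why the offsets, and not only the document identifiers, must be compared.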
The n-gram index has language-neutral and error-tolerant advantages (Ethan Miller, Dan Shen, Junli Liu and Charles Nicholas, Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System, Journal of Digital Information 1(5), pp. 1-25, January 2000).
The language-neutral advantage means that the n-gram index does not need linguistic knowledge, since the index terms are extracted in a mechanical manner.
Owing to this characteristic, the n-gram index has been widely used for Asian languages, where complicated linguistic knowledge is required, and for DNA and protein databases, where the concept of a word is not well defined.
The error-tolerant advantage means that the n-gram index can retrieve documents even though the documents contain some errors, e.g., typos or misrecognition by OCR software, since the n-grams are extracted by the 1-sliding technique.
Accordingly, the n-gram index has been effectively used for applications that search documents while allowing errors, such as approximate string matching.
Nevertheless, the n-gram index also has drawbacks in that the size of the index becomes large and query processing takes a long time (James Mayfield and Paul McNamee, Single N-gram Stemming, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Toronto, Canada, pp. 415-416, July/August 2003).
These drawbacks result from the characteristics of the method of extracting terms, i.e., the 1-sliding technique.
The 1-sliding technique drastically increases the number of n-grams extracted, thus increasing the size of the n-gram index.
Moreover, it has a drawback in that it takes a long time to process queries, since the number of the postings to access during the query processing increases.
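The growth in the number of extracted n-grams may be illustrated by simple arithmetic. Under the 1-sliding technique a document of N characters yields N − n + 1 n-grams, i.e., nearly one offset entry per character. The sketch below compares this with a non-overlapping (n-sliding) extraction, which is introduced here solely as a hypothetical baseline for comparison and is not taken from the cited references; the function names and the document length are likewise assumptions.

```python
def count_1_sliding(doc_length, n):
    """Number of n-grams when the window advances one character at a time."""
    return max(doc_length - n + 1, 0)

def count_non_overlapping(doc_length, n):
    """Number of n-grams when the window advances n characters at a time
    (hypothetical baseline, for comparison only)."""
    return doc_length // n

# Hypothetical document of N = 1000 characters.
for n in (2, 3, 4):
    print(n, count_1_sliding(1000, n), count_non_overlapping(1000, n))
# 2 999 500
# 3 998 333
# 4 997 250
```

Every one of these n-grams contributes an offset to some posting, so the total number of stored offsets stays close to N per document regardless of n.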