The present invention relates generally to a method for keyword proximity searching in a document database, and more particularly, the present invention optimizes CPU-time and reduces I/O operations associated with keyword proximity searching of a document database.
Keyword proximity searching is an important tool for browsing and selecting relevant documents from a large document database. Proximity searching deals with the location and the distance relationship among keywords in a document. For example, the following is a typical keyword proximity search query: given a set of n keywords, find the documents that have a collection of keywords that are within r words from each other. More generally, keyword proximity search may have the form: find the documents in the database that contain all the n search keywords appearing in the same paragraph, page, or within r words, paragraphs, or pages, from each other by implementing a plane sweep search.
Algorithms for answering proximity search queries for a collection of point objects are covered extensively in the literature. Many of the algorithms described in the literature cover the topic from a computational geometry perspective, where the main interest is to develop algorithms that optimize the execution time (i.e., CPU-time) in terms of the input parameters of the problem. On the other hand, when large collections of objects are involved the disk-I/O cost is significant. As a result, from a database perspective, in addition to optimizing CPU-time, the interest is also in reducing the disk-I/O cost for answering such queries.
The present invention presents a new algorithm for answering keyword proximity search where the words represent objects that are points in the one-dimensional space. The algorithm takes into consideration the fact that the number of words in a document database is typically very large. Hence, it tries to reduce the overhead induced from multiple-retrieval of objects from auxiliary storage. In contrast to other algorithms in the literature, the performance of this new algorithm is optimal. As will be explain below, the overall performance of the algorithm is proportional to the total sizes of the input inverted lists and the size of the reported output tuples.
Assume a given collection of n keywords c1 . . . cn, where each keyword has associated with it a list of instances. Each instance in the list contains the location of the keyword in the document database. For example, let the keyword c1 be xe2x80x9cdogxe2x80x9d, c2 be xe2x80x9ccatxe2x80x9d, and c2 be xe2x80x9cratxe2x80x9d (i.e., n=3), and the instances correspond to the occurrences of the keywords in various locations of the documents in the database. For example, the instances of the keyword xe2x80x9ccatxe2x80x9d are  less than doc1, 15 greater than ,  less than doc1, 123 greater than ,  less than doc2, 25 greater than , and  less than doc7, 115 greater than which implies that the keyword xe2x80x9ccatxe2x80x9d appears in document doc1 at locations 15 and 123, in document doc2 at location 25, and in document doc7 at location 115, where doc1, doc2, and doc7 are unique identifies for documents 1, 2, and 7. It should also be noted that page information can be added as part of the location of the word in the document. This helps in extending the proximity search to refer to pages (e.g., find the documents that have the keywords dog and cat within the same page).
The instances of a given keyword are stored in a linear list, where each entry in the list identifies the relative location of the keyword instance as well as an identifier to retrieve the rest of the keyword""s description. This is also referred to as an inverted list. Keyword instances are stored in inverted lists so that a given keyword can easily be found in one or more documents. Assume that the entries in a given list (that corresponds to the locations of instances of a given keyword) are sorted according to their location. This is a reasonable assumption if we consider that the lists are constructed by traversing the document and inserting, for each keyword that appears in the documents, the entry (doc-id,pos) at the end of the list for the keyword. As an example, the following is possible entry in the inverted list for the keyword xe2x80x9ccomplexxe2x80x9d which appears in documents doc1, doc2, and doc4: [xe2x80x9ccomplexxe2x80x9d]xe2x80x94(doc1, 6), (doc1, 20), (doc1, 50), (doc2, 30), (doc4, 100), (doc4, 130). In this case, the list will be sorted in ascending order of the positions of the keyword in the database. For the purposes of the present invention it does not matter whether the elements are sorted in either ascending or descending order.
In contrast to a conventional nested loop searching algorithm, the goal of the present invention is to reduce the number of times the lists are traversed. The keyword proximity search algorithm of the present invention is based on a plane-sweep approach and performs like a merge join algorithm while trying to take advantage of the nature of the proximity condition. This is especially important for large documents and large document databases where the number of instances in each keyword entry is large and the lists of instances cannot be assumed to fit in main memory.
Accordingly, a proximity search algorithm is proposed for answering proximity search queries in a document database. The present invention alternates between two modes of operation to find the matching output tuples. A plane-sweep mode is used to efficiently search the inverted lists, thereby excluding tuples that do not contribute to the output. Once an output tuple satisfying the proximity search query is detected, the plane-sweep mode is terminated. Because of the nature of the proximity search query, some tuples that are in the proximity of the output tuple also satisfy the proximity condition. Hence, the algorithm switches to a nested-loop mode. The nested-loop mode performs a local nested loop join to enumerate all possible combinations of the keyword instances that satisfy the proximity search query in the neighborhood of the output tuple detected by plane-sweep mode. Upon enumerating all of these output tuples, the algorithm switches back to the plane-sweep mode.
For a more complete understanding of the invention, its objects and advantages refer to the following specification and to the accompanying drawings.