The present invention relates generally to systems and methods for querying documents against persistent queries and more particularly to a system for querying incremental document changes in a persistent query system.
The amount of information generated, managed, retrieved, and so on is expanding at an exponential rate. As a result, tools for managing the information are gaining significance as users attempt to control and harness the information contained in documents, web pages, and the like. For example, increasingly, business entities are instituting document management systems that facilitate the control and sharing of documents generated by their users. Such systems employ "electronic filtering" techniques to assist users in sorting through the massive amounts of information.
A key aspect of such systems is a mechanism that enables users to submit queries that are compared to properties of documents managed by the document management systems. As it turns out, document management systems, while generally built upon database technology, exhibit usage characteristics that can be exploited to enhance system performance. For example, many users submit queries that remain persistent such that as new documents are generated and entered into the system, the new documents are compared to previously submitted queries. Thus, the queries are stored and compared against a stream of incoming documents. The user queries generally consist of one or more search terms (or document properties) connected by one or more Boolean operators. This task is alternatively referred to as "selective dissemination of information," or as the "inverted query problem."
The primary difficulties in document filtering arise from the massive scale of queries to be evaluated against the high frequency of incoming documents to be filtered. For example, news filtering on the Internet may involve a stream of potentially very many documents per second, with each document being filtered against millions of user queries. Comparing each document against each query is impractical, as providing hardware capable of acceptable throughput is cost-prohibitive.
As a result, known filtering systems reduce the time taken to examine a document by first eliminating all queries which are irrelevant to a given document. For example, a relatively rapid test can be performed to eliminate any monotone logic queries (queries which do not contain non-monotone operators such as negation) which consist solely of words which are not in the document. Moreover, statistical techniques can speed up the detection of a query's relevance or irrelevance to a particular document.
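The rapid elimination test described above can be illustrated with a minimal sketch. It assumes each monotone query is reduced to the set of words it mentions; the names build_term_index and relevant_queries and the sample queries are hypothetical and do not come from any particular known system.

```python
from collections import defaultdict

def build_term_index(queries):
    """Map each search term to the ids of the monotone queries that use it."""
    index = defaultdict(set)
    for query_id, terms in queries.items():
        for term in terms:
            index[term].add(query_id)
    return index

def relevant_queries(index, doc_words):
    """Return ids of queries sharing at least one term with the document.

    A monotone query that shares no term with the document cannot match,
    so every query outside the returned set is eliminated without being
    evaluated.
    """
    hits = set()
    for word in doc_words:
        hits |= index.get(word, set())
    return hits

# Hypothetical monotone queries, each given as the set of words it mentions.
queries = {"q1": {"merger", "bank"}, "q2": {"football"}}
index = build_term_index(queries)
survivors = relevant_queries(index, {"bank", "loan"})
```

The surviving queries still have to be evaluated in full; the index only prunes queries that cannot possibly match.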
However, after the irrelevant queries have been eliminated, the document still must be tested against the (possibly large) number of remaining queries, and such testing is relatively slow. Moreover, if non-monotone operators such as negation are allowed, the pre-filtering elimination process is significantly complicated, because a query containing such an operator cannot be ruled out merely because its words are absent from the document.
Other systems compile sets of user queries into acyclic graphs. The acyclic graph technique lists all search terms as endpoints (i.e., sources) in the graph and combines the set of user queries into a hierarchy of query nodes. The acyclic graph reduces redundancies by combining a set of queries into a single query. Thereafter, a document is scanned for terms matching the acyclic-graph source nodes. The entire set of queries comprising the graph is then evaluated substantially simultaneously. As a result, a document need only be scanned once for matching query terms. Unfortunately, when a document is edited, current systems require that the entire document be rescanned and the entire acyclic-graph query be re-evaluated. When the document is large and the combined acyclic-graph query is complex, the processing time is significant. This is so even where the editorial changes to the document are relatively minor. Re-filtering the entire changed document against the queries consumes system resources and degrades system performance.
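By way of illustration only, the acyclic-graph technique might be sketched as follows. The node encoding (a kind paired with an argument), the function name evaluate_graph, and the sample node names are assumptions made for this sketch rather than the structure used by any particular prior system. Shared sub-expressions are evaluated once and reused by every query that references them.

```python
def evaluate_graph(nodes, doc_words):
    """Evaluate every node of an acyclic query graph against one document.

    nodes maps a node name to (kind, arg), where kind is "term" (arg is a
    word) or "and" / "or" / "not" (arg is a list of child node names).
    """
    cache = {}

    def value(name):
        # Each node is computed once, even if several queries share it.
        if name not in cache:
            kind, arg = nodes[name]
            if kind == "term":
                cache[name] = arg in doc_words
            elif kind == "and":
                cache[name] = all(value(child) for child in arg)
            elif kind == "or":
                cache[name] = any(value(child) for child in arg)
            elif kind == "not":
                cache[name] = not value(arg[0])
            else:
                raise ValueError(f"unknown node kind: {kind}")
        return cache[name]

    return {name: value(name) for name in nodes}

# Hypothetical graph: q1 and q2 share the sub-expression merger_and_bank.
nodes = {
    "t_merger": ("term", "merger"),
    "t_bank": ("term", "bank"),
    "t_rumor": ("term", "rumor"),
    "merger_and_bank": ("and", ["t_merger", "t_bank"]),
    "no_rumor": ("not", ["t_rumor"]),
    "q1": ("or", ["merger_and_bank", "t_rumor"]),
    "q2": ("and", ["merger_and_bank", "no_rumor"]),
}
results = evaluate_graph(nodes, {"merger", "bank", "announced"})
```

In this sketch the document words are scanned once to settle the source nodes, after which all queries are resolved in a single pass over the graph. Any edit to the document would nevertheless require the whole evaluation to be repeated, which is the drawback noted above.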
A goal of the present invention is to overcome the drawbacks of the prior art by presenting methods and apparatus that reduce the processing resources required to re-filter an edited document.
The present invention provides a system and method for analyzing changes to a document. The system analyzes the incremental changes to the document against the user queries without requiring the entire document to be re-analyzed. Once a document has been analyzed by the system, subsequent changes require only that a small subset of the document be reprocessed. The analysis of the small subset is facilitated by maintaining an incremental-results data set for each document.
The first step in the analysis is to generate a dictionary of terms from user-submitted queries. The second step is to generate an incremental-results data set that reduces the document to the set of words that match dictionary terms. When the document is subsequently changed, only the changed portion is compared to the dictionary terms. The resulting set of changes is cross-referenced against the dictionary such that only the words deleted from or added to the changed portion are used to update the incremental-results data set.
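The two steps and the subsequent update may be sketched as follows. This is a minimal illustration under stated assumptions: the incremental-results data set is modeled as a per-document count of dictionary terms, and the function names build_dictionary, initial_results, and apply_change, together with the sample queries and text, are hypothetical and are not taken from the description.

```python
from collections import Counter

def build_dictionary(queries):
    """Step one: collect every term used by any user-submitted query."""
    return {term for terms in queries.values() for term in terms}

def initial_results(dictionary, doc_words):
    """Step two: reduce the document to counts of its dictionary terms."""
    return Counter(word for word in doc_words if word in dictionary)

def apply_change(results, dictionary, deleted_words, added_words):
    """Update the results from an edit and report terms whose presence flipped.

    Only words in the changed portion are examined; the rest of the
    document is never rescanned.
    """
    deleted = [w for w in deleted_words if w in dictionary]
    added = [w for w in added_words if w in dictionary]
    touched = set(deleted) | set(added)
    before = {w for w in touched if results[w] > 0}
    results.subtract(deleted)
    results.update(added)
    for word in touched:
        if results[word] <= 0:
            del results[word]  # drop terms that no longer occur
    after = {w for w in touched if results[w] > 0}
    return before ^ after

queries = {"q1": {"merger", "bank"}, "q2": {"football"}}
dictionary = build_dictionary(queries)
text = "the bank announced a merger with another bank"
results = initial_results(dictionary, text.split())
flipped = apply_change(results, dictionary,
                       deleted_words=["merger", "with"],
                       added_words=["football", "bank"])
```

Only terms reported as flipped can alter the outcome of a purely Boolean query, so in this sketch an edit that adds a second occurrence of an already-present term triggers no re-evaluation at all.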
According to another aspect of the invention, the number of queries evaluated after document changes are made can be reduced as well. The invention recognizes that many documents undergo changes in particular phases of development. As a result, queries may be selectable based on document phase as well. When a document undergoes changes, the document changes may only need to be filtered against the set of queries relevant to a particular phase.
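Phase-based selection might be sketched as follows, assuming each query is tagged with the document phases to which it applies. The phase names ("draft", "review", "final"), the query ids, and the function name queries_for_phase are hypothetical examples chosen for this sketch.

```python
def queries_for_phase(query_phases, phase):
    """Return ids of the queries registered for the given document phase."""
    return {qid for qid, phases in query_phases.items() if phase in phases}

# Hypothetical assignment of queries to phases of document development.
query_phases = {
    "q_legal": {"review", "final"},
    "q_style": {"draft"},
    "q_release": {"final"},
}
active = queries_for_phase(query_phases, "review")
```

Under this assumption, a change made while the document is in the review phase would be filtered only against the queries in the returned set.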
Other aspects of the present invention are described below.