The present invention relates to techniques for obtaining answers to queries, where the answers include items of data that have been cached for semantic regions.
Tavakoli, N., and Ray, A., xe2x80x9cA New Signature Approach for Retrieval of Documents from Free-Text Databasesxe2x80x9d, Information Processing and Management, Vol. 28, No. 2, 1992, pp. 153-163, describes a signature approach to information retrieval in which the sizes of signature files are dependent on the number of unique symbols in the alphabet. A signature maintains the positional information of characters and allows for Don""t Care Characters to be used in queries.
In client-server information systems, local client memory is largely used to cache data and to minimize future interaction with servers. This data caching is particularly important on the World Wide Web (xe2x80x9cthe Webxe2x80x9d or xe2x80x9cWWWxe2x80x9d) and on similar network-based information retrieval systems in which network traffic and slow servers can lead to long delays in delivery of information. As standard page-based caching is technically impractical on the Web and tuple-based caching has certain limitations, much effort has been spent to cache user queries and answers for possible future reuse. Such techniques are described, for example, in Godfrey, P., and Gryz, J., xe2x80x9cSemantic Query Caching For Heterogeneous Databasesxe2x80x9d, in Proceedings of the 4th KRDB Workshop, xe2x80x9cIntelligent Access to Heterogeneous Informationxe2x80x9d, Athens, Greece, 1997, pp. 6-1 to 6-6.
Query caching is particularly advantageous when the user refines a query quite often, for example, by adding or removing a keyword. In this case, many answer tuples may already be cached and can be delivered to the user right away.
A Web-based query system can contact heterogeneous distributed data repositories or servers, can invoke so-called wrappers to convert user queries into a target query language appropriate for the servers, and can control the data flow from the servers. As data are usually transferred over the network in HTML format, the wrappers can also extract answer tuples from the retrieved HTML files before the final answer is reported to the user and stored in the cache.
A typical Web query is a conjunction of terms. Each term in the query can be a keyword, possibly negated with the operator NOT, and can be applied to one or more attributes (title, author, etc.). In most Web servers, the operator NOT is equivalent to AND NOT to force a query to contain at least one non-negated term.
Semantic caching can manage a client cache as a collection of semantic regions; access information is managed and cache replacement is performed at the unit of semantic regions. Such a technique is described in Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., and Tan, M., xe2x80x9cSemantic Data Caching and Replacementxe2x80x9d, in Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996, pp. 330-341. Semantic regions group together semantically related documents covered, for example, by a user query.
In semantic caching, each semantic region can have a constraint formula which describes its contents, a counter of tuples in the contents, a pointer to the set of actual tuples in the cache, and the additional information that is used by the replacement policy to rank the regions.
When a query is posed at a client, it can be split into two disjoint pieces: (1) the portion of the answer available in the local cache, and (2) a remainder query, which retrieves any missing tuples in the answer from the server. If the remainder query is not null (i.e., the query covers parts of the information space that are not cached), the remainder query is sent to the server and processed there.
The invention addresses problems that arise in using a cache that includes items of data for semantic regions.
Although a number of important principles of query caching have been proposed, conventional query caching techniques do not provide efficient methods for evaluating or comparing a query against cached items in obtaining an answer to the query. Moreover, query evaluation is reduced in some techniques to a Datalog query evaluation, which may be computationally hard.
The present invention provides a technique that alleviates problems in evaluating or comparing a query against cached items. The technique can be implemented in a system that includes memory with a set of cache locations. The cache includes, for each of a set of one or more semantic regions, contents with one or more items of data. The technique responds to a query by obtaining a query signature and by using the query signature and region signatures for at least one of the semantic regions in finding one or more semantic regions that are qualified and using the contents of at least one of the qualified regions to obtaining an answer to the query.
Signatures simplify operations that evaluate or compare a query with semantic regions to find regions that are qualified. As a result, the technique can be implemented to obtain simple and efficient query evaluation or comparison, making it possible to perform query evaluation in time that is not greater than linear with the number of cached items of data and may be less than linear.
The technique can be implemented using a system that caches Web queries. Each semantic region""s entry in the cache can include a signature for the region.
For a query or region formula that includes a conjunction of terms, the signature can be obtained by combining signatures for the terms. Each signature can be a binary string, and all can have the same length, so that they can be combined by logical operations.
A query signature can be compared to region signatures stored in the cache. If the comparison indicates that the query is equivalent to or contained within one of the regions, an answer can be obtained from cached items of data. In other cases, a criterion can be applied to find cached items of data that are qualified for the query, i.e., cached items of data that can be re-used immediately in obtaining a partial answer; missing information can be requested by sending a query remainder to a server. The criterion can be a region containment criterion or a criterion that is only met by regions that are likely to have a one-term difference from the query.
The technique can also be implemented in a system that includes a memory with a cache defined as described above and a processor connected for accessing the memory. The processor can obtain the query, use the query to obtain a query signature, use the query signature and at least one region signature to find one or more semantic regions that are qualified, and use the contents of at least one of the qualified regions to obtain an answer to the query, as described above. The system can also include one or more user interface devices through which the processor can provide signals to and receive signals from a user. The processor can obtain the query by receiving signals indicating the query from a user through one of the user interface devices and can also provide the answer to the query to the user through one of the user interface devices. The system can have a connection to the internet, and the query can be a Web query.
The technique can also be implemented in an article of manufacture that includes a storage medium and instruction data stored on the storage medium indicating a sequence of instructions. When the instruction data are accessed by a storage medium access device in a system, the system""s processor, in executing the instructions, can obtain the query, use query to obtain a query signature, use the query signature and at least one region signature to find one or more semantic regions that are qualified, and use the contents of at least one of the qualified regions to obtain an answer to the query, as described above.
The technique has three main advantages. First, it makes it possible to process both of the following cases in the same elegant wayxe2x80x941) when a query is contained in the cache, or 2) when it intersects some regions. As a result, the technique can avoid most cases of tuple duplications, and has a moderate storage requirement. Second, it supports efficient reporting of partial answers and generating of query remainders. Finally, it provides a simple solution for region coalescing and replacement policy problems.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.