The present invention relates to a method of checking the validity of a set of digital information contained in cache means connected to an information source by comparing validity information associated with said set of digital information and validity information associated with corresponding digital information in said information source.
The invention also relates to a method of and an apparatus for retrieving digital information from an information source having cache means connected thereto.
Retrieving digital information from an information source, e.g. one or more databases, is widespread and already used by most companies deploying information technology equipment, e.g. in client-server systems connecting users to databases. Typically, the users request information regarding customers, products, logistics, economy or other subjects. The emerging e-trade solutions are also based on customers' client nodes connected to database servers via the Internet.
When a client retrieves data from an information source or remote databases the speed at which digital information or data are fetched is given by a number of parameters. Cached data that are located in close proximity to the client will be fetched fast because they do not have to travel a long way on a crowded network. Data that must be fetched from a remote database will have to be retrieved from the database disk and managed by the database server CPU. Thus the main parameters are distance, network bandwidth, database-server capability and database storage media access speed.
Client-server communication is often implemented suboptimally, as the software used is often designed for maximum functionality and flexibility, not necessarily optimum performance. Many tools will supply the requested information, yet at the same time consume substantial resources on the computers involved (CPU, network, etc.); e.g. many tools will start a new process to serve each user request.
The usual way to avoid requesting the same data twice is to use a cache. A cache is generally a temporary storage of information nearer to the point of use than the original location. For example, in a web system data files from the web server are typically cached on the client, e.g. a price list or a map showing the supplier's address. This is a standard feature of most HTML (HyperText Markup Language) browsers. The next time the client needs the information, it is accessed faster because it can be read from the cache, e.g. located on the client hard disk or in RAM, instead of having to be transferred across the network, e.g. the Internet.
One of the important issues of caching is the fact that the information stored in cache, e.g. on a client, will not automatically be updated upon server updates. Therefore, prior to retrieval of digital information located in a cache memory, a validity check has to be performed in order to determine whether the cached digital information is outdated or not. An implementation is described in patent application WO 97/21177. This document describes the use of time stamps on cache data and on index data in the database in order to perform the validity check. The validity check is performed by comparing a time stamp associated with a respective cache database entry and a time stamp associated with the index to the corresponding data entry in the master database. This document also describes different data locking methods to deal with real time update of data that is accessed by many clients.
Notifying the client that previously fetched data have been changed can also be performed by tracking the information on the server and sending notification about updated data to the client. Such a method is described in U.S. Pat. No. 4,714,992: “Communication for version management in a distributed information service”. This patent describes a system with an updated master database and a replica database holding the same information as the master database. Only the master database is updated and valid at all times. When the replica database is going to be updated, the replica database sends a query to the master database for identifiers of obsolete records. This allows the replica database to redirect client queries for the obsolete data to the master database until the replica database has been updated.
U.S. Pat. No. 5,842,216: “System for sending small positive data notification messages over a network to indicate that a recipient node should obtain a particular version of a particular data item” describes a system in which a small message is sent from the database server to the recipient notifying the recipient that data have been updated. The message includes a time stamp, the data location and a check sum of the data held in that particular location. Based on the time stamp, the recipient can determine whether updated data should be fetched. Based on the location and the check sum, the recipient can look for the data in a local cache if a cache is available.
Version control at the query level is described in U.S. Pat. No. 5,892,914: “System for accessing distributed data cache at each network node to pass requests and data”. This patent describes a method of connecting multiple servers each storing a fraction of a total cache, a Network Distributed Cache. When a client needs information, the client sends a query to one of the cache servers, and the query and the data held in that particular server are passed on to another server for completion. If all the requested data are fetched, the data are sent to the client. Otherwise, the query and the data are sent to the next cache server. This method results in a large number of version numbers, because the data object (e.g. a browser request from the client to the application server) is identified by all the parameters included in the request.
Considering a scenario where a query for products is made for a certain country, product group and date, the cache on the application server will have to contain data for all combinations of these parameters, a phenomenon that is known as the Cartesian product. For 50 countries, 100 product categories and 100 dates the cache would contain 50*100*100=500,000 entries. Generally, N parameters and M values per parameter result in M^N cache entries. This problem is inherent to all methods which identify a query result using a single version number.
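The count in the example above can be reproduced with a short Python sketch. The parameter domains below are placeholders sized as in the text (50 countries, 100 product categories, 100 dates); all names are illustrative assumptions and are not taken from any of the cited systems.

```python
from itertools import product

# Hypothetical parameter domains, sized as in the example in the text.
countries = range(50)
categories = range(100)
dates = range(100)

# A cache keyed on the full query needs one entry (and one version number)
# for every combination of parameter values, i.e. the Cartesian product.
cache_keys = set(product(countries, categories, dates))


def entries(values_per_parameter: int, parameters: int) -> int:
    """Entries needed for N parameters with M values each (M to the power N)."""
    return values_per_parameter ** parameters
```

Here `len(cache_keys)` gives the 500,000 entries of the example, and `entries(M, N)` covers the general case of N parameters with M values each.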
Version control at the database level, as described in U.S. Pat. No. 5,893,117: “Time-stamped database transaction and version management system”, deals with storing many versions of the same data entry, giving each entry a time stamp and building a data structure that allows the clients to track any version of a given data entry. The described system facilitates version control in a database environment that is updated simultaneously by many users, e.g. software development projects where a number of developers write new code for the same program. As a result, a large number of objects (records) has to be tracked. For similar requests, e.g. querying a product price for different categories and different dates in the same country, the update status would have to be queried for every category and date.
Further, application-specific programs have been developed which optimize requests by caching data in program memory. This approach has two disadvantages: it requires the application program to be present on the client (and therefore application-specific code to be loaded, as an applet, through the network) and it increases program complexity and thus development time.
An object of the present invention is to provide a method of checking the validity of a set of digital information contained in cache means connected to an information source, as described in the introductory part in claim 1, which enables a more rapid and less memory requiring validity check compared to known methods.
According to the invention the object is achieved by a method of the above-mentioned type which is characterized by
specifying two or more overlapping supersets of information having said set of digital information as a common subset; and
performing said validity check by comparing validity information associated with one or more of said supersets in said cache means and validity information associated with corresponding supersets in said information source.
Hereby, as validity information is not related to the single items but relates to supersets typically containing a plurality of items or elements, the amount of validity information to be stored and maintained is reduced compared to the use of prior art methods. As a result, the amount of memory needed for storing validity information in said information source and in said cache means is reduced, i.e. the use of the memory is optimized as more digital information and corresponding validity information can be stored in a given amount of memory. As a consequence, the maintenance of the reduced amount of validity information can be performed rapidly. In addition, compared to known methods, when said set of digital information in said cache means includes a plurality of elements, the amount of validity information to be transferred between said information source and said cache means during a validity check is reduced. Hereby, the information traffic between said information source and said cache means, e.g. via a relatively slow network, is reduced. Hence, the validity check can be performed rapidly compared to known methods.
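To give a feel for the size of this reduction, the following Python sketch compares the two bookkeeping schemes. It assumes, purely for illustration, that one superset is specified per parameter value (one per country, one per product category, one per date); the text does not prescribe this particular choice of supersets, and the function names are hypothetical.

```python
from math import prod


def per_item_entries(values_per_parameter):
    """Validity entries when every parameter combination is tracked separately."""
    return prod(values_per_parameter)


def per_superset_entries(values_per_parameter):
    """Validity entries when one superset is tracked per parameter value.

    Assumption: each value of each parameter defines exactly one superset.
    """
    return sum(values_per_parameter)


# Hypothetical domain sizes: 50 countries, 100 product categories, 100 dates.
sizes = [50, 100, 100]
```

Under this assumption `per_item_entries(sizes)` is 500,000 while `per_superset_entries(sizes)` is 250, since the number of validity entries grows with the sum rather than the product of the parameter domain sizes.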
An expedient embodiment of a method according to the invention is characterized in that said set of digital information is identified as valid when at least one of said supersets is valid. This embodiment of the invention is based on the fact that said set of information can be determined to be valid when a single one of the corresponding supersets is found to be valid, because said set of digital information in said cache means is a common subset, i.e. an overlap or intersection of said supersets. As a consequence, the validity check can be stopped as soon as a single superset is determined to be valid and hereby the speed of the validity check is optimized. As two or more overlapping supersets have been specified, the chance of finding at least one superset indicating the information to be valid is increased.
Advantageously, said one or more supersets to be used for said validity check are selected from said specified supersets on the basis of a priori knowledge of supersets least likely to be updated. In another advantageous embodiment, said one or more supersets to be used for said validity check are selected from said specified supersets on the basis of obtained knowledge of supersets having been updated least frequently. Both of these embodiments are advantageous as the validity check can be based on supersets which will often be found to be valid, and as a consequence the information can rapidly be concluded to be valid. Preferably, the supersets are selected to be used in a prioritised order, i.e. a superset least likely to be updated is used first in said validity check.
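The validity check and the prioritised selection of supersets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: validity information is modelled as an integer version number per superset, an update to an item is assumed to bump the version of every superset containing that item, and the names `InformationSource`, `record_update`, `snapshot` and `cached_set_is_valid` are hypothetical.

```python
from collections import defaultdict


def supersets_of(country, category, date):
    """Supersets whose intersection is the set identified by the three values."""
    return [("country", country), ("category", category), ("date", date)]


class InformationSource:
    """Toy source keeping one version number per superset (an assumption)."""

    def __init__(self):
        self.version = defaultdict(int)       # superset -> version number
        self.update_count = defaultdict(int)  # superset -> observed updates

    def record_update(self, country, category, date):
        # Assumed rule: an updated item invalidates every superset containing it.
        for superset in supersets_of(country, category, date):
            self.version[superset] += 1
            self.update_count[superset] += 1


def snapshot(source, supersets):
    """Validity information stored in the cache next to the cached set."""
    return {s: source.version[s] for s in supersets}


def cached_set_is_valid(cached_versions, source):
    """Return (valid, superset) and stop at the first unchanged superset.

    Supersets updated least frequently so far are checked first.
    """
    ordered = sorted(cached_versions, key=lambda s: source.update_count[s])
    for superset in ordered:
        if cached_versions[superset] == source.version[superset]:
            return True, superset
    return False, None
```

Under the assumed update rule the check is conservative: an unchanged superset guarantees that no item in the common subset has changed, whereas a cached set may be reported invalid even though its own items are untouched, if every one of its supersets has been updated through other items.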
As mentioned above, the invention also relates to a method of retrieving digital information from an information source having cache means connected thereto, said method comprising the steps of:
receiving a query specifying the digital information to be retrieved;
checking if said cache means holds query result information associated with said query, and in the affirmative performing a validity check of said query result information;
retrieving, if said cache means does not hold valid query result information associated with said query, valid query result information from said information source and updating said cache means with said retrieved valid query result information; and
presenting said valid query result information as a result of said query.
The method according to the invention is characterized in that said checking includes specifying two or more overlapping supersets of information having query result information associated with said query as a common subset; and said validity check is performed by comparing validity information associated with one or more of said supersets in said cache means and validity information associated with corresponding supersets in said information source.
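The retrieval steps listed above can be sketched as a small Python class. This is a hedged illustration only: the callables `fetch`, `current_version` and `supersets_for` stand in for the information source and for the specification of supersets, validity information is modelled as a version number per superset, and none of these names come from the text.

```python
class CachedRetriever:
    """Sketch of query retrieval with superset-based validity checking."""

    def __init__(self, fetch, current_version, supersets_for):
        self._fetch = fetch                      # query -> result from the source
        self._current_version = current_version  # superset -> version at the source
        self._supersets_for = supersets_for      # query -> supersets of its result
        self._cache = {}                         # query -> (result, {superset: version})
        self.source_fetches = 0                  # counts round trips for illustration

    def retrieve(self, query):
        # Check whether the cache holds a result for the received query.
        entry = self._cache.get(query)
        if entry is not None:
            result, cached_versions = entry
            # The cached result is valid if at least one superset is unchanged.
            if any(self._current_version(s) == v
                   for s, v in cached_versions.items()):
                return result
        # No valid cached result: record the validity information first, then
        # fetch from the source and update the cache. Reading versions before
        # fetching means a concurrent update can only make the entry look stale.
        versions = {s: self._current_version(s)
                    for s in self._supersets_for(query)}
        result = self._fetch(query)
        self.source_fetches += 1
        self._cache[query] = (result, versions)
        # Present the valid result.
        return result
```

A caller would construct the retriever once, e.g. with `supersets_for` mapping a (country, category) query to the two supersets "all items for that country" and "all items in that category", and then call `retrieve(query)` for each incoming query.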
Hereby, as mentioned above, the speed of the validity check may be improved because of the reduction of validity information for given query result information associated with said query which reduces the amount of validity information to be compared. The amount of validity information to be transmitted between said cache means and said information source in connection with validity checks and in connection with validity information updating in said cache means is reduced, resulting in a reduced load on the system. As a consequence, the overall speed of retrieval of digital information from the information source having cache means connected thereto is improved. By reducing the amount of validity information needed for a validity check, the memory required is also reduced.
An expedient embodiment of a method according to the invention is characterized in that query result information in a common subset is identified as valid when at least one of said supersets is valid. Hereby, the speed of the validity check is optimized as the validity check can be stopped, i.e. the information in said cache is found to be valid, as soon as a single superset is determined to be valid. As a consequence, the overall speed of retrieval of digital information from the information source having cache means connected thereto is improved.
Advantageously, said one or more supersets to be used for said validity check are selected from said specified supersets on the basis of a priori knowledge of supersets least likely to be updated. In another advantageous embodiment, said one or more supersets to be used for said validity check are selected from said specified supersets on the basis of obtained knowledge of supersets having been updated least frequently. These embodiments are advantageous as the validity check can be based on supersets which are often valid, and, as a consequence, the validity check can often be performed rapidly. As a consequence, the overall speed of retrieval of digital information from the information source having cache means connected thereto is improved even further.
Finally, the invention relates to an apparatus for retrieving digital information from an information source having cache means connected thereto, said apparatus comprising:
input means adapted to receive a query specifying the digital information to be retrieved;
checking means adapted to check if said cache means holds query result information associated with said query, and in the affirmative performing a validity check of said query result information;
updating means adapted, if said cache means does not hold valid query result information associated with said query, to retrieve valid query result information from said information source and to update said cache means with said retrieved valid query result information; and
output means adapted to present said valid query result information as a result of said query.
The apparatus according to the invention is characterized in that said checking means is adapted to specify two or more overlapping supersets of information having query result information associated with said query as a common subset; and to perform said validity check by comparing validity information associated with one or more supersets in said cache means and validity information associated with corresponding supersets in said information source.
An expedient embodiment of an apparatus according to the invention is characterized in that said checking means is adapted to identify query result information in a common subset as valid when at least one of said supersets is valid.
Advantageously, said checking means is adapted to select said one or more supersets to be used for said validity check from said specified supersets on the basis of a priori knowledge of supersets least likely to be updated. In another advantageous embodiment, said checking means is adapted to select said one or more supersets to be used for said validity check from said specified supersets on the basis of obtained knowledge of supersets having been updated least frequently.
It is noted that an apparatus according to the invention has the same advantages as mentioned in connection with the corresponding embodiments of the method.