The present invention relates to a retrieval system for data distributed on a network.
Many retrieval systems that use robots exist on the network, such as Altavista (http://www.altavista.com/), Lycos (http://www.lycos.com/), and excite (http://www.excite.com/). In these systems, a robot is software that mechanically collects information on the network. The collected data are subjected to index table generation: web page data are subjected to morphological analysis, and an index table is prepared and stored in a database. Users can then retrieve desired data from the database.
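The index table generation described above can be illustrated with a minimal sketch. It assumes that a simple lowercase word split stands in for morphological analysis, and that an in-memory dictionary stands in for the database. The function names and example URLs are hypothetical and are not taken from any of the systems named above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed stand-in for morphological analysis: lowercase word split."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an index table mapping each word to the URLs of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Hypothetical collected page data (URL -> page text).
collected = {
    "http://example.com/a": "Distributed data retrieval on a network",
    "http://example.com/b": "A robot collects data from the network",
}
index = build_index(collected)
```

In this sketch, looking up `index["robot"]` yields only the second page, while `index["data"]` yields both.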
The robot searches for text written in HTML (HyperText Markup Language) or plain text and traces the link destinations described in that text to collect data present on the network. In index table generation, the robot uses, as a retrieval target, either the full text or a part such as a title or a URL.
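The link tracing performed by a robot can be sketched as a breadth-first traversal. The following is only an illustration under stated assumptions: pages are supplied by a caller-provided `fetch` function (here a dictionary lookup rather than real network access), and the class name, function name, and URLs are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in an HTML text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Trace link destinations breadth-first, collecting up to `limit` pages."""
    seen = {start_url}
    queue = deque([start_url])
    collected = {}
    while queue and len(collected) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be obtained; skip it
            continue
        collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected


# Hypothetical in-memory site used in place of the network.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/missing.html">gone</a>',
}
crawled = crawl("http://example.com/", SITE.get)
```

The `seen` set keeps the robot from revisiting a page when links form a cycle, as between the top page and `a.html` in this example.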
Because the database holds a very large quantity of data, it may be a distributed database. In that case, the database is divided simply because of the quantity of data, and is not divided for any specific purpose.
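A division made purely by quantity can be pictured with the following minimal sketch, which assumes that each partition simply holds a fixed number of records in arrival order. The function name and the capacity value are hypothetical.

```python
def split_by_quantity(records, capacity):
    """Divide records into partitions of at most `capacity` items each,
    in arrival order and without regard to their contents."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [records[i:i + capacity] for i in range(0, len(records), capacity)]


# Five hypothetical records divided into partitions of two.
partitions = split_by_quantity(["p0", "p1", "p2", "p3", "p4"], 2)
```

Because the split ignores content, related records may land in different partitions, and a query cannot be directed to one partition in advance.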
The above retrieval is performed using a keyword. That is, a word expected to be contained in the text to be searched is input to retrieve the target text.
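Keyword retrieval against an index table can be sketched as follows. This is an illustration only: it assumes the index table is a dictionary mapping each word to a set of page identifiers, and that multiple keywords are combined with AND. The function name and data are hypothetical.

```python
def search(index, query):
    """Return the pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical index table (word -> pages containing the word).
index_table = {
    "robot": {"page1", "page2"},
    "network": {"page2", "page3"},
}
hits = search(index_table, "robot network")
```

A keyword that does not appear in the index table yields an empty result rather than an error.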
A mirror site may be provided to relieve the concentration of access to a popular site and to reduce traffic. For example, in the I-Server (http://www.pointcast.com/products/iserver.html) available from Point Cast Network (PCN), data are periodically prefetched to the PCN main office to manage the mirror site.
A conventional retrieval system for data distributed on a network poses the following problems.
(1) It tends to be difficult to handle an increasing quantity of data.
For example, the number of pages on the WWW (World Wide Web) is estimated to be approximately 40,000,000 or more, and this number is expected to increase exponentially in the future. At present, both the number of pages and the quantity of data per page tend to increase greatly.
When such greatly increasing data are simply divided according to their quantity, the database becomes very difficult to manage.
(2) It is difficult to handle data having a high update frequency.
Data updated several times a day may fall outside the range of robot retrieval targets in the current retrieval system for the following reason. Even if the frequently updated data are collected by the robot and subjected to index table generation, the data are often updated again before they are retrieved. In this case, when the user checks a page appearing in the retrieval result, the page may already be missing, or its contents may have changed entirely. As a result, data whose contents differ from what the user intended may be displayed.
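The mismatch between indexed data and current data can be illustrated with a minimal sketch. It assumes that a fingerprint of the page contents is recorded at index table generation and compared with the contents found when the user later opens the page. The function names and the sample texts are hypothetical and do not describe any existing retrieval system.

```python
import hashlib


def fingerprint(text):
    """Digest of the page contents, recorded at index table generation."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(indexed_fingerprint, current_text):
    """Compare the indexed state of a page with its current state."""
    if current_text is None:
        return "missing"   # the page no longer exists
    if fingerprint(current_text) != indexed_fingerprint:
        return "changed"   # the contents were updated after indexing
    return "fresh"         # the index still matches the page


# Hypothetical page indexed in the morning and updated before retrieval.
indexed = fingerprint("Morning headlines")
status = classify(indexed, "Evening headlines")
```

For a page updated several times a day, a check of this kind would report "changed" for most retrievals, which is the situation described above.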