1) Field of the Invention
The present invention relates to a technology for efficiently gathering and storing web pages.
2) Description of the Related Art
Today's internet offers various kinds of information, some of which may disappear when it is changed or moved. Recently, some developed countries have begun experimental efforts to gather, store, and permanently preserve such information on the internet as cultural property (see Nobuki Hirose, “Save the disappearing web-pages! The web archiving is changing”, Database No. 21, Japan Database Industry Association, Dec. 4, 2002 (“http://www.asahi-net.or.jp/~ax2s-kmtn/internet/dina.html”)). Another example of such activity is a web archiving system using a web robot, which is disclosed on the web site of “Way Back Machine” (http://www.archive.org/). The web robot gathers web pages on the internet and stores them in a web archive by performing a link analysis. When a web page stored in the web archive includes a link to another web page (hereinafter, “a linked web-page”), the web robot automatically analyzes the link, traces it, and gathers the linked web-page. In this manner, the web robot stores the linked web-pages sequentially.
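As a non-limiting illustration of the link analysis described above, the following minimal Python sketch traces links breadth-first and stores each gathered page. It is not taken from any cited system. The names `LinkExtractor`, `gather`, and `PAGES`, and the `example.com` URLs, are hypothetical, and an in-memory dictionary stands in for actual HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def gather(start_url, fetch):
    """Trace links breadth-first from start_url and return {url: html}.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None when the page cannot be obtained.
    """
    archive = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in archive:
            continue  # already stored; avoid gathering the same page twice
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page that contains them.
            queue.append(urljoin(url, href))
    return archive


# Hypothetical in-memory "web" used in place of real network access.
PAGES = {
    "http://example.com/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": "<p>no links</p>",
}

archive = gather("http://example.com/", PAGES.get)
print(sorted(archive))
```

In this sketch, only links that appear as static `href` attributes are traced, so the archive contains exactly the pages reachable through such links.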
However, although the web robot can analyze a link described in an HTML file, it cannot analyze a link that exists in various types of word-processing documents, application data, or multimedia data on the internet. Moreover, even if the link is described in an HTML file, the web robot cannot analyze the link when it is generated dynamically by a script. In such cases, the web robot has difficulty gathering the linked web-page automatically.
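The limitation regarding dynamically generated links can be illustrated with the following minimal Python sketch, which assumes a robot that only parses static HTML and does not execute scripts. The names `AnchorCollector`, `static_links`, and `PAGE`, as well as the page content, are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def static_links(html):
    """Return the links visible to a parser that does not run scripts."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


# Hypothetical page: one static link, and one link that exists only
# after a browser executes the script.
PAGE = """
<html><body>
<a href="static.html">static link</a>
<script>
  var name = "dyn" + "amic";
  document.write('<a href="' + name + '.html">dynamic link</a>');
</script>
</body></html>
"""

found = static_links(PAGE)
print(found)
```

Because the parser treats the contents of the `<script>` element as opaque text, the link to `dynamic.html` never appears in `found`, and a robot relying on this kind of analysis would not gather that page.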
Consequently, the web pages stored in the web archive are still missing a lot of information that the web robot failed to gather (hereinafter, “missed information”). Because there is no way to detect whether the information was gathered successfully, people need to recover the missed information by inspecting the contents of each gathered web page one by one while checking the web pages stored in the web archive.
With the conventional technology, it is difficult to find the right place from which the missed information can be acquired. As a result, a missed web-page, that is, a web page omitted during gathering, cannot be acquired efficiently.
To gather web pages whose links exist in various types of word-processing documents, application data, or multimedia data, the data needs to be opened using a corresponding application. The link can be acquired when the data is opened and the link is displayed in the data. On the other hand, if the data cannot be opened and the link is not displayed in the data, the link cannot be acquired and the missed information cannot be recovered.
Moreover, when the information about an in-line image (such as a static image or a moving image) generated by a script in the HTML file is missing, the location to which the in-line image is linked must be inferred by referring to the script described in the source of the HTML file. Additionally, to gather a script web-page, which is a web page that is linked and generated by a script in the HTML file, the script web-page needs to be acquired from the uniform resource locator (URL) that the web browser displays when the link is clicked.