Conventionally, a web crawler is a software component configured to make the rounds of information resources on a network such as the Internet or an enterprise network while following links to collect webpages regularly, so as to create a database and an index for a search engine. Normally, the web crawler keeps, as configuration information, URL information on an information resource serving as a starting point for the collection and a URL pattern limiting the range of URLs as a collection target.
According to conventional techniques, before operating the crawler, an administrator examines the configuration of the target web site, sets URL patterns as a collection rule, and explicitly classifies URLs into those allowed as collection targets and those forbidden (Non-Patent Document 1, Non-Patent Document 2). The web crawler then follows links, judging in accordance with the collection rule set by the administrator whether the URL of each link destination included in an acquired webpage is allowed, and thereby collects webpages. The web crawler further makes the rounds regularly to update the database and the index.
When a webpage that is not explicitly designated by the collection rule is reached through a link or a transfer, and that destination page is to be collected, the administrator manually adds a rule to the above-stated collection rule, for example during maintenance, so as to include the destination page in the collection target.
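The rule-based collection described above can be illustrated with a minimal sketch. It assumes glob-style allow and forbid patterns and a toy in-memory site in place of real HTTP fetches; the names `ALLOW_PATTERNS`, `FORBID_PATTERNS`, `is_allowed`, `crawl`, and `SITE`, as well as the example URLs, are hypothetical and do not come from the cited documents, whose products define their own pattern syntax.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical collection rule: glob-style URL patterns (an assumption;
# real crawlers each define their own pattern syntax).
ALLOW_PATTERNS = ["http://www.example.com/docs/*"]
FORBID_PATTERNS = ["http://www.example.com/docs/private/*"]


def is_allowed(url, allow=ALLOW_PATTERNS, forbid=FORBID_PATTERNS):
    """Return True if url matches an allow pattern and no forbid pattern."""
    if any(fnmatchcase(url, p) for p in forbid):
        return False
    return any(fnmatchcase(url, p) for p in allow)


def crawl(seed, fetch_links, allowed=is_allowed):
    """Follow links breadth-first from seed, collecting only allowed URLs.

    fetch_links(url) is a caller-supplied function returning the link
    destinations found in the page at url.
    """
    collected, queue, seen = [], deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        collected.append(url)
        for link in fetch_links(url):
            # Judge each link destination against the collection rule.
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return collected


# Toy in-memory site standing in for real HTTP fetches.
SITE = {
    "http://www.example.com/docs/index.html": [
        "http://www.example.com/docs/a.html",
        "http://www.example.com/docs/private/secret.html",
        "http://other.example.org/b.html",
    ],
    "http://www.example.com/docs/a.html": [
        "http://www.example.com/docs/index.html",
    ],
}

pages = crawl("http://www.example.com/docs/index.html",
              lambda u: SITE.get(u, []))
```

In this sketch the page on `other.example.org` is skipped because no allow pattern covers it, which is the situation in which an administrator would conventionally have to add a rule by hand.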
As described above, the conventional web crawler uses the URL information on an information resource used as a starting point for the collection and the URL pattern limiting the range of URLs as a collection target, so that the target range of URLs can be limited. Another known method of limiting the range of information resources on a network is based on the number of links along the link path or the number of hops.
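Limiting the range by the number of hops can be sketched as a breadth-first traversal that stops expanding pages at a fixed depth. This is only an illustration under stated assumptions: the function name `crawl_within_hops`, the `max_hops` parameter, and the toy `LINKS` graph are hypothetical and are not taken from any cited document.

```python
from collections import deque


def crawl_within_hops(seed, fetch_links, max_hops):
    """Return {url: hops} for every page reachable within max_hops links."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not expand pages that sit at the hop limit
        for link in fetch_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


# Toy link graph standing in for real pages (hypothetical).
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
reach = crawl_within_hops("A", lambda u: LINKS.get(u, []), max_hops=2)
```

With `max_hops=2`, page `E` (three links from `A`) falls outside the range, regardless of which server or domain it belongs to.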
For instance, for the purpose of rating and filtering of a target page using link path information effectively and appropriately, Japanese Patent Application Publication No. 2003-248696 (Patent Document 1) discloses the technique of storing, in a DB unit, hyperlink information including link path information coupling URLs of the respective reference pages, making a path search unit search the link path information stored in the DB unit based on a target page, making a page score calculation unit perform rating as to whether the target page agrees with a predetermined standard with reference to the link path information stored in the database, and performing filtering of the target page based on this rating result.
[Patent Document 1] Japanese Patent Application Publication No. 2003-248696
[Non-Patent Document 1] N. Alur, T. J Brown, C. Delgado, R. Isaacs, M. Przepiorka, Redbooks “WebSphere Information Integrator OmniFind Edition: Fast Track Implementation”, “Appendix A. Template for topology and configuration information”, “Crawler properties templates”, “Web crawler properties template”, pp. 566-570, [online], published on Jul. 18, 2005, International Business Machines Corporation, [searched on Sep. 29, 2008], Internet<URL=http://www.redbooks.ibm.com/redbooks/pdgs/sg246697.pdf>
[Non-Patent Document 2] “Administering Crawl for Web and File Share Content”, “Preparing for a Crawl”, “Configuring a Crawl”, [online], published in July, 2007, Google Inc. [searched on Sep. 29, 2008], Internet<URL=http://code.google.com/apis/searchappliance/documentation/50/admin_crawl/Preparing.html#confh1>
As described above, a web crawler makes the rounds of information resources and collects webpages in accordance with a range of collection targets specified by a collection rule, thus keeping a database and an index up to date to enable a search by end-users. However, when there is a link to a webpage not designated explicitly by the collection rule, or a transfer to such a webpage occurs, the administrator has to recognize the occurrence of the link or the transfer and then manually set the collection rule to make the destination page a collection target, as illustrated in FIG. 12, for example. This increases the burden on the administrator of maintaining the collection rule.
Further, when the setting of the collection rule is changed as described above, a rule that is not specified in sufficient detail might cause the collection range to include unnecessary files as well. Conversely, when the collection rule is set in enough detail to exclude such unnecessary files, the resulting rule becomes complicated, again increasing the burden on the administrator of maintaining the collection rule. In addition, setting the configuration information of a crawler presupposes an understanding of the site configuration that is thorough enough to distinguish necessary pages from unnecessary ones.
As illustrated in FIG. 13, a web site may contain a webpage including a frame that directly outputs a webpage on another server. In such a page configuration, in order to make an information resource on the other server a collection target, the administrator is required to acquire the URL of each frame and set a collection rule for it. In general, only the URL of the frame set is displayed in the address bar of a browser, and therefore the source of the webpage has to be viewed, or communication analysis has to be performed, before the collection rule can be added, which requires labor on the part of the administrator.
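To illustrate why the page source has to be inspected, the following minimal sketch extracts the URLs of the individual frames from a frame-set page using the Python standard library. The class name `FrameSrcParser`, the sample markup in `PAGE`, and the example URLs are hypothetical; handling `iframe` elements in addition to `frame` elements is an assumption added for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect absolute URLs of <frame>/<iframe> elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative frame URLs against the frame-set URL.
                self.frame_urls.append(urljoin(self.base_url, src))


# Hypothetical frame-set page; only its own URL would appear in the
# browser's address bar.
PAGE = """
<html><frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="http://app.example.org/report/main.html">
</frameset></html>
"""

parser = FrameSrcParser("http://www.example.com/portal/index.html")
parser.feed(PAGE)
frame_urls = parser.frame_urls
```

Here the second frame resolves to a different server, so a collection rule written only for `www.example.com` would not cover it.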
Further, when the setting of the collection rule is changed manually as described above, the following problem might occur. As illustrated in FIG. 14, a change in the site configuration may lower the relevance between a page that was made a collection target by a newly added rule and the pages that were originally and explicitly included in the collection range. Even then, the unnecessary pages continue to be collected until the additional rule is appropriately changed or deleted. As a result, processing resources that should be allocated to collecting necessary pages are wasted, and the efficiency of information collection by the conventional crawler is degraded. Moreover, in order to change or delete the added rule, the administrator has to monitor changes in the site configuration and perform the change or deletion manually, which further increases the load on the administrator of maintaining the collection rule.
In order to specify the range of information resources on a network, the technique disclosed in the above-stated Patent Document 1 is available, for example. This technique records the links among all pages and decides a target of filtering using the number of pages or the number of links needed to reach a target page. In the technique of Patent Document 1, whether a page is reachable is judged based only on the number of links or the number of hops, so that a domain configuration such as that of an in-house network cannot be taken into account. Further, the overall link configuration has to be kept for the judgment, which requires a large amount of resources, so that this technique is insufficient in terms of processing efficiency for specifying the range of target information resources.
That is, it has been desired to develop a web crawler capable of expanding the collection range up to a flexible and appropriate range and coping with a change that might involve a change in the information resources to be included in the collection target, such as a change in the site configuration.