1. Field of the Invention
The present invention relates to a system and a method for determining the importance of an information set in hyper-media information on a network, and to a recording medium which records an information set and an importance determination program thereof. More particularly, it relates to an information set importance determination system, an information set importance determination method, and a recording medium recording an information set and an importance determination program thereof which reduce the labor required for selecting information by presenting the page information needed by a user as higher-ranked information in a given page list.
2. Description of the Related Art
The recent remarkable spread of systems for sending and sharing information over computer networks makes it possible to use information in small-sized LAN (Local Area Network) environments, in medium-sized intranet environments and, moreover, in Japan-wide or world-wide environments. This information constitutes hyper-media information made up of pages, each of which is a unit of information, and links, which are relations among pages. Starting from an arbitrary page, large volumes of related pages can be traced one by one.
A representative example of such a hyper-media information system is the WWW (World-Wide Web), which has three components: HTML (Hypertext Mark-up Language), which defines a page description format; URL (Uniform Resource Locator), which defines the address of a page on a network; and HTTP (Hypertext Transfer Protocol), a communication procedure for downloading a page existing at the address indicated by a URL. A system for downloading and reading information pages is called a WWW browser (hereinafter referred to as a browser), and a system for managing information pages and transferring them through HTTP in response to a request from a browser is called a WWW server (hereinafter referred to as a server). The WWW has already become an indispensable application on the Internet, a network established all over the world, and is said to contain as many as several hundred million pages.
It is extremely difficult to extract desired information from pages that are as widely distributed and as numerous as those of the WWW, so various search engines have been developed and put into use. A search engine is composed of a crawler and a retrieval engine. The crawler collects pages existing on a network by repeating a series of operations: analyzing the HTML of a page on the network and obtaining the URL of the destination of each link it contains. The retrieval engine conducts a full-text search over the collected pages using, for example, a keyword.
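The crawler and retrieval engine described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems. The names `extract_links`, `crawl` and `keyword_search` are hypothetical, and the `fetch` argument is an assumed stand-in for an HTTP download, so that the sketch can be run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute destination URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Analyze the HTML of one page and return its link destinations."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(start_url, fetch):
    """Collect pages reachable from start_url.

    fetch(url) is an assumed callback returning the page HTML, or None
    when the page cannot be obtained.
    """
    collected = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


def keyword_search(collected, keyword):
    """Naive full-text search: URLs of pages whose source contains the keyword."""
    needle = keyword.lower()
    return sorted(url for url, html in collected.items() if needle in html.lower())
```

A practical crawler would additionally need download limits, politeness rules and an index structure for retrieval; the sketch only shows the repeated analyze-and-follow loop and a keyword match over the collected pages.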
Under these circumstances, when more than several million pages within Japan, for example, are collected and retrieval by a keyword is executed, most retrievals return on the order of hundreds to tens of thousands of results. Moreover, since the output retrieval results are not ordered by their addresses or contents, browsing them places a heavy burden on the user.
To make browsing more efficient, it is necessary to appropriately determine the structure of the information to be output to a user, to put the information in order, and to select and process it. In other words, what is needed is to set the range of an appropriate information set, to find the representative page and the main structure of the information set, and to adopt or reject and process each piece of information in the information set. Here, an information set represents a set of information (page information and link information) regarding a specific page set selected by search or the like, and is made up of a representative page and its member pages.
Several related techniques have been proposed from these viewpoints. They are described in the following.
The first conventional art is recited in Japanese Patent Laid-Open No. 10-069423. In this conventional information set importance determination system, a directory server which centrally manages page attribute values of hyper-media information existing on a network has a directory information storage unit for managing the page attribute values of the hyper-media information, a secondary information generation unit for generating secondary information from the attribute values, and a function generated by the secondary information. In this system, information sets of hyper-media are generated using a host name and a directory name. In addition, the obtained information sets are aggregated using a network structure such as a host name or a domain name. However, since this technique determines an information set by taking only the location of a page into consideration, appropriate determination might not be possible because page management policies vary from server to server. Also, since the information sets to be aggregated are determined using a network structure, appropriate determination might not be possible when many information sets exist in the same domain.
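The location-based grouping attributed to the first conventional art can be illustrated with the following Python sketch. It is an illustration under assumptions, not the implementation of the cited publication: the function names are hypothetical, and an information set is simply taken to be all pages sharing a host name and a directory name.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def information_set_key(url):
    """Return (host name, directory name) for a page URL."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path) or "/"
    if not directory.endswith("/"):
        directory += "/"
    return (parts.hostname or "", directory)


def group_by_host_and_directory(urls):
    """Generate information sets using only the location of each page."""
    groups = defaultdict(list)
    for url in urls:
        groups[information_set_key(url)].append(url)
    return dict(groups)


def aggregate_by_host(groups):
    """Aggregate the information sets by host name (a network structure)."""
    by_host = defaultdict(list)
    for (host, _directory), members in groups.items():
        by_host[host].extend(members)
    return dict(by_host)
```

Because the key depends only on where a page is stored, two unrelated documents kept in one directory fall into the same set, and one document spread over several directories is split, which is the weakness noted above.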
The second conventional art is recited in “Proceedings of the Sixteenth Annual International ACM SIGIR Conference”, pp. 116-125, June 1993. In this conventional information set importance determination system, the range of an information set is determined using linkage among pages, in particular the strength of a link. However, since the technique considers only application to hyper-media information edited by the same editor, and considers only relations among information pages in determining an information set, it is difficult to apply to hyper-media information on a network such as the WWW.
The third conventional art is recited in “Proceedings of 9th ACM Conference on Hypertext and Hyper-media 98”, pp. 297-298, June 1998. In this conventional information set importance determination method, not only is an information set generated using a host name and a directory name, but its representative page is also determined according to agreement with a file name prepared in advance. With this method, however, an information page whose file name is not registered in advance cannot be determined to be a representative page.
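The file-name matching attributed to the third conventional art can be illustrated with the following Python sketch. The list of registered file names and the function name are hypothetical examples chosen for illustration; the cited paper's actual list is not given here.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical file names registered in advance as representative pages.
REGISTERED_NAMES = ("index.html", "index.htm", "default.htm")


def choose_representative(urls, registered_names=REGISTERED_NAMES):
    """Return the first page whose file name agrees with a registered name.

    Earlier entries in registered_names take priority. None is returned
    when no page in the information set matches.
    """
    for name in registered_names:
        for url in urls:
            file_name = posixpath.basename(urlsplit(url).path).lower()
            if file_name == name:
                return url
    return None
```

For an information set whose entry page is named, for example, `top.html`, the function returns `None`, which corresponds to the limitation described above.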
The fourth conventional art is recited in Japanese Patent Laid-Open No. 10-105550. In this art, a layered structure among pages is defined at the time hyper-media information is generated, so that the information pages can be browsed with that layered structure in mind. The method, however, can use only the information structure of hyper-media information edited in advance with this art, and cannot use a structure among information pages edited by other existing methods.
The fifth conventional art is Excite Japan (http://www.excite.co.jp/), one of the WWW search engines put into practice in recent years. In this conventional art, keyword retrieval results to be output are grouped according to host name for the user's convenience when browsing the retrieval results. With this method, however, an information set and a host do not always coincide, and a home page is not presented unless it is included in the retrieval results. In addition, when large volumes of pages with nearly identical contents, such as messages of a mailing list, are included in the retrieval results, those pages are output as they are. Moreover, index pages, which would otherwise help users in browsing, are not output unless they are included in the retrieval results.
As indicated by the first and the second conventional art, the first problem is that, in determining the range of an information set, the range intended by the information creator cannot be accurately determined.
The reason is that only limited attribute information is used, such as only a host name and a directory name, or only linkage.
As indicated by the third and the fourth conventional art, the second problem is that the information creator is not able to determine the representative page of a target information set.
The reason is that each page has no attribute value indicative of such an information structure.
As indicated in the fifth conventional art, the third problem is that in a given set of information pages, there is no basis for adopting or rejecting pages.
The reason is that an information structure in an information set is not determined.