Recently, concurrent with the development of services on the Internet, an enormous amount of information has become available to users. It has therefore become ever more important to develop a search technique whereby the information desired by users can be extracted from the huge reservoir of presently available data and rearranged, as quickly and accurately as possible, into a form that users find easy to handle.
A conventional technique that takes references between sites into account is disclosed in reference 1 (“Authoritative Sources In A Hyperlinked Environment”, J. Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms; also mentioned in IBM Research Report RJ 10076, May 1997). According to this technique, an importance level is calculated from the reference relations (support) found in the static link structure of the Internet at a specific moment. In this case, pages that are authoritative with respect to a designated search expression (Authorities), and pages that link to many authoritative pages (Hubs), are extracted. Another technique takes references into account at the word level; for example, topic words are extracted and clustered, and the articles related to each resulting cluster are displayed.
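The mutual reinforcement between Authorities and Hubs described in reference 1 can be illustrated with a minimal sketch. The function name `hits`, the dictionary representation of links, and the fixed iteration count are assumptions made for illustration; this is not the implementation of reference 1, which also builds a query-specific subgraph before iterating.

```python
from math import sqrt

def hits(links, iterations=50):
    """Iteratively score pages as authorities and hubs.

    links: dict mapping a page to the list of pages it links to
    (a hypothetical in-memory stand-in for a static link structure).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores
        # of the pages that link to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}
        # A page's hub score is the sum of the authority scores
        # of the pages it links to.
        new_hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical link structure: two hub-like pages pointing at two targets.
example_links = {"h1": ["a", "b"], "h2": ["a"], "a": []}
authority_scores, hub_scores = hits(example_links)
```

Because the scores depend only on the link structure at the moment it was captured, such a calculation does not by itself reflect information that changes over time.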
There is also a technique for annotating web pages, whereby a database is searched by using, as keys, words that appear on a web page, and references to related information or services are provided. Portal sites, which provide search facilities and information services such as news, employ, for example, keyword rankings that correspond to the topics selected by searchers, and present users with manually prepared keywords for currently popular topics.
As described above, automatically extracting and rearranging relevant data concerning current topics, so that the data can be readily displayed and referenced, is very useful, and several conventional proposals are available for this purpose. According to the conventional techniques, however, the collection and preparation of information are manual rather than automatic processes, and references to information or services are provided based purely on words; referencing based on facts (sentences) is not satisfactorily performed. For example, for text such as “A company announced the Autumn model of a personal computer running Linux”, annotations can be provided for individual words or word sets, such as “A company”, “Linux” and “personal computer”; however, no annotation can be provided concerning the facts contained in the text.
Further, on the Internet there are many sites, such as news sites and technical information sites, that provide and transmit high-quality information. However, the information transmitted by each site differs in the range covered, in the amount and quality of the information, and in the criteria used to select it, so that satisfactorily objective information cannot always be obtained by sampling the data available at a single site. Newly arrived information (the set of information elements that have appeared at a site since a specific time) can be collected by periodically crawling registered URLs; however, when multiple sites are registered, the total amount of information available at these sites is overwhelming, and it is difficult to read all of it within a short time. For example, when 20 IT-related sites are registered as crawling destinations, the newly arrived links could total about 800 in four days, a volume too large for a user to read easily.
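The collection of newly arrived information can be sketched as a comparison between successive crawl results. In the sketch below, the function name `newly_arrived`, the `(url, title)` tuple layout, and the example URLs are assumptions for illustration, and the crawl itself is replaced by in-memory lists. The last lines only restate the arithmetic of the example figures given above.

```python
def newly_arrived(seen_urls, snapshot):
    """Return the links in a crawl snapshot that were not seen before.

    seen_urls: set of URLs already observed (updated in place).
    snapshot:  list of (url, title) pairs from one crawl of the
               registered sites (a hypothetical data layout).
    """
    fresh = []
    for url, title in snapshot:
        if url not in seen_urls:
            seen_urls.add(url)
            fresh.append((url, title))
    return fresh

seen = set()
first_crawl = [("http://site-a.example/1", "Story 1"),
               ("http://site-a.example/2", "Story 2")]
second_crawl = [("http://site-a.example/2", "Story 2"),
                ("http://site-a.example/3", "Story 3")]
newly_arrived(seen, first_crawl)                    # everything is new on the first crawl
later_arrivals = newly_arrived(seen, second_crawl)  # only Story 3 is new

# Rough scale of the example above: about 800 links, 20 sites, 4 days.
links_per_site_per_day = 800 / (20 * 4)
```

Even at roughly ten new links per site per day, the combined list grows faster than a user can read it, which is the problem the following paragraphs address.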
To resolve this problem, a method can be adopted in which an importance level, defined by specific criteria, is assigned as a weight to each information element, and the display method is varied accordingly so that relevant information is easier to identify. As one such method, anchor (link) information (a URL and its title) and a text block are obtained from sites that use various display forms, and are standardized so that the information obtained from multiple sites can be handled uniformly. However, since standardization removes the rendering information used at individual sites, such as font sizes and display positions, importance levels cannot be determined from the rendering information on which such determinations generally rely.
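The standardization of anchor information, and the loss of rendering information that accompanies it, can be sketched with the Python standard library. The class name `AnchorExtractor`, the helper `extract_anchors`, the output dictionary layout, and the sample HTML are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorExtractor(HTMLParser):
    """Collect (URL, title) pairs from anchors, ignoring all styling."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links so every site yields absolute URLs.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so titles are comparable across sites.
            title = " ".join("".join(self._text).split())
            self.anchors.append({"url": self._href, "title": title})
            self._href = None

def extract_anchors(html, base_url):
    parser = AnchorExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.anchors

page = ('<h1><a href="/news/1"><font size="7">Big  story</font></a></h1>'
        '<small><a href="http://other.example/x">Small note</a></small>')
anchors = extract_anchors(page, "http://site.example/")
```

After extraction, the headline and the small-print link have exactly the same shape, so nothing in the standardized records indicates which of the two the site displayed more prominently.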
Visual representation of importance levels is easily understood by human beings; however, since HTML permits many different descriptive methods, it is not easy to calculate the importance levels of information elements automatically. Further, even when importance levels can be calculated, the evaluation criteria apply only to the specific sites concerned and in general have only limited applicability. In particular, information such as advertisements and special notices tends to be displayed at prominent locations at individual sites, yet is important only for those sites. Therefore, it is difficult to extract generally important information by referring to the information at a single site.
Additionally, a method may be employed by which information is judged by its timeliness rather than its importance level. However, the immediate topicality of such information does not always match the importance level of the information.
Further, when information elements that convey the same information can be identified, their importance levels can be calculated by determining whether they are carried by multiple sites; however, it is difficult to extract elements that convey the same facts. The simplest method is one by which a determination is made as to whether the character strings of the titles of the elements resemble each other. However, many variations are used in sentences that convey the same facts, and relying merely on whether character strings match is not always a satisfactory solution. For example, expressions such as “in this year, November, the following November”, or “a notebook PC, a B5 notebook, a PC” have many variations even though the intent is to express the same facts and the same concepts, so that a determination based only on whether the character strings of information elements match is not appropriate for extracting information elements that convey the same facts.
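The simple title-matching method, and its weakness, can be sketched as follows. The use of `difflib.SequenceMatcher`, the threshold of 0.8, the function names, and the sample titles are all assumptions chosen for illustration; the text above does not specify a particular similarity measure.

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Character-level similarity of two titles, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def site_support(elements, threshold=0.8):
    """Count, for each element, the distinct sites carrying a similar title.

    elements:  list of (site, title) pairs (hypothetical layout).
    threshold: assumed cut-off above which two titles are treated as
               conveying the same information.
    """
    support = []
    for _, title in elements:
        sites = {s for s, t in elements
                 if title_similarity(title, t) >= threshold}
        support.append((title, len(sites)))
    return support

elements = [
    ("siteA", "Company X announces new notebook PC"),
    ("siteB", "Company X announces new notebook PC"),
    ("siteC", "B5 notebook unveiled by Company X"),
]
support = site_support(elements)
```

In this sample the first two titles support each other, but the third, which is worded differently, is counted as appearing at one site only even though it may report the same fact. This is the limitation of pure character string matching noted above.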
In addition, depending on the type of site referred to, there may be few or no information elements that convey the same facts. In this case, since a set of important information elements cannot be extracted, the acquisition of information cannot be made more efficient. However, although the notion is less strict than the extraction of sets of important information elements, it would be a useful service for a user if, instead of concentrating on newly available information, important articles could be obtained by selecting a group of sites whose content matches the taste of the user and using that group as a filter.
The extracted important information elements can be used to generate a summary of a group of sites, and the same approach can also be applied to a single site. In particular, when an arbitrary number of site groups is employed instead of a single group, important information elements extracted in accordance with various preferences can be displayed. Further, it would be very valuable if the latest important information were provided as an annotation while various on-line documents are being referred to. For example, it would be very useful for a user if the latest information about ThinkPad (IBM trademark), such as “2000 Oct. 18 Announced an office-use notebook PC ‘ThinkPad21’ having an enhanced function”, were dynamically provided while an old article about the machine is being referred to.
To resolve the above described problems of the conventional techniques, it is one object of the present invention to provide a user with valuable information by periodically observing multiple information sources that change dynamically.
It is another object of the invention to extract elements that convey the same facts, or sets of important words that are referred to on multiple sites, and to present these elements visually, so that information can be provided to users in an easily identified form.