1. Field of the Invention
The present invention relates to an information filtering apparatus, and more particularly, to an information filtering apparatus for collecting and filtering information relating to a specific theme from the Web and the like, exhaustively and efficiently.
2. Description of the Related Art
A great variety of information exists on the Web, and a user must spend a great amount of time and effort to filter and collect, from that information, the information that satisfies his/her request. Conventionally, it has been proposed to reduce the burden of information collecting work by using the history of a user's past information filtering or by contriving a technique for filtering information.
Patent Literature 1 proposes a learning apparatus that automatically learns individual user preferences for information provided from electronic information media and the like, from a user's actual evaluation values, and preferentially presents information suitable for individual users by use of the learning result.
Non-Patent Literature 1 proposes a method for filtering information required by a user by repeating, for a document collection, a cycle in which the system clusters the documents based on their content, the user selects one or more of the clusters, the system integrates and re-clusters the selected content, and the user again selects a cluster.
    Patent Literature 1: Japanese Published Unexamined Patent Application No. H09-54780
    Non-Patent Literature 1: Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” Proc. of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329, 1992.
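For illustration only, the cluster-and-select cycle described for Non-Patent Literature 1 may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of that literature: the function names (vectorize, cosine, scatter, gather, scatter_gather) are hypothetical, the seed-based grouping is a toy stand-in for the clustering actually used there, and the choose callback stands in for the user's selection.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scatter(docs, k):
    """System side: split docs into at most k clusters by content.

    Assumption: the first k documents act as seeds and every document
    joins its most similar seed (ties go to the earlier seed).
    """
    seeds = [vectorize(d) for d in docs[:k]]
    clusters = [[] for _ in seeds]
    for doc in docs:
        vec = vectorize(doc)
        best = max(range(len(seeds)), key=lambda i: cosine(vec, seeds[i]))
        clusters[best].append(doc)
    return [c for c in clusters if c]  # drop empty clusters


def gather(clusters, chosen):
    """System side: integrate the clusters the user selected."""
    return [doc for i in chosen for doc in clusters[i]]


def scatter_gather(docs, k, choose, rounds):
    """Repeat clustering and selection; `choose` returns cluster indices."""
    for _ in range(rounds):
        clusters = scatter(docs, k)
        docs = gather(clusters, choose(clusters))
    return docs
```

In this sketch, each round narrows the collection to the clusters returned by choose, and every document in a selected cluster is carried forward to the next round unchanged.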
The learning apparatus of Patent Literature 1 eliminates the need to consider what preference a user has for the required information, so that the information required by the user can be selected easily from a plurality of pieces of information. However, a large number of pages with overlapping content exist on the Web. The learning apparatus of Patent Literature 1 does not take this into consideration, and therefore, when information is collected from the Web by use of this apparatus, a large number of pieces of information, including pages with overlapping content, are presented. Because the user must ultimately browse all of the presented information to select the required information, there is a problem that the burden of information collecting work on the user increases as the number of pieces of information presented increases.
According to the method of Non-Patent Literature 1, a user can narrow down similar documents by using the clustering performed by the system and then select the required information. However, in this method as well, as in the learning apparatus described in Patent Literature 1, the existence of documents with overlapping content is not taken into consideration, and therefore there is a problem that the burden on the user in the final information filtering increases as the number of documents increases.