1. Field of the Invention
This invention relates to an apparatus, method, and computer program product for information filtering in a computer system receiving a data stream from a computer network.
2. Description of the Relevant Art
Recent developments in computer networking, particularly with regard to global computer internetworking, offer vast amounts of stored and dynamic information to interested users. Indeed, some estimate that hundreds of thousands of news articles stream through the global internetwork each day, and that the total number of files transferred through the global internetwork (hereinafter "network") is in the millions. As computer technology evolves, and as more users participate in this form of communication, the amount of information available on the network will be staggering.
Although databases are relatively static and can be searched using conventional network search engines, current information filtering schemes are ill-suited to thoroughly search the massive, dynamic stream of new information passing through the network each day.
Presently, the information is organized, if at all, to the extent that only skilled, persistent, and lucky researchers can ferret out meaningful information. Nevertheless, significant amounts of information may go unnoticed. For example, because most existing information filtering schemes focus on locating textual articles, information in other forms--visual, audio, multimedia, and patterned data--may be overlooked completely. From the perspective of some users, a few items of meaningful "information" can be obscured by the volume of irrelevant data streaming through the network. Often, the information obtained is inconsistent over a community of like-minded researchers because of the nearly-infinite individual differences in conceptualization and vocabulary within the community. These inconsistencies exist with both the content of the information and the manner in which a search for the content is performed. Furthermore, the credibility of the author, and the accuracy and quality of a given article's content, and thus the article's "usefulness," often are questionable.
The problem of information overload can be more acute for persons involved in multidisciplinary endeavors, e.g., medicine, law, and marketing, who are charged with monitoring developments in diverse professional domains. There are many reasons why users want to communicate with each other about specific things as they find networked resources. However, drawing attention to articles of common interest to a community of researchers, or workgroup, often requires a separate intervention, such as a telephone call, electronic mail, and the like.
Often, membership in a workgroup or community is sharply defined, and workers in one physical community may be unaware of interesting developments in other workgroups or communities, whether or not the communities are similar. This isolation may be at the expense of serendipitous discoveries that can arise from parallel developments in unrelated or marginally-related fields.
Adding to the complexity of the information filtering problem is that an individual user's interests may shift over time, as may those of a community, and many existing information filtering schemes are unable to accept shifts in the individual's interest, the community's interest, or both. Furthermore, information flow usually is unidirectional to the user, and little characterization of secondary user, or group, interests, e.g., the consumer preferences of users primarily interested in molecular biology or oenology, is derived and used to provide targeted marketing to those users/consumers, and to follow changing demographic trends.
Typically, identifying new information is effected by monitoring all articles in a data stream, selecting those articles having a specific topic, and searching through all of the selected articles, perhaps thousands, each day. One example is where users interact with a web browser to retrieve documents from various document servers on the network. Given the increasing impracticality of this brute-force approach, the heterogeneous nature of "information" on the global internetwork, and the growing complexity of social interactions that are evolving concurrently with networking technology, there have been several attempts to address some of the foregoing problems by using adaptive information filtering systems.
In one approach, the information filtering is geared toward content-based filtering. Here, the information filtering system examines the user's patterns of keywords, and semantic and contextual information, to map information to a user's interests. This approach does not provide a mechanism for collaborative activities within a group.
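As a rough illustration of content-based filtering of this kind, the following minimal sketch scores an article against a user's keyword profile using bag-of-words cosine similarity. The function names, the tokenizer, and the choice of cosine similarity are assumptions made for illustration; they are not drawn from any particular prior system, and the semantic and contextual analysis mentioned above is not modeled.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile or article matches nothing
    return dot / (norm_a * norm_b)


def content_score(profile_text, article_text):
    """Hypothetical relevance score of an article for one user's keyword profile."""
    return cosine_similarity(term_vector(profile_text), term_vector(article_text))
```

A score of this kind depends only on the individual user's profile, which is why such a filter, taken alone, offers no mechanism for collaboration within a group.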
Another approach uses intelligent software agents to learn a user's behavior, i.e., "watching over the shoulder," regarding certain types of textual information, for example, electronic mail messages. In this scheme, the agents offer to take action, e.g., delete the message, forward it, etc., on the basis of the user's prior responses to the content of that information. This scheme also provides a minor degree of inter-agent collaboration by allowing one agent to draw upon the experience of other agents, typically for the purpose of initialization. However, each agent is constrained to develop its expertise in a particular domain within the limited range of the type of information. In addition, the passive feedback nature of the "over-the-shoulder" approach can place an unacceptable burden on the system's learner, reducing information throughput and decreasing the efficiency and usefulness of the overall system. Finally, systematic errors can be introduced into the passive feedback, and the actual response of the user may be misinterpreted.
Another approach uses content-based filtering to select documents for a user to read, and supports inter-user collaboration by permitting the users in a defined group to annotate the selected documents. Annotations tend to take as many forms as there are users, placing the emphasis on characterizing, maintaining, and manipulating a group of diverse annotations, or "meta-documents," from different users in conjunction with the original document. Collaboration is achieved by enabling the filters of other users to access the annotations. While this approach is useful to the extent that other users can receive a deeper understanding of the comments and criticism provided by a particular user, the costs include the additional computer effort required to implement such collaboration over large, diverse groups and, more importantly, the extra time required for each user to review the comments and criticism of the annotations of the others. Also, annotation sharing and filtering are hampered by the variety in vocabulary and conceptualization among users.
Yet another approach employs collaborative filters to help users make choices based on the opinions of other users. The method employs rating servers to gather and disseminate ratings. A rating server predicts a score, or rating, based on the heuristic that people who agreed in the past will probably agree again. This system is typically limited to the homogeneous stream of text-based news articles, does little content-filtering, and cannot accommodate heterogeneous information.
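The heuristic that people who agreed in the past will probably agree again can be sketched as a correlation-weighted prediction. The example below is a minimal sketch under stated assumptions: the data layout (a dictionary mapping each user to that user's item ratings), the function names, and the use of Pearson correlation as the measure of past agreement are all illustrative choices, not a description of any specific rating server.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # too little overlap to measure agreement
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict_rating(user, item, all_ratings):
    """Predict a user's rating of an item from other users' ratings of it.

    Assumes the user has rated at least one item. Each other user's
    deviation from that user's own mean rating is weighted by how well
    the two users agreed in the past.
    """
    own = all_ratings[user]
    own_mean = sum(own.values()) / len(own)
    weighted_sum = 0.0
    weight_total = 0.0
    for other, theirs in all_ratings.items():
        if other == user or item not in theirs:
            continue
        weight = pearson(own, theirs)
        other_mean = sum(theirs.values()) / len(theirs)
        weighted_sum += weight * (theirs[item] - other_mean)
        weight_total += abs(weight)
    if weight_total == 0:
        return own_mean  # no usable neighbors: fall back to the user's mean
    return own_mean + weighted_sum / weight_total
```

Note that the prediction uses only ratings and never inspects the article itself, which is consistent with the observation that such a system does little content filtering.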
Other projects have explored individual features such as market-trading optimization techniques for prioritizing incoming messages; rule-based agents for recognizing users' usage patterns and suggesting new filtering patterns to the user; personal-adaptive recommendation systems using exit-questions for rating documents and creating shared recommendations; and the like. In each case, the collaborative and content-based aspects of information filtering are not integrated, and the filters are not equipped to deal with heterogeneous data streams.
Many information filtering systems use a weighted average technique for user information feedback that, for example, extracts all of the ratings for an article and takes a simple weighted average over all of the ratings to predict whether an article is relevant to a particular user. Simple weighted averaging, however, tends to destroy the information content contained in the ratings, unless a relatively sophisticated approach is used for the functions generating the simple weighted averages. Little impact is given to factors such as credibility, personal preferences, and the like, which factors tend to be irreversibly blurred during the averaging process. Simple weighted averages, then, can be lacking when one desires to develop information filters that are well-fitted to a particular community and the specific interests of a user unless innovative methods are employed to preserve at least some of the relevant information.
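The loss of information under simple weighted averaging can be seen in a small example. The sketch below is illustrative only: the rating scale, the sample ratings, and the helper name `weighted_average` are assumptions. It shows two articles whose ratings differ sharply in character yet collapse to the same average.

```python
def weighted_average(ratings, weights):
    """Simple weighted average of a list of ratings."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(r * w for r, w in zip(ratings, weights)) / total_weight


# Hypothetical ratings on a 1-5 scale from four raters.
polarized = [1, 1, 5, 5]   # raters sharply disagree
lukewarm = [3, 3, 3, 3]    # raters uniformly indifferent
equal_weights = [1, 1, 1, 1]

polarized_score = weighted_average(polarized, equal_weights)
lukewarm_score = weighted_average(lukewarm, equal_weights)
```

Both scores come out to 3.0, so a filter that sees only the average cannot tell a divisive article from an unremarkable one, nor which raters a particular user tends to trust.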
What is needed, then, is an apparatus and method for information filtering in a computer system receiving a data stream from a computer network in which entities of information relevant to the user, or "informons," are extracted from the data stream using content-based and collaborative filtering. Such a system would employ an adaptive content filter and an adaptive collaborative filter which are integrated to the extent that an individual user can be a unique member client of multiple communities, with each community including multiple member clients sharing similar interests.
The system also would implement adaptive credibility filtering, providing member clients with a measure of informon credibility, as judged by other member clients in the community. The system also may implement recommendation filtering and consultation filtering. Furthermore, the system preferably would be self-optimizing in that the adaptive filters used in the system would seek optimal values for the function intended by the filter, e.g., collaboration, content analysis, credibility, etc.
3. Citation of Relevant Publications
In the context of the foregoing description of the relevant art, and of the description of the present invention which follows, the following publications can be considered to be relevant:
Susan Dumais, et al. Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of CHI-88 Conference on Human Factors in Computing Systems. (1988, New York: ACM)
David Evans et al. A Summary of the CLARIT Project. Technical Report, Laboratory for Computational Linguistics, Carnegie-Mellon University, September 1991.
G. Fischer and C. Stevens. Information Access in Complex, Poorly Structured Information Spaces. In Proceedings of CHI-91 Conference on Human Factors in Computing Systems. (1991: ACM)
D. Goldberg, et al. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 35, 12 (1992), pp. 61-70.
Simon Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, N.J. (1986), pp. 100-380.
Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Co., New York (1994), pp. 18-589.
Yezdi Lashkari, et al. Collaborative Interface Agents. In Conference of the American Association for Artificial Intelligence. Seattle, Wash., August 1994.
Paul Resnick, et al. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work. (1994: ACM), pp. 175-186.
Anil Rewari, et al. AI Research and Applications In Digital's Service Organization. AI Magazine: 68-69 (1992).
J. Rissanen. Modelling by Shortest Data Description, Automatica, 14:465-471 (1978).
Gerard Salton. Developments in Automatic Text Retrieval. Science, 253:974-980, August 1991.
C. E. Shannon. A Mathematical Theory of Communication. Bell Sys. Tech. Journal, 27:379-423 (1948).
Beerud Sheth. A Learning Approach to Personalized Information Filtering, Master's Thesis, Massachusetts Institute of Technology, February, 1994.
F. Mosteller, et al. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer-Verlag, New York (1984), pp. 65-66.
T. W. Yan et al. Index Structures for Selective Dissemination of Information. Technical Report STAN-CS-92-1454, Stanford University (1992).
Yiming Yang. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems. Vol. 12, No. 3, July 1994, pp. 252-277.