1. Technical Field
The present invention relates generally to information filtering, and more particularly, to a system for creating a model of user preferences for content based on a user's history of selecting text articles for viewing in an online environment.
2. Statement of the Problem
The huge amount of information available at any one time in the evolving world wide information infrastructure, and particularly the volume of information accessible via the Internet, can easily overwhelm someone wishing to locate specific items of this information. Although it is advantageous to have such a large quantity of information available, only a small amount of it is usually relevant at any time to a given person. In order to provide a manageable volume of relevant information, an intelligent filtering system that ‘understands’ a user's need for specific types of information is invaluable. A user model which captures the user's preferences for information is thus required. Many methods are known in the art for creating such a user model with varying degrees of intrusiveness and effectiveness.
One presently known way of reducing this tremendous volume of information to a relevant and manageable size is to use an ‘information filtering agent’ which can select information according to the interest and/or need of a user. However, at present, few information filtering agents exist for the evolving world wide multimedia information infrastructure, and particularly for the Internet.
Historically, user modeling has been applied to information filtering in the literature and in practice. This modeling has become important commercially with the advent of the Internet. The Internet gives anyone connected to it access to information, product sales, services and communication. However, the Internet presents an overwhelming amount of information and a large number of items to purchase. It is thus difficult for a human to sort through this tremendous volume of Internet content without some help from a filtering or recommendation service. Therefore, ‘personalization’ of Internet content and advertising is needed to reduce the myriad choices to a manageable number for a given individual.
All previously known personalization technologies rely on building a model of a user's preferences. Therefore, personalization requires modeling the user's mind with as many of the attendant subtleties as possible. Ideally, a perfect computer model of a user's brain would determine the user's preferences exactly and track them as the user's tastes, context, or location change. Such a model would allow a personal newspaper, for example, to contain all of the articles in which the user is interested, and none of the articles in which the user is not interested. The perfect model would also allow advertisers to generate banner ads with 100% ‘click-through’ rates (i.e., a viewer would click on every ad displayed) and would allow e-commerce vendors to present only products that each given user would buy.
Fill-in profiles represent the simplest form of user modeling for personalization technology. When using a fill-in profile, the user fills in a form, which may ask for demographic information such as income, education, children, zip code, sex and age. The form may further ask for interest information such as sports, hobbies, entertainment, fashion, technology or news about a particular region or institution. Internet web sites that have registration procedures typically request information of this sort. Vendors may target advertising based on these profiles in exchange for users having access to the content site. Such profiles are the basis for almost all of the targeted advertising currently used on the Internet. This type of simple user model misses much of the richness of a user's interests because these interests do not necessarily fall into neat categories. Privacy-concerned users may also purposefully enter inaccurate information when forced to deal with this model. Furthermore, most people have trouble articulating the full range of their preferences even when not restricted by a form.
Another filtering method is called ‘clique-based recommendation’, which is also known as ‘collaborative filtering’. This method presumes that if a person's stated preferences are similar to those of a group or clique of others in some aspects, the person's preferences will also be similar to the clique's preferences in other aspects. For example, if a particular viewer likes a certain set of movies and a clique of other viewers enjoys the same set, then it is likely that other movies enjoyed by that clique will also be enjoyed by the viewer. Because the Internet makes it easy to collect preference information for a large group, collaborative filtering has become the basis for many presently known recommendation services. Note that collaborative filtering is a richer form of recommendation than a fill-in profile because it does not depend on coarse categories; for example, it is difficult to characterize a book simply by noting that it is in the category of sports. A problem with clique-based systems, however, is the need for explicit feedback by the user, such as a buying or rating decision.
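The clique-based presumption described above can be illustrated with a minimal sketch. The function names (`jaccard`, `recommend`), the use of Jaccard overlap as the similarity measure, and the parameter `k` are assumptions made for illustration only; they are not taken from any particular known recommendation service.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, likes_by_user, k=2):
    """Suggest items liked by the k users most similar to `target`.

    `likes_by_user` maps a user name to the set of items that user liked.
    Items the target already likes are never suggested.
    """
    target_likes = likes_by_user[target]
    # Form the "clique": the k other users whose likes overlap most.
    clique = sorted(
        (u for u in likes_by_user if u != target),
        key=lambda u: jaccard(target_likes, likes_by_user[u]),
        reverse=True,
    )[:k]
    scores = Counter()
    for user in clique:
        weight = jaccard(target_likes, likes_by_user[user])
        if weight == 0.0:
            continue  # no shared taste, so this user carries no evidence
        for item in likes_by_user[user] - target_likes:
            scores[item] += weight
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    likes = {
        "ann": {"A", "B", "C"},
        "bob": {"A", "B", "C", "D"},
        "cat": {"A", "B", "E"},
        "dan": {"X", "Y"},
    }
    print(recommend("ann", likes))
```

Note that every entry in `likes_by_user` is an explicit statement of preference, which is exactly the feedback burden identified above.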
Feature-based recommendation is a more sophisticated form of preference profiling because it considers multiple aspects of a product and how they may interact. For example, a person may like movies that have the features of action-adventure, rated R (but not G), and which receive a good review by a particular critic. Such features or attributes of a product can be used to create a sophisticated preference model for an individual user. A multiple-feature classifier such as a neural network can capture the complexity of user preferences if the feature set is rich enough.
Text-based recommendation is a rich form of feature-based recommendation. Years of research in information retrieval have yielded methods of characterizing text which are quite effective. These methods are generally referred to as word vector-space methods. The concept behind text-based ‘recommenders’ is that documents containing similar frequencies of words can be grouped together or clustered. Documents whose word frequencies are similar are considered closely clustered in the word vector space. Thus, if a user selects certain documents, then it is likely that the user would want to read other documents that have similar word frequencies. Because most of the information on the Internet (including news, product descriptions, and advertising) is in the form of text, text-based recommendation methods can be used to more accurately determine users' preferences for all sorts of Internet information. It is desirable that such methods be completely unobtrusive to a user, by not requiring the user to fill in a form or rate products.
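One common way to realize a word vector-space comparison is to count word frequencies per document and measure the cosine of the angle between the resulting vectors. The following is a minimal sketch under that assumption; the function names, the simple tokenizer, and the choice of raw counts without any term weighting are illustrative and are not prescribed by the methods referred to above.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Map a document to a vector of raw word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(selected_texts, candidate_texts):
    """Order candidates by closeness to the documents a user selected."""
    profile = Counter()
    for text in selected_texts:
        profile.update(word_vector(text))  # sum of the selected vectors
    scored = [
        (cosine_similarity(profile, word_vector(c)), c)
        for c in candidate_texts
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    selected = ["the team won the football match"]
    candidates = ["stock market falls sharply", "football team wins match"]
    for score, text in rank_by_similarity(selected, candidates):
        print(round(score, 3), text)
```

The sketch needs only the documents a user selected, so it requires no form and no rating from the user.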
Several techniques are known in the art for prioritizing word-based content by asking users to rate articles on a numerical scale. These techniques assemble training data that contains both positive (highly rated articles) and negative (low-rated articles) data. However, the need to rate articles is a burden to users. If a user is asked to look at all the articles in an archive or news site and read all the ones of interest, it is also possible to assemble a set of positive data (all the articles the user read or clicked on) and negative data (all those not read). Although the user is not asked for a numerical rank, a binary value can be assigned to each article (either read or not read). However, this, too, is a burden. The more usual scenario for an online newspaper has a reader perusing some of the articles but not having time to read all of them. One cannot assume, a priori, that unread articles are of no interest to the user, so the negative data are uncertain. What is needed, therefore, is a truly unobtrusive system which operates on only positive data.
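The distinction drawn above between negative data and merely unread articles can be made concrete with a short sketch. The function name `split_by_selection` and the article identifiers are hypothetical; the point illustrated is only that unselected articles are kept as unlabeled rather than being marked as negative examples.

```python
def split_by_selection(article_ids, selected_ids):
    """Separate articles into positive and unlabeled sets.

    Articles the user selected are treated as positive data. Articles
    the user did not select are NOT treated as negative; they are
    returned as unlabeled, because the user may simply have lacked
    the time to read them.
    """
    selected = set(selected_ids)
    positive = [a for a in article_ids if a in selected]
    unlabeled = [a for a in article_ids if a not in selected]
    return positive, unlabeled


if __name__ == "__main__":
    offered = ["a1", "a2", "a3", "a4"]  # hypothetical article identifiers
    clicked = ["a3", "a1"]
    positive, unlabeled = split_by_selection(offered, clicked)
    print("positive:", positive)
    print("unlabeled:", unlabeled)
```

A model trained under this arrangement would draw only on the positive list, leaving the unlabeled articles as candidates to be scored rather than as evidence of disinterest.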