a. Field of the Invention
The present invention concerns matching entities based on attributes of the entities and predicting an attribute of an entity based on attributes of the entity and other entities. More specifically, the present invention concerns "collaborative filtering" which may be used, for example, to suggest content of interest to a client entity on a network (e.g., the Internet).
b. Related Art
In the past five to ten years, computers have become interconnected to an ever increasing extent by networks, such as the Internet. The proliferation of networks, in conjunction with the increased availability of inexpensive data storage means, has afforded computer users unprecedented access to a wealth of data. Unfortunately, however, the very vastness of available data can overwhelm a user. Desired data can become difficult to find, and search heuristics employed to locate desired data often return unwanted data.
Various concepts have been employed to help users locate desired data. In the context of the Internet for example, some services have organized content based on a hierarchy of categories. A user may then navigate through a series of hierarchical menus to find content that may be of interest to them. An example of such a service is the YAHOO™ World Wide Web site on the Internet. Unfortunately, content, in the form of Internet "web sites" for example, must be organized by the service and users must navigate through menus. If a user mistakenly believes that a category will be of interest or include what they were looking for, but the category turns out to be irrelevant, the user must backtrack through one or more hierarchical levels of categories. Moreover, such services which provide hierarchical menus of categories are passive. That is, a user must actively navigate through the hierarchical menus of categories.
Again in the context of the Internet for example, some services provide "search engines" which search database content or "web sites" pursuant to a user query. In response to a user's query, such a service returns a rank ordered list which includes brief descriptions of the uncovered content, as well as hypertext links to the uncovered content. (A hypertext link is text, having associated Internet address information, which, when activated, commands a computer to retrieve content from the associated Internet address.) The rank ordering of the list is typically based on a match between words appearing in the query and words appearing in the content. Unfortunately, however, present limitations of search heuristics often cause irrelevant content to be returned in response to a query. Again, unfortunately, the very wealth of available content impairs the efficacy of these search engines since it is difficult to separate irrelevant content from relevant content. Moreover, as was the case with services which provide hierarchical menus of categories, search engines are passive. That is, a user must actively submit a query.
The two above mentioned content search concepts are categorized as "pull" processes because the user must explicitly direct these processes to find the content and pull it to them (i.e., to their computer).
In view of the drawbacks of the above discussed data location concepts, "collaborative filtering" systems have been developed. Collaborative filtering systems predict the preferences of a user based on known attributes of the user, as well as known attributes of other users. Some collaborative filtering systems require that a user fill out a survey of his or her interests and use the submitted survey as a query. Hence, such collaborative filtering systems may be classified as "pull" processes. Other collaborative filtering systems are categorized as "push" processes because they use content previously "consumed" (e.g., requested, downloaded, rendered, etc.) by a user to proactively predict content which may appeal to that user. Such collaborative filtering systems then present (or "push") the content, or information identifying the content, to the user.
Basically, collaborative filtering uses known attributes (e.g., explicitly entered votes) of a new user (referred to as "the active case") and known attributes of other users to predict values of unknown attributes of the new user (e.g., attributes not yet entered by the new user). The mean vote ($\bar{v}_i$) for an entity may be defined as:

$$\bar{v}_i = \frac{1}{m_i} \sum_{j \in I_i} v_{i,j}$$

where

$v_{i,j}$ ≡ A value of attribute j of entity i. Typically, an integer value.
$m$ ≡ The number of attributes (e.g., in a database).
$I_i$ ≡ A set of attribute indexes for which entity i has known values (e.g., based on an explicitly entered vote). For example, $I_2 = \{3, 4\}$ means that entity 2 has values (e.g., has voted) for attributes 3 and 4.
$m_i$ ≡ The number of attributes for which entity i has known values, that is, the number of elements in $I_i$.
Denoting parameters for the active case (i.e., new entity) with subscript a, a prediction $p_{a,j}$ of active case attribute values (e.g., votes) for attributes without known values (i.e., attributes not in $I_a$) can be defined as:

$$p_{a,j} = \bar{v}_a + \sum_{i=1}^{n} w_{a,i} \left( v_{i,j} - \bar{v}_i \right)$$

where

$n$ ≡ The number of entities (e.g., in a database).
$w_{a,i}$ ≡ The estimated weight (or alternatively match) between entity i and entity a.
$p_{i,j}$ ≡ The predicted value of attribute j of entity i.
Hence, a predicted attribute value (e.g., vote) is calculated from a weighted sum of the attribute values (e.g., votes) of each other user. The appearance of mean values in the formula merely serves to express values in terms of deviation from the mean value (i.e., defines a reference) and has no other significant impact.
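The prediction described above may be illustrated with a minimal Python sketch. It assumes that the prediction is the active case's mean vote plus an un-normalized weighted sum of each other entity's deviation from its own mean vote, that the weights are supplied from elsewhere, and that entities lacking a known value for the attribute are skipped. The names `Votes`, `mean_vote` and `predict`, as well as the sample data, are hypothetical and serve only as illustration.

```python
from typing import Dict

# votes[i][j] holds a known attribute value for entity i and attribute j.
# A missing key j means that entity i has no known value for attribute j.
Votes = Dict[str, Dict[str, float]]


def mean_vote(votes: Votes, i: str) -> float:
    """Mean of entity i's known attribute values."""
    known = votes[i]
    return sum(known.values()) / len(known)


def predict(votes: Votes, weights: Dict[str, float], a: str, j: str) -> float:
    """Active-case mean plus the weighted sum of other entities' deviations.

    Assumption: no normalizing factor is applied to the weights.
    """
    total = 0.0
    for i, w in weights.items():
        if i == a or j not in votes[i]:
            continue  # skip the active case and entities with no value for j
        total += w * (votes[i][j] - mean_vote(votes, i))
    return mean_vote(votes, a) + total


# Hypothetical data: the active case "a" has no value for attribute "z".
votes = {
    "a": {"x": 4, "y": 2},
    "u1": {"x": 4, "y": 2, "z": 6},
    "u2": {"x": 2, "y": 4, "z": 0},
}
weights = {"u1": 0.5, "u2": -0.5}  # assumed weights between "a" and the others
prediction = predict(votes, weights, "a", "z")  # 3.0 + 0.5*2.0 + (-0.5)*(-2.0) = 5.0
```

In this sketch, entity u2 disagrees with the active case (negative weight) and rated attribute z below its own mean, so its contribution raises the prediction just as the contribution of the agreeing entity u1 does.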
An example of a proposed collaborative filtering system is discussed in the article, Resnick et al., "GroupLens: An Open Architecture for Collaborative Filtering of Netnews," Proceedings of the Association for Computing Machinery 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, N.C., pp. 175-186 (1994) (hereafter referred to as "the Resnick article"). In the system discussed in the Resnick article (hereafter referred to as "the GroupLens system"), users rate articles which they have read. Rating servers, called Better Bit Bureaus, gather and disseminate the ratings. More specifically, the Better Bit Bureaus package one or more ratings into a news article. The rating servers predict scores based on a heuristic that people who agreed in the past will probably agree again. More specifically, the GroupLens system first correlates ratings to determine the similarity of a user's ratings with the ratings of other users. Correlation coefficients or weights between -1 and 1 are computed and indicate how much a particular user tended to agree with other users. The GroupLens system then predicts how much the user will like a new article based on ratings from similar users. More specifically, the ratings of the other users are weighted based on the correlation coefficients determined above and the weighted ratings are combined to form a prediction.
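A correlation weight of the kind described above may be sketched as follows. This is not code from the GroupLens system; it assumes a Pearson correlation computed over the articles that both users have rated, and it assumes that a weight of zero is returned when there is too little overlap or when either user's ratings do not vary. The function name `correlation_weight` and the sample ratings are hypothetical.

```python
import math
from typing import Dict


def correlation_weight(ratings_a: Dict[str, float], ratings_i: Dict[str, float]) -> float:
    """Pearson correlation over co-rated articles; the result lies in [-1, 1]."""
    common = sorted(set(ratings_a) & set(ratings_i))
    if len(common) < 2:
        return 0.0  # assumption: too little overlap gives no evidence of agreement
    mean_a = sum(ratings_a[k] for k in common) / len(common)
    mean_i = sum(ratings_i[k] for k in common) / len(common)
    cov = sum((ratings_a[k] - mean_a) * (ratings_i[k] - mean_i) for k in common)
    var_a = sum((ratings_a[k] - mean_a) ** 2 for k in common)
    var_i = sum((ratings_i[k] - mean_i) ** 2 for k in common)
    if var_a == 0 or var_i == 0:
        return 0.0  # assumption: constant ratings carry no signal
    return cov / math.sqrt(var_a * var_i)


# Hypothetical ratings of three articles by three users.
alice = {"art1": 1, "art2": 2, "art3": 3}
bob = {"art1": 2, "art2": 4, "art3": 6}    # rates in the same direction as alice
carol = {"art1": 6, "art2": 4, "art3": 2}  # rates in the opposite direction

agree = correlation_weight(alice, bob)       # close to 1
disagree = correlation_weight(alice, carol)  # close to -1
```

A weight near 1 indicates that two users tended to agree, a weight near -1 indicates that they tended to disagree, and such weights could then be used to weight the other users' ratings when forming a prediction.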
Unfortunately, the GroupLens system has a number of problems. First, users must explicitly enter ratings, and some users find it difficult to judge articles or other content. In this regard, it is expected that predictions made by the GroupLens system will improve as its correlation or weight determinations improve, and that those determinations will improve as more ratings are entered. Unfortunately, many users may become frustrated by poor predictions and/or with entering ratings before enough ratings have been gathered for the correlation and weight determinations to be reliable. Thus, the GroupLens system has a bootstrapping problem: an initial scarcity of ratings leads to initially poor predictions, and users frustrated by those predictions may stop entering ratings. If this occurs, the predictions made by the GroupLens system will probably not improve because users will not provide it with enough ratings information.
Moreover, the correlation strategy used in the GroupLens system apparently does not consider the distinctness of the ratings. For example, the fact that two users might like a popular article is apparently not weighted less than the fact that two users might like a very unpopular article. Furthermore, the GroupLens system apparently does not consider non-data, or the absence of ratings by users.
Thus, improved content location methods and apparatus are needed. Since burdens formerly placed on the entity (e.g., a computer user) should be eliminated to the extent possible, such methods and apparatus (i) should be usable in content push systems, such as collaborative filtering systems for example, and (ii) should use entity attributes which may be explicitly and/or implicitly determined. Since the content should be only the most relevant or most likely to be of interest to the entity, such methods and apparatus should accurately match entities based on attributes of the entities and should accurately predict attributes of (e.g., content of interest to) an entity based on attributes of the entity and other entities. Finally, the methods and apparatus should be able to operate in a distributed environment, such as a networked environment including clients and servers.