This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically based proxy server is provided to ensure privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.
It is a problem in the field of electronic media to enable a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy searching for the information. Electronic media, such as on-line information sources, provide a vast amount of information to users, typically in the form of "articles," each of which comprises a publication item or document that relates to a specific topic. The difficulty with electronic media is that the amount of information available to the user is overwhelming and the article repository systems that are connected on-line are not organized in a manner that sufficiently simplifies access to only the articles of interest to the user. Presently, a user either fails to access relevant articles because they are not easily identified or expends a significant amount of time and energy to conduct an exhaustive search of all articles to identify those most likely to be of interest to the user. Furthermore, even if the user conducts an exhaustive search, present information searching techniques do not necessarily accurately extract only the most relevant articles, but also present articles of marginal relevance due to the functional limitations of the information searching techniques. There is also no existing system which automatically estimates the inherent quality of an article or other target object to distinguish among a number of articles or target objects identified as of possible interest to a user.
Therefore, in the field of information retrieval, there is a long-standing need for a system which enables users to navigate through the plethora of information. With commercialization of communication networks, such as the Internet, the growth of available information has accelerated. Customization of the information delivery process to the user's unique tastes and interests is the ultimate solution to this problem. However, the techniques which have been proposed to date either only address the user's interests on a superficial level or provide greater depth and intelligence at the cost of unwanted demands on the user's time and energy. While many researchers have agreed that traditional methods have been lacking in this regard, no one to date has successfully addressed these problems in a holistic manner and provided a system that can fully learn and reflect the user's tastes and interests. This is particularly true in a practical commercial context, such as on-line services available on the Internet. There is a need for an information retrieval system that is largely or entirely passive, unobtrusive, undemanding of the user, and yet both precise and comprehensive in its ability to learn and truly represent the user's tastes and interests. Present information retrieval systems require the user to specify the desired information retrieval behavior through cumbersome interfaces.
Users may receive information on a computer network either by actively retrieving the information or by passively receiving information that is sent to them. Just as users of information retrieval systems face the problem of too much information, so do users who are targeted with electronic junk mail by individuals and organizations. An ideal system would protect the user from unsolicited advertising, both by automatically extracting only the most relevant messages received by electronic mail, and by preserving the confidentiality of the user's preferences, which should not be freely available to others on the network.
Researchers in the field of published article information retrieval have devoted considerable effort to finding efficient and accurate methods of allowing users to select articles of interest from a large set of articles. The most widely used methods of information retrieval are based on keyword matching: the user specifies a set of keywords which the user thinks are exclusively found in the desired articles and the information retrieval computer retrieves all articles which contain those keywords. Such methods are fast, but are notoriously unreliable, as users may not think of the right keywords, or the keywords may be used in unwanted articles in an irrelevant or unexpected context. As a result, the information retrieval computers retrieve many articles which are unwanted by the user. The logical combination of keywords and the use of wild-card search parameters help improve the accuracy of keyword searching but do not completely solve the problem of inaccurate search results. Starting in the 1960s, an alternate approach to information retrieval was developed: users were presented with an article and asked if it contained the information they wanted, or to quantify how close the information contained in the article was to what they wanted. Each article was described by a profile which comprised either a list of the words in the article or, in more advanced systems, a table of word frequencies in the article. Since a measure of similarity between articles is the distance between their profiles, the measured similarity of article profiles can be used in article retrieval. For example, a user searching for information on a subject can write a short description of the desired information. The information retrieval computer generates an article profile for the request and then retrieves articles with profiles similar to the profile generated for the request.
These requests can then be refined using "relevance feedback", where the user actively or passively rates the articles retrieved as to how close the information contained therein is to what is desired. The information retrieval computer then uses this relevance feedback information to refine the request profile and the process is repeated until the user either finds enough articles or tires of the search.
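The profile comparison and relevance feedback loop described above can be illustrated with a minimal sketch. It assumes a word-frequency profile, cosine similarity as the similarity measure, and a simple additive update of the request profile; the function names `profile`, `cosine_similarity` and `refine`, the `rate` parameter, and the update rule itself are illustrative assumptions, not details taken from the systems discussed here.

```python
import math
from collections import Counter


def profile(text):
    """Build a word-frequency profile: word -> relative frequency in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Similarity of two profiles (1.0 = same direction, 0.0 = no shared words)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def refine(request, article, rating, rate=0.5):
    """Move the request profile toward (rating > 0) or away from (rating < 0)
    the profile of a rated article. Assumed update rule, for illustration only."""
    updated = {}
    for word in set(request) | set(article):
        value = request.get(word, 0.0) + rate * rating * article.get(word, 0.0)
        if value > 0.0:  # drop words whose weight falls to zero or below
            updated[word] = value
    return updated
```

For example, after a user rates an article favorably, `refine(request, profile(article_text), 1.0)` yields a request profile that is more similar to that article than the original request was, so the next retrieval pass favors articles like it.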
A number of researchers have looked at methods for selecting articles of most interest to users. An article titled "Social Information filtering: algorithms for automating 'word of mouth'" was published in the CHI-95 Proceedings by Pattie Maes et al. and describes the Ringo information retrieval system, which recommends musical selections. The Ringo system requires active feedback from the users: users must manually specify how much they like or dislike each musical selection. The Ringo system maintains a complete list of users' ratings of music selections and makes recommendations by finding which selections were liked by multiple people. However, the Ringo system does not take advantage of any available descriptions of the music, such as structured descriptions in a data base, or free text, such as that contained in music reviews. An article titled "Evolving agents for personalized information filtering", published in the Proc. 9th IEEE Conf. on AI for Applications by Sheth and Maes, described the use of agents for information filtering which use genetic algorithms to learn to categorize Usenet news articles. In this system, users must define news categories and the users actively indicate their opinion of the selected articles. Their system uses a list of keywords to represent sets of articles and the records of users' interests are updated using genetic algorithms.
A number of other research groups have looked at the automatic generation and labeling of clusters of articles for the purpose of browsing through the articles. A group at Xerox PARC published a paper titled "Scatter/gather: a cluster-based approach to browsing large article collections" at the 15th Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al. 1992). This group developed a method they call "scatter/gather" for performing information retrieval searches. In this method, a collection of articles is "scattered" into a small number of clusters; the user then chooses one or more of these clusters based on short summaries of each cluster. The selected clusters are then "gathered" into a subcollection, and then the process is repeated. Each iteration of this process is expected to produce a smaller, more focused collection. The cluster "summaries" are generated by picking those words which appear most frequently in the cluster and the titles of those articles closest to the center of the cluster. However, no feedback from users is collected or stored, so no performance improvement occurs over time.
Apple's Advanced Technology Group has developed an interface based on the concept of a "pile of articles". This interface is described in an article titled "A 'pile' metaphor for supporting casual organization of information", in Human factors in computer systems, published in CHI '92 Conf. Proc. 627-634 by Mander, R., G. Salomon and Y. Wong, 1992. Another article titled "Content awareness in a file system interface: implementing the 'pile' metaphor for organizing information" was published in 16th Ann. Int'l SIGIR '93, ACM 260-269 by Rose E. D. et al. The Apple interface uses word frequencies to automatically file articles by picking the pile most similar to the article being filed. This system functions to cluster articles into subpiles, determine key words for indexing by picking the words with the largest TF/IDF (where TF is term (word) frequency and IDF is the inverse document frequency) and label piles by using the determined key words.
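Key word selection by largest TF/IDF can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and one common definition of the score, TF multiplied by log(N / DF), where N is the number of articles and DF is the number of articles containing the word; the function name `tf_idf_keywords` and its parameters are hypothetical, and the exact weighting used by the Apple system is not stated here.

```python
import math
from collections import Counter


def tf_idf_keywords(documents, index, top_n=3):
    """Return the top_n words of documents[index] ranked by TF * IDF.

    TF  = occurrences of the word in the document / document length
    IDF = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    term_counts = Counter(tokenized[index])
    length = len(tokenized[index])
    scores = {
        word: (count / length) * math.log(n_docs / doc_freq[word])
        for word, count in term_counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_n]
```

A word such as "the" that occurs in every article receives an IDF of zero and is therefore never preferred as a key word, while words concentrated in a single article or pile score highest.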
Numerous patents address information retrieval methods, but none develop records of a user's interest based on passive monitoring of which articles the user accesses. None of the systems described in these patents present computer architectures to allow fast retrieval of articles distributed across many computers. None of the systems described in these patents address issues of using such article retrieval and matching methods for purposes of commerce or of matching users with common interests or developing records of users' interests. U.S. Pat. No. 5,321,833 issued to Chang et al. teaches a method in which users choose terms to use in an information retrieval query, and specify the relative weightings of the different terms. The Chang system then calculates multiple levels of weighting criteria. U.S. Pat. No. 5,301,109 issued to Landauer et al. teaches a method for retrieving articles in a multiplicity of languages by constructing "latent vectors" (SVD or PCA vectors) which represent correlations between the different words. U.S. Pat. No. 5,331,554 issued to Graham et al. discloses a method for retrieving segments of a manual by comparing a query with nodes in a decision tree. U.S. Pat. No. 5,331,556 addresses techniques for deriving morphological part-of-speech information and thus making use of the similarities of different forms of the same word (e.g. "article" and "articles").
Therefore, there presently is no information retrieval and delivery system operable in an electronic media environment that enables a user to access information of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy.
The above-described problems are solved and a technical advance achieved in the field by the system for customized electronic identification of desirable objects in an electronic media environment, which system enables a user to access target objects of relevance and interest to the user without requiring the user to expend an excessive amount of time and energy. Profiles of the target objects are stored on electronic media and are accessible via a data communication network. In many applications, the target objects are informational in nature, and so may themselves be stored on electronic media and be accessible via a data communication network.
Relevant definitions of terms for the purpose of this description include: (a.) an object available for access by the user, which may be either physical or electronic in nature, is termed a "target object", (b.) a digitally represented profile indicating that target object's attributes is termed a "target profile", (c.) the user looking for the target object is termed a "user", (d.) a profile holding that user's attributes, including age/zip code/etc. is termed a "user profile", (e.) a summary of digital profiles of target objects that a user likes and/or dislikes, is termed the "target profile interest summary" of that user, (f.) a profile consisting of a collection of attributes, such that a user likes target objects whose profiles are similar to this collection of attributes, is termed a "search profile" or in some contexts a "query" or "query profile," (g.) a specific embodiment of the target profile interest summary which comprises a set of search profiles is termed the "search profile set" of a user, (h.) a collection of target objects with similar profiles, is termed a "cluster," (i.) an aggregate profile formed by averaging the attributes of all target objects in a cluster, is termed a "cluster profile," (j.) a real number determined by calculating the statistical variance of the profiles of all target objects in a cluster, is termed a "cluster variance," (k.) a real number determined by calculating the maximum distance between the profiles of any two target objects in a cluster, is termed a "cluster diameter."
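Definitions (i.) through (k.) can be made concrete with a short sketch. It assumes that each target profile has been reduced to a list of numeric attributes of equal length and that distance between profiles is Euclidean; both are assumptions made for illustration, as are the function names. The variance is computed here as the mean squared distance of the member profiles from the cluster profile, which is one reasonable reading of "statistical variance of the profiles".

```python
import math
from itertools import combinations


def distance(p, q):
    """Euclidean distance between two numeric profiles (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_profile(profiles):
    """Aggregate profile: attribute-by-attribute average over the cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cluster_variance(profiles):
    """Mean squared distance of member profiles from the cluster profile."""
    center = cluster_profile(profiles)
    return sum(distance(p, center) ** 2 for p in profiles) / len(profiles)


def cluster_diameter(profiles):
    """Maximum distance between the profiles of any two members."""
    return max(
        (distance(p, q) for p, q in combinations(profiles, 2)),
        default=0.0,  # a one-member cluster has diameter zero
    )
```

For four profiles at the corners of a square of side 2, the cluster profile is the center of the square, the cluster variance is 2, and the cluster diameter is the length of the diagonal.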
The system for electronic identification of desirable objects of the present invention automatically constructs both a target profile for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects available on the electronic media.
Because people have multiple interests, a target profile interest summary for a single user must represent multiple areas of interest, for example, by consisting of a set of individual search profiles, each of which identifies one of the user's areas of interest. Each user is presented with those target objects whose profiles most closely match the user's interests as described by the user's target profile interest summary. Users' target profile interest summaries are automatically updated on a continuing basis to reflect each user's changing interests. In addition, target objects can be grouped into clusters based on their similarity to each other, for example, based on similarity of their topics in the case where the target objects are published articles; and menus automatically generated for each cluster of target objects to allow users to navigate throughout the clusters and manually locate target objects of interest. For reasons of confidentiality and privacy, a particular user may not wish to make public all of the interests recorded in the user's target profile interest summary, particularly when these interests are determined by the user's purchasing patterns. The user may desire that all or part of the target profile interest summary be kept confidential, such as information relating to the user's political, religious, financial or purchasing behavior; indeed, confidentiality with respect to purchasing behavior is the user's legal right in many states. It is therefore necessary that data in a user's target profile interest summary be protected from unwanted disclosure except with the user's agreement. At the same time, users' target profile interest summaries must be accessible to the relevant servers that perform the matching of target objects to the users, if the benefit of this matching is desired by both providers and consumers of the target objects.
The disclosed system provides a solution to the privacy problem by using a proxy server which acts as an intermediary between the information provider and the user. The proxy server dissociates the user's true identity from the pseudonym by the use of cryptographic techniques. The proxy server also permits users to control access to their target profile interest summaries and/or user profiles, including provision of this information to marketers and advertisers if they so desire, possibly in exchange for cash or other considerations. Marketers may purchase these profiles in order to target advertisements to particular users, or they may purchase partial user profiles, which do not include enough information to identify the individual users in question, in order to carry out standard kinds of demographic analysis and market research on the resulting database of partial user profiles.
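The intermediary role of the proxy server can be pictured with a deliberately simplified sketch. This passage does not specify the cryptographic techniques, so the sketch substitutes a keyed hash (HMAC-SHA256 under a secret held only by the proxy) to derive a pseudonym, and a per-user allow list to model user-controlled release of partial profiles. The class `PseudonymProxy`, its methods, and the attribute names are hypothetical and stand in for, rather than reproduce, the disclosed mechanism.

```python
import hashlib
import hmac
import secrets


class PseudonymProxy:
    """Toy intermediary: providers see pseudonyms, never true identities."""

    def __init__(self):
        # Secret known only to the proxy; without it a pseudonym cannot be
        # recomputed from (or linked to) a true identity by a third party.
        self._key = secrets.token_bytes(32)
        self._identity_by_pseudonym = {}
        self._released = {}  # pseudonym -> attribute names the user agreed to release

    def register(self, true_identity):
        """Derive a stable pseudonym for a user (assumed scheme: HMAC-SHA256)."""
        digest = hmac.new(self._key, true_identity.encode("utf-8"), hashlib.sha256)
        pseudonym = digest.hexdigest()
        self._identity_by_pseudonym[pseudonym] = true_identity
        self._released.setdefault(pseudonym, set())
        return pseudonym

    def allow(self, pseudonym, attribute):
        """Record the user's agreement to release one profile attribute."""
        self._released[pseudonym].add(attribute)

    def disclose(self, pseudonym, user_profile):
        """Return only the part of the profile the user has agreed to release."""
        permitted = self._released.get(pseudonym, set())
        return {k: v for k, v in user_profile.items() if k in permitted}
```

In this sketch a marketer querying `disclose` receives an empty partial profile until the user calls `allow` for specific attributes, which mirrors the idea that disclosure happens only with the user's agreement.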
In the preferred embodiment of the invention, the system for customized electronic identification of desirable objects uses a fundamental methodology for accurately and efficiently matching users and target objects by automatically calculating, using and updating profile information that describes both the users' interests and the target objects' characteristics. The target objects may be published articles, purchasable items, or even other people, and their properties are stored, and/or represented and/or denoted on the electronic media as (digital) data. Examples of target objects can include, but are not limited to: a newspaper story of potential interest, a movie to watch, an item to buy, e-mail to receive, or another person to correspond with. In all these cases, the information delivery process in the preferred embodiment is based on determining the similarity between a profile for the target object and the profiles of target objects for which the user (or a similar user) has provided positive feedback in the past. The individual data that describe a target object and constitute the target object's profile are herein termed "attributes" of the target object. Attributes may include, but are not limited to, the following: (1) long pieces of text (a newspaper story, a movie review, a product description or an advertisement), (2) short pieces of text (name of a movie's director, name of town from which an advertisement was placed, name of the language in which an article was written), (3) numeric measurements (price of a product, rating given to a movie, reading level of a book), (4) associations with other types of objects (list of actors in a movie, list of persons who have read a document). Any of these attributes, but especially the numeric ones, may correlate with the quality of the target object, such as measures of its popularity (how often it is accessed) or of user satisfaction (number of complaints received).
The preferred embodiment of the system for customized electronic identification of desirable objects operates in an electronic media environment for accessing these target objects, which may be news, electronic mail, other published documents, or product descriptions. The system in its broadest construction comprises three conceptual modules, which may be separate entities distributed across many implementing systems, or combined into a lesser subset of physical entities. The specific embodiment of this system disclosed herein illustrates the use of a first module which automatically constructs a "target profile" for each target object in the electronic media based on various descriptive attributes of the target object. A second module uses interest feedback from users to construct a "target profile interest summary" for each user, for example in the form of a "search profile set" consisting of a plurality of search profiles, each of which corresponds to a single topic of high interest for the user. The system further includes a profile processing module which estimates each user's interest in various target objects by reference to the users' target profile interest summaries, for example by comparing the target profiles of these target objects against the search profiles in users' search profile sets, and generates for each user a customized rank-ordered listing of target objects most likely to be of interest to that user. Each user's target profile interest summary is automatically updated on a continuing basis to reflect the user's changing interests.
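The work of the profile processing module can be sketched in a few lines. The sketch assumes that target profiles and search profiles are word-weight dictionaries, that similarity is measured by cosine similarity, and that a user's estimated interest in a target object is its best match against any search profile in the user's search profile set; these choices and the function names are illustrative assumptions rather than the specific computation of the preferred embodiment.

```python
import math


def similarity(p, q):
    """Cosine similarity between two word-weight profiles (assumed measure)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def estimated_interest(target_profile, search_profile_set):
    """Interest = similarity to the closest search profile (one per topic)."""
    return max(
        (similarity(target_profile, s) for s in search_profile_set),
        default=0.0,
    )


def rank_targets(target_profiles, search_profile_set, limit=10):
    """Return target object names in descending order of estimated interest."""
    scored = [
        (estimated_interest(profile, search_profile_set), name)
        for name, profile in target_profiles.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [name for _, name in scored[:limit]]
```

Taking the maximum over the search profile set means that a target object needs to match only one of the user's areas of interest to rank highly, which reflects the point that a single user has several distinct interests.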
Target objects may be of various sorts, and it is sometimes advantageous to use a single system that delivers and/or clusters target objects of several distinct sorts at once, in a unified framework. For example, users who exhibit a strong interest in certain novels may also show an interest in certain movies, presumably of a similar nature. A system in which some target objects are novels and other target objects are movies can discover such a correlation and exploit it in order to group particular novels with particular movies, e.g., for clustering purposes, or to recommend the movies to a user who has demonstrated interest in the novels. Similarly, if users who exhibit an interest in certain World Wide Web sites also exhibit an interest in certain products, the system can match the products with the sites and thereby recommend to the marketers of those products that they place advertisements at those sites, e.g., in the form of hypertext links to their own sites.
The ability to measure the similarity of profiles describing target objects and a user's interests can be applied in two basic ways: filtering and browsing. Filtering is useful when large numbers of target objects are described in the electronic media space. These target objects can for example be articles that are received or potentially received by a user, who only has time to read a small fraction of them. For example, one might potentially receive all items on the AP news wire service, all items posted to a number of news groups, all advertisements in a set of newspapers, or all unsolicited electronic mail, but few people have the time or inclination to read so many articles. A filtering system in the system for customized electronic identification of desirable objects automatically selects a set of articles that the user is likely to wish to read. The accuracy of this filtering system improves over time by noting which articles the user reads and by generating a measurement of the depth to which the user reads each article. This information is then used to update the user's target profile interest summary. Browsing provides an alternate method of selecting a small subset of a large number of target objects, such as articles. Articles are organized so that users can actively navigate among groups of articles by moving from one group to a larger, more general group, to a smaller, more specific group, or to a closely related group. Each individual article forms a one-member group of its own, so that the user can navigate to and from individual articles as well as larger groups. The methods used by the system for customized electronic identification of desirable objects allow articles to be grouped into clusters and the clusters to be grouped and merged into larger and larger clusters. These hierarchies of clusters then form the basis for menuing and navigational systems to allow the rapid searching of large numbers of articles.
This same clustering technique is applicable to any type of target objects that can be profiled on the electronic media.
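The grouping of articles into clusters, and of clusters into larger and larger clusters, can be sketched as a simple bottom-up merge. The sketch assumes numeric profiles, Euclidean distance between cluster centroids as the merge criterion, and a nested-tuple representation of the resulting hierarchy; the function names and these choices are assumptions made for illustration, and a production system would likely use a more efficient algorithm.

```python
import math
from itertools import combinations


def centroid(profiles):
    """Attribute-by-attribute average of a list of numeric profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def build_hierarchy(named_profiles):
    """Repeatedly merge the two closest clusters until one tree remains.

    named_profiles: dict of name -> numeric profile (at least one entry).
    Returns a nested tuple; leaves are names, so each individual target
    object is a one-member group of its own.
    """
    # Each working cluster is (tree, member_profiles).
    clusters = [(name, [profile]) for name, profile in named_profiles.items()]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: math.dist(
                centroid(clusters[pair[0]][1]), centroid(clusters[pair[1]][1])
            ),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

Each level of the returned tree corresponds to one level of a menu: moving toward the root reaches larger, more general groups, and moving toward the leaves reaches smaller, more specific groups and finally individual target objects.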
There are a number of variations on the theme of developing and using profiles for article retrieval, with the basic implementation of an on-line news clipping service representing the preferred embodiment of the invention. Variations of this basic system are disclosed and comprise a system to filter electronic mail, an extension for retrieval of target objects such as purchasable items which may have more complex descriptions, a system to automatically build and alter menuing systems for browsing and searching through large numbers of target objects, and a system to construct virtual communities of people with common interests. These intelligent filters and browsers are necessary to provide a truly passive, intelligent system interface. A user interface that permits intuitive browsing and filtering represents for the first time an intelligent system for determining the affinities between users and target objects. The detailed, comprehensive target profiles and user-specific target profile interest summaries enable the system to provide responsive routing of specific queries for user information access. The information maps so produced and the application of users' target profile interest summaries to predict the information consumption patterns of a user allow for pre-caching of data at locations on the data communication network and at times that minimize the traffic flow in the communication network to thereby efficiently provide the desired information to the user and/or conserve valuable storage space by only storing those target objects (or segments thereof) which are relevant to the user's interests.