Publications and other reference materials referred to herein, including references cited therein, are incorporated herein by reference in their entirety and are numerically referenced in the following text and respectively grouped in the appended Bibliography which immediately precedes the claims.
Content-based filtering deals with comparing representations of the content of items (e.g. documents, news) with representations of users' (readers of the items) interests, in order to find the items that are most relevant to each user [1]. This poses a task of finding the best representation for both the items (item profile) and the users (user profile). A user profile represents a mapping of the actual user's interest to a compact model space, which approximates the user's actual real world interests. A user's profile and an item's profile should share a common method of representation (for example, representation by keywords) in order to enable matching between the profiles. The output of the matching process is expressed as a ranking score, indicating the similarity between the user's profile and a given item.
The content-based filtering approach is based on the information retrieval domain and employs many of the same techniques. However, information filtering differs from information retrieval in the representation of the users' interests. Instead of using ad-hoc queries, as in information retrieval, the filtering system tries to model the users' long-term interests in the form of user profiles. User profiles, as well as item profiles, may consist of sets of terms. The filtering system selects and rank-orders items based on the similarity of their profiles to the user's profile.
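By way of illustration only, the matching and rank-ordering described above can be sketched as follows. The sketch assumes that profiles are weighted sets of terms and that similarity is measured by the cosine of the angle between them; the text above does not prescribe a particular similarity measure, and the function names, terms and weights are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two term -> weight dictionaries."""
    shared_terms = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_items(user_profile, item_profiles):
    """Return (item_id, score) pairs, most similar item first."""
    scored = [
        (item_id, cosine_similarity(user_profile, profile))
        for item_id, profile in item_profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical long-term user profile and item profiles (illustrative only).
user_profile = {"football": 0.8, "transfer": 0.5, "election": 0.1}
item_profiles = {
    "item_sport": {"football": 1.0, "transfer": 1.0},
    "item_politics": {"election": 1.0, "parliament": 1.0},
    "item_weather": {"storm": 1.0},
}

ranking = rank_items(user_profile, item_profiles)
for item_id, score in ranking:
    print(f"{item_id}: {score:.3f}")
```

In this sketch the user profile persists across items, in contrast to an ad-hoc query, and the ranking score of each item depends only on the terms it shares with that profile.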
The relevancy of items read by a user can be rated by explicit or implicit user feedback. Explicit feedback requires the user to express the degree of relevancy of a read item, while in implicit feedback the relevancy of an item to the user is inferred by observing the user's actions, e.g. reading time. Implicit feedback may be more convenient for the user but more difficult to implement and less accurate. User feedback enables the user's profile to be updated according to what she actually read, liked or disliked.
There exist two main approaches in filtering: content-based filtering and collaborative filtering. In collaborative filtering, the system selects and rank-orders items for a user based on the similarity of the user to other users who read/liked similar items in the past. In content-based filtering, the system selects and rank-orders items based on content, i.e., on the similarity of the user's profile and the items' profiles.
A major advantage of content-based filtering is that users can gain insight into why the system considers items to be of interest to them, since the content of each item is known from its representation. Content-based filters are also less affected by problems of collaborative filtering systems, such as “cold start” and sparsity. If a new item is added to the database, it cannot be recommended to a user by a collaborative filter before enough users read/rate it. Moreover, if the number of users is small relative to the volume of items in the system, there is a danger of the coverage of ratings becoming very sparse, thinning the collection of recommendable items. For a user whose tastes are unusual compared to the rest of the population, the system will not be able to locate users who are particularly similar, leading to poor recommendations.
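The “cold start” problem of collaborative filtering can be illustrated with a minimal sketch. It assumes a simple user-to-user scheme in which users are compared by the overlap of the items they liked (Jaccard overlap); this scheme, the function names and the data are illustrative assumptions and are not taken from any of the cited systems.

```python
def user_similarity(likes_a, likes_b):
    """Jaccard overlap between the sets of items two users liked."""
    union = likes_a | likes_b
    return len(likes_a & likes_b) / len(union) if union else 0.0


def collaborative_scores(target_user, liked_by_user):
    """Score items the target user has not seen, using other users' likes."""
    target_likes = liked_by_user[target_user]
    scores = {}
    for other_user, likes in liked_by_user.items():
        if other_user == target_user:
            continue
        similarity = user_similarity(target_likes, likes)
        for item in likes - target_likes:
            scores[item] = scores.get(item, 0.0) + similarity
    return scores


# Hypothetical reading history; "item_new" is in the catalog but unread.
liked_by_user = {
    "alice": {"item_a", "item_b"},
    "bob": {"item_a", "item_b", "item_c"},
    "carol": {"item_d"},
}
catalog = {"item_a", "item_b", "item_c", "item_d", "item_new"}

scores = collaborative_scores("alice", liked_by_user)
unscorable = catalog - liked_by_user["alice"] - set(scores)
print(scores)
print(unscorable)
```

Under these assumptions the newly added item receives no score at all, because no user has read it yet, whereas a content-based filter could score it from its item profile as soon as it is added.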
But content-based filtering has disadvantages too:
   1. Keyword-based content filtering focuses on content similarity among items. This approach, however, is incapable of capturing more complex relationships at a deeper semantic level based on different types of attributes associated with structured objects of the text. Consequently, many relevant items are missed and many irrelevant items are retrieved.
   2. Unlike humans, content-based techniques have difficulty in distinguishing between high quality and low quality information, since both good and bad information might be represented by the same terms. As the number of items increases, the number of items in the same content-based category increases too, further decreasing the effectiveness of content-based approaches.
   3. Content-based methods require analyzing the content of the document, which is computationally expensive and is not possible at all for multimedia items that do not contain text.
To expand on the first of these disadvantages, there is a tremendous diversity in the words people use to describe the same concept (synonymy), and this places a low upper limit on the expected performance of keyword systems. If the user uses different words from the organizer (indexer) of the information, relevant materials might be missed. On the other hand, the same word can have more than one meaning (homonymy), leading to irrelevant materials being retrieved. This disadvantage compounds the fact that the basic models of content-based filtering assume a representation of documents as sets or vectors of index terms, and typically employ only primitive search strategies based solely on the occurrence of string sequences (terms) or combinations of terms.
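The effect of synonymy and of words with more than one meaning on a purely keyword-based matcher can be seen in a small sketch. The matcher below simply counts exactly matching terms; the terms and item names are hypothetical and chosen only to illustrate the two failure modes.

```python
def keyword_match(user_terms, item_terms):
    """Count the terms that occur, as exact strings, in both profiles."""
    return len(set(user_terms) & set(item_terms))


# The user is interested in cars, including the Jaguar car brand.
user_terms = {"car", "jaguar"}

items = {
    # About cars, but indexed with a synonym the user did not use.
    "relevant_but_missed": {"automobile", "sedan"},
    # About the animal; shares a string with the user profile by coincidence.
    "irrelevant_but_retrieved": {"jaguar", "rainforest", "predator"},
}

scores = {name: keyword_match(user_terms, terms) for name, terms in items.items()}
print(scores)
```

Because the comparison relies solely on the occurrence of identical strings, the relevant item scores zero while the irrelevant item is matched.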
In order to generate a representation for an item in a traditional content-based filtering method, the item has to be analyzed, possibly with a text classification algorithm, which extracts keywords/terms representing the item's content in the best way. This is one major drawback of content-based filtering, since this kind of representation causes ambiguity problems.
One way of dealing with the ambiguity is to use an ontology, which consists of a controlled vocabulary of terms or concepts and the semantic relationships among them. An ontology can bridge the gap between the user profile's terms and the terms used to represent the items. An ontology can be organized in a hierarchy of terms/concepts, according to their meaning.
An example of a domain ontology, and the one that, as will be described herein, is used in this invention, is IPTC NewsCodes [2], constructed from the subjects that can be associated with News items. This is a 3-level hierarchical ontology of concepts targeted to News description, currently containing approximately 1,400 concepts. A first-level concept of NewsCodes is called Subject, a second-level concept is called SubjectMatter, and a third-level, most specific concept is called SubjectDetail. FIG. 1 is an example of IPTC NewsCodes ontology.
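A three-level hierarchy of this kind can be held in a simple parent-link structure, as in the following sketch. The concept labels are illustrative placeholders and are not taken from the actual NewsCodes vocabulary, and the helper functions are hypothetical; the sketch only shows how a hierarchy relates a more specific concept to a more general one and does not represent the similarity measure of the invention.

```python
LEVEL_NAMES = ("Subject", "SubjectMatter", "SubjectDetail")

# Each concept maps to its parent; top-level concepts have no parent.
# Labels are illustrative placeholders, not actual NewsCodes entries.
PARENT = {
    "sport": None,
    "soccer": "sport",
    "soccer transfer": "soccer",
    "tennis": "sport",
    "politics": None,
    "election": "politics",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors, most specific first."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def level_name(concept):
    """Name of the hierarchy level on which the concept sits."""
    return LEVEL_NAMES[len(path_to_root(concept)) - 1]


def closest_common_ancestor(concept_a, concept_b):
    """Most specific concept shared by both paths, or None if there is none."""
    ancestors_of_a = path_to_root(concept_a)
    for concept in path_to_root(concept_b):
        if concept in ancestors_of_a:
            return concept
    return None


print(level_name("soccer transfer"))
print(closest_common_ancestor("soccer transfer", "tennis"))
print(closest_common_ancestor("soccer", "election"))
```

With such a structure, a user profile containing a general concept and an item profile containing one of its descendants can be recognized as related even though the two profiles share no identical term.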
The use of conceptual modeling in general, and ontology in particular, was initially incorporated by researchers with the intention of increasing the accuracy of content-based filtering compared to traditional keyword-based methods. Ontological and conceptual modeling was used in order to extract user profiles, such as the four-level ontology used in the Quickstep system [4]. Another user-profile acquisition method exploiting conceptual modeling was presented in the SiteIF system [3]. One of the first hierarchical representations for describing documents and user profiles, which attaches metadata to each document and uses the same method to generate a compatible representation of users' interests, was presented in [6]. Another work that used ontology for content-based retrieval was the electronic publishing system CoMet [5].
It is a purpose of the present invention to provide a novel ontology-based method of filtering and ranking the relevance of items to specific users.
It is another purpose of the present invention to provide a method that measures the similarity between items' and users' profiles in a unique way which has not been used in any other method.
It is another purpose of the present invention to provide a method of filtering and ranking the relevance of news content to specific readers in order to allow production of personalized newspapers.
Further purposes and advantages of this invention will appear as the description proceeds.