There are many situations where it is useful to be able to distinguish and interpret patterns in data sets and to be able to use such a pattern for selecting or ranking a set of items or users. In a typical situation, automatic predictions of different users' interests or preferences may be used for obtaining some kind of ranking or intelligent selection between a range of alternatives. Such predictions typically rely on collected information which is filtered, using some filtering mechanism, and on the underlying assumption that those users who had a similar taste in the past often tend to agree also in the near future. This principle may be used for various recommendation systems where the preferences of a number of users having a similar “preference pattern” to that of a reference user may be useful for recommending a selection of items to the reference user. Such a recommendation system may typically be directed to music, movies, restaurants, travelling destinations, etc.
Collaborative filtering is one of the most successful methods used in present product recommendation systems. The collaborative filtering concept is heavily based on finding correlations between users or items. The methods normally used to find these correlations typically rely on traditional distance and vector correlation measures, such as the Cosine correlation method, the Adjusted cosine correlation method, the Pearson correlation method, and the Spearman correlation method. When using any of the mentioned measures, a correlation is derived in the interval [−1,1], where −1 represents a decreasing linear relationship, while 1 represents an increasing linear relationship between correlated items or users. The higher the absolute correlation value, the stronger the correlation between the users or items.
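As a minimal sketch of two of the measures named above, the following Python code computes the cosine and Pearson correlations of two equally long rating vectors. The function names and the ratings are hypothetical and chosen only for illustration; they are not taken from any particular recommendation system.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical ratings on a 1 to 5 scale (assumption, for illustration only).
user_a = [1, 3, 5]
user_b = [2, 3, 4]   # rises linearly with user_a
user_c = [4, 3, 2]   # falls linearly with user_a

r_increasing = pearson(user_a, user_b)   # close to +1
r_decreasing = pearson(user_a, user_c)   # close to -1
```

In this sketch the two results lie at the two ends of the interval, corresponding to an increasing and a decreasing linear relationship, respectively.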
Two independent users or items will yield perpendicular vectors and a correlation equal to 0. Variables which have a correlation of 0 are, however, not necessarily independent. Since the described correlation coefficients only detect linear dependencies, it may be difficult to interpret a result correctly and reliably in this type of situation.
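The point that a correlation of 0 does not imply independence can be illustrated with a small sketch. Here the second variable is fully determined by the first (it is its square), yet the Pearson correlation is 0 because the dependency is not linear. The data and the function name are assumptions made for this illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# y depends entirely on x, but through a non-linear (quadratic) relationship.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]

r = pearson(x, y)   # the linear measure reports no correlation
```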
In collaborative filtering, the data to be processed is typically represented by a user-item matrix, R, as illustrated in FIG. 1. In the figure, matrix R comprises rating data, typically provided by m users, u1 . . . um, where each user is represented by a row-vector, i1 . . . in, in an n-dimensional space capable of covering n items. For each of the items in the matrix a rating, R1,1 . . . Rm,n, respectively, can be specified by a respective user, where each item in the matrix is represented by a column-vector in an m-dimensional space. In a typical scenario each position in the matrix will either comprise a rating that has been given to the respective item by a specific user, or be blank, in the event that the user for some reason has not rated that particular item. Hereinafter, this document will refer only to correlations between users. It should, however, be obvious to any person skilled in the art that correlation between users is given only as one possible exemplification, and that the alternative approach of instead performing correlations between different items may be applied in a corresponding way.
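One possible in-memory representation of such a user-item matrix is sketched below, assuming a small matrix with m = 3 users and n = 4 items and using None for a blank position. The ratings and the helper names user_vector and item_vector are hypothetical.

```python
# Hypothetical 3-user x 4-item rating matrix R; None marks an unrated item.
R = [
    [5, None, 3, None],   # user u1
    [None, 4, None, 2],   # user u2
    [1, 2, None, 5],      # user u3
]

def user_vector(matrix, user):
    """Row vector of one user (0-based index) across all n items."""
    return matrix[user]

def item_vector(matrix, item):
    """Column vector of one item (0-based index) across all m users."""
    return [row[item] for row in matrix]

ratings_of_u2 = user_vector(R, 1)   # n-dimensional row vector
ratings_of_i1 = item_vector(R, 0)   # m-dimensional column vector
```

Correlating rows gives correlations between users, while correlating columns gives the corresponding correlations between items.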
An example of a vector representation of a user who has given a number of ratings for a specific series of items is illustrated below, where a user, k, has given certain items, e.g. some watched films, out of a series, i1 . . . in, of items available for rating, a rating on a predefined scale. In this case the scale is a 1 to 5 scale, where 1 may represent the lowest rating, and 5 the highest rating. Items 1, 3 and n−1 have not been rated at all, and are thus left blank.
        i1      i2      i3      i4      i5      …       in−1    in
k               5               4       3                       2
By correlating the user vectors associated with two respective users, two by two, the users that have the most similar taste, or whose tastes differ the most from each other, may be identified. Once identified, this information may be used, e.g. for ranking and for recommending additional items to the user in focus, on the basis of the ranking.
However, only the co-rated items, i.e. those items for which both users have given a rating, can be used in the calculations for obtaining a measure of the interrelationship between the two users. Such a set of co-rated items can be denoted by:

$\{\, i \in I_u \cap I_v \,\}$  (1)

where $I_u$ and $I_v$ denote the sets of items rated by users u and v, respectively.
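The restriction to co-rated items can be sketched as follows, assuming that each user's ratings are stored as a mapping from item identifier to rating, so that unrated items are simply absent. The item identifiers, ratings and function names are hypothetical.

```python
import math

def co_rated(ratings_u, ratings_v):
    """Items rated by both users: the intersection of the two rated-item sets."""
    return sorted(set(ratings_u) & set(ratings_v))

def pearson_co_rated(ratings_u, ratings_v):
    """Pearson correlation over co-rated items only; None if it is undefined."""
    items = co_rated(ratings_u, ratings_v)
    if len(items) < 2:
        return None
    x = [ratings_u[i] for i in items]
    y = [ratings_v[i] for i in items]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return None   # one user gave the same rating to every co-rated item
    return sum(a * b for a, b in zip(dx, dy)) / denom

# Hypothetical ratings; each user has rated four items, three of them in common.
user_u = {"i2": 5, "i4": 4, "i5": 3, "i7": 2}
user_v = {"i1": 1, "i2": 4, "i4": 4, "i5": 2}

shared = co_rated(user_u, user_v)
r = pearson_co_rated(user_u, user_v)
```

Returning None when fewer than two items are shared, or when one user's co-rated ratings are constant, is a design choice of this sketch, since the coefficient is undefined in those cases.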
Often the group of co-rated items is relatively small compared to the whole set of items, i.e. only a limited number of the items which can be rated have actually been rated by a user. A situation where the executed correlations are based on a relatively small set of data may imply a false linear dependency, and thus an incorrect indication of corresponding, or deviating, user preferences in the particular field considered.
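The risk of a false linear dependency can be illustrated with an extreme case: with only two co-rated items, any two non-constant rating pairs lie on a straight line, so the Pearson correlation is exactly +1 or −1 regardless of how similar the users really are. The sketch below uses hypothetical ratings and compares the two-item result with the result over a larger, equally hypothetical, set of six co-rated items.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long rating lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Only two co-rated items: the coefficient suggests perfect agreement.
r_small = pearson([1, 5], [3, 4])

# The same two items followed by four further hypothetical co-rated items.
r_large = pearson([1, 5, 2, 4, 3, 5], [3, 4, 4, 2, 3, 1])
```

In this example the two-item correlation indicates perfect agreement, whereas the six-item correlation is weakly negative.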
Even though each of the different correlation methods mentioned above has its own strengths in finding correlations between users, they all also have their weaknesses, which makes it difficult to choose an overall suitable correlation method or scheme which will give a reliable result for a wide variety of possible scenarios. Tests show that all of the mentioned correlation methods tend to be more or less inaccurate when the Euclidean distance between at least some of the co-rated items of two vectors is large, or in other words, when one or a few co-rated items deviate from the majority of co-rated items. Inaccuracy also appears in many situations where a sparse set of rated items is correlated. As already mentioned, the different correlation methods known from prior art solutions only find and use linear correlations between the users or items.
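The sensitivity to a single deviating co-rated item can be sketched with the Pearson measure. In the hypothetical data below, two users agree exactly on six items; adding one further item on which they disagree strongly is enough to turn the correlation from +1 into a negative value. The ratings are invented for illustration and are not the test data referred to above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long rating lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Six co-rated items on which the two users agree exactly.
user_u = [3, 4, 3, 4, 3, 4]
user_v = [3, 4, 3, 4, 3, 4]
r_without_outlier = pearson(user_u, user_v)

# One additional co-rated item with strongly deviating ratings (5 versus 1).
r_with_outlier = pearson(user_u + [5], user_v + [1])
```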