Since the Internet became accessible to the general public in the early 1990s, and consumer electronic commerce mediated by the World Wide Web became widespread in the mid-to-late 1990s, recommendation systems have become relatively commonplace. These systems are used to select products for display to consumers as they browse web sites, and are generally based on some analysis of the purchasing and/or browsing behavior of consumers visiting the site.
As e-commerce has become more prevalent, more and more data has become available for recommendation engines to analyze. These large amounts of data improve the robustness of results, but create scalability problems. Furthermore, the complexity of the interactions represented in the data has the potential to increase dramatically. Existing systems often ignore such complexities, focusing on a model that embodies a radical simplification of the actual relationships and interactions among entities. Thus, most conventional systems generate models that operate in terms of a single kind of interaction between one kind of active agent (the consumer) and one kind of object (books, music, web pages or video, as the case may be). This simplification allows conventional systems to provide reasonable recommendations using standard mathematical techniques when large amounts of data are available. However, such a simplification can yield substandard results, and can limit the flexibility and power of the recommendation system.
Dyadic Learning Techniques
Many conventional recommendation systems approach the problem of recommendation by constructing a bipartite set containing two kinds of objects, X and Y, related by a single relation. In these systems, training data T={(xi, yi, ki)} is provided, where xi∈X, yi∈Y and ki∈R. The value ki is an optional association strength. An input Z={zi} is provided, where zi∈Y. The training data can be viewed as rowsets Tx={y|(x, y, k)∈T} that are sampled from some multinomial distribution with parameters P(y). In terms of viewing, listening or buying histories, P(y) represents the probability that a viewer, listener or buyer will take the desired action on y. In general, it is beneficial (i.e., it would increase the total number of views or purchases and increase general user interest) to present y's having large values of P(y) to users.
The goal of the recommendation system is to produce a result set R={rm}⊂Y such that P(rm) is large. The elements of the set R are the recommended items. This is similar to the formulation described in, for example, T. Hofmann, J. Puzicha, and M. Jordan, Learning from dyadic data, 1999.
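As a minimal sketch of this setting (the function name, data, and estimation scheme here are all hypothetical, chosen only for illustration), pooled association strengths can serve as a crude estimate of P(y), with the result set taken as the items having the largest estimates:

```python
# Toy sketch of dyadic recommendation: estimate P(y) from training
# triples (x, y, k) and return the m items with the largest estimates.
from collections import defaultdict

def recommend(training, m=3):
    """Pool association strengths as a simple estimate of P(y)."""
    totals = defaultdict(float)
    for x, y, k in training:
        totals[y] += k
    z = sum(totals.values())
    p = {y: v / z for y, v in totals.items()}   # estimated P(y)
    return sorted(p, key=p.get, reverse=True)[:m]

T = [("u1", "a", 1.0), ("u1", "b", 2.0), ("u2", "b", 1.0), ("u2", "c", 0.5)]
print(recommend(T, m=2))   # -> ['b', 'a']
```

Real systems condition on the input history z rather than pooling globally; the later sections describe the matrix-based, nearest-neighbor, and item-based ways of doing so.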
The dyadic learning framework appears in many applications. In text retrieval, the two kinds of objects are words and documents. The relation is whether a word appears in a document; recommendation is text retrieval with relevance feedback, as described, for example, in J. J. Rocchio, Jr., “Relevance feedback in information retrieval,” in Gerard Salton, editor, The SMART Retrieval System, pages 313-323, Prentice Hall, Englewood Cliffs, N.J., 1971. In music recommendation systems, the two kinds of objects are listeners and artists, and the relation represented is “who listened to whose music?”. In advertising analysis systems, the two kinds are consumers and vendors, with the relation being “who purchased from whom?” or (for web visitors and ads), “who clicked on what?”. In eCommerce systems, consumers and products are the kinds of objects, and the relation is “who bought what?”. See, for example, Greg Linden, Brent Smith, and Jeremy York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, 7(1):76-80, January/February 2003. All of these diverse systems are variants of the same general pattern in which objects of one kind interact with objects of another in exactly one way.
Dyadic learning approaches in the past have generally fallen into the category of matrix-based, item-based, or nearest-neighbor techniques.
Matrix-Based Learning Techniques
Matrix-based techniques express subsets of X and Y as vectors with an element for each member of the set and length equal to the cardinality of the set. The training data T is expressed as a matrix with rows corresponding to elements of X, columns corresponding to elements of Y,
t_{xy} = \begin{cases} k & \text{if } (x, y, k) \in T \\ 0 & \text{otherwise} \end{cases}
The goal in matrix-based methods is to use T to find a function A, such that r=Az. In practice, Az is often more or less a matrix product, although it is only rarely implemented with an explicit matrix using a matrix algebra package. Traditional vector-based text retrieval [Sal91], as the name suggests, uses a suitably weighted and normalized version of T itself so that A=DdocTDterm, where Ddoc and Dterm are diagonal matrices that perform document normalization and term weighting, respectively. See, for example, Gerard Salton, “Developments in automatic text retrieval,” in Science, 253:974-980, 1991.
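A toy illustration of the r = Az formulation, with one reasonable (but purely illustrative) choice of length normalization and inverse-document-frequency term weighting:

```python
# Sketch of matrix-based retrieval: scores r = A z with A = D_doc T D_term.
import numpy as np

T = np.array([[1., 0., 2.],            # rows: documents, cols: terms
              [0., 1., 1.]])
D_doc = np.diag(1.0 / np.linalg.norm(T, axis=1))   # document normalization
df = np.count_nonzero(T, axis=0)                   # document frequency
D_term = np.diag(np.log(T.shape[0] / df) + 1.0)    # a simple term weighting
A = D_doc @ T @ D_term

z = np.array([0., 0., 1.])             # query containing only the third term
r = A @ z                              # document scores
print(r)
```

Document 0, which contains the query term twice, scores higher than document 1, which contains it once.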
Latent Variable Techniques
Latent variable techniques are matrix-based recommendation systems in which A is implemented using a reduced dimensionality decomposition of some kind. One such method is latent semantic analysis (LSA), as described in Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman, “Indexing by latent semantic analysis,” in Journal of the American Society of Information Science, 41(6):391-407, 1990. LSA was originally applied to text retrieval where the ki were defined as co-occurrence frequencies weighted by inverse document frequency and the matrix product used was the traditional product from linear algebra. The weighted term matrix A=DdocTDterm is decomposed using singular value decomposition, A ≈ UΣnV′, where Σn contains only the largest n singular values.
Text retrieval with LSA is done by computing document scores (UΣnV′)z. LSA can be used in any dyadic learning problem, and recommendation can be done with LSA by taking r as the largest elements of the vector A′Az = VΣU′UΣV′z = VΣ²V′z. Notably, this product can be computed in various orders. For latent variable techniques, it is common to use (VΣ)((ΣV′)z). This converts z to the reduced-dimensional representation using (ΣV′)z and then converts back to items using VΣ.
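A small sketch of this computation using a truncated SVD; the training matrix and history vector are invented for illustration:

```python
# Sketch of LSA-style recommendation: A ≈ U Σ_n V', score items as
# (V Σ)((Σ V') z), i.e. project z into the reduced space and back.
import numpy as np

A = np.array([[1., 1., 0., 0.],        # rows: users, cols: items
              [1., 0., 1., 0.],
              [0., 0., 1., 1.]])
n = 2                                  # number of retained singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vn, sn = Vt[:n].T, s[:n]               # items as rows of Vn

z = np.array([1., 0., 0., 0.])         # history: first item only
reduced = (np.diag(sn) @ Vn.T) @ z     # (Σ V') z: reduced representation
scores = Vn @ np.diag(sn) @ reduced    # V Σ back to items, equals V Σ² V' z
r = np.argsort(scores)[::-1]           # items ranked by score
print(r)
```

Items that co-occur with the history item inherit high scores even when the raw co-occurrence count is zero, which is the smoothing effect the decomposition provides.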
HNC Software achieved similar results with their so-called context vectors, as described in William R. Caid, Susan T. Dumais, and Stephen I. Gallant, “Learned vector-space models for document retrieval,” in Information Processing and Management, 31(3):419-429, 1995. In the context vector system, X consisted of documents and Y of words. Words were assigned random values from Rn, where n was typically chosen to be approximately 300, and training epochs were conducted with words being assigned new vectors constructed from the vectors of the words occurring in the same document. The context vector system can be seen as similar to LSA, since executing t steps of the HNC training algorithm was effectively computing X(t) = (A′A)^t X(0), where X(0) is an n-column matrix containing random initial word vectors as rows, each normalized so that X(0)′X(0)=I. The relationship with LSA can be seen from
(A'A)^t X^{(0)} = \left(V \Sigma U' U \Sigma V'\right)^t X^{(0)} = V \Sigma^{2t} V' X^{(0)}
As t gets large, the largest singular value dominates all others, so X(t) ≈ σ1^{2t} v1v1′X(0). The HNC group used a variety of early stopping rules or “repulsion” rules. These rules have the effect of partially orthogonalizing the resulting vectors in X(t), which forces the learning algorithm to find more than just the first singular vector. Analogous orthogonalization (with better understood effect) is used in the Lanczos algorithm, described in, for example, Gene H. Golub and Charles F. Van Loan, Matrix Computations, Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press, 3rd edition, 1996.
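The iteration X(t) = (A′A)^t X(0) with a re-orthogonalization step standing in for the repulsion rules can be sketched as follows (all sizes and data are illustrative); this is essentially orthogonal iteration:

```python
# Sketch of the power-iteration view of context-vector training:
# repeatedly apply A'A and re-orthogonalize, so the columns converge
# to the leading right singular vectors of A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 8))                # 20 documents, 8 words
n = 3                                  # context-vector dimensionality
X = np.linalg.qr(rng.standard_normal((8, n)))[0]   # random orthonormal X(0)

for _ in range(50):                    # training epochs
    X = A.T @ (A @ X)                  # one step of X <- (A'A) X
    X = np.linalg.qr(X)[0]             # re-orthogonalize (the "repulsion" role)

_, _, Vt = np.linalg.svd(A)
print(abs(X[:, 0] @ Vt[0]))            # alignment with top singular vector
```

Without the orthogonalization step, every column would collapse onto the first singular vector, which is the failure mode the repulsion rules prevent.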
More recently, the original motivation for LSA in terms of smoothing the training matrix T has been reinterpreted in work by Buntine and Jakulin (discrete component analysis or DCA) and by Hofmann (probabilistic LSI or pLSI) and Blei, Ng and Jordan (latent Dirichlet allocation or LDA). See, for example, Wray Buntine and Aleks Jakulin, Discrete component analysis; W. Buntine, Applying discrete PCA in data analysis, 2004; Thomas Hofmann, “Probabilistic latent semantic analysis,” in Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999; and D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” in Journal of Machine Learning Research, 3, 2003.
In this work, a decomposition similar to that used in LSA is used, except that the singular vectors in U and V are interpreted as conditional probabilities of a hidden multinomial factor. The reduced representations for documents or words in these approaches are the parameters of the multinomial distribution of this hidden factor, and the metric of approximation is no longer the L2 norm but is instead based on log-likelihood or log-evidence. Where the singular vectors in LSA were taken as orthonormal vectors from Rd, in DCA, pLSI and LDA the representation vectors are restricted to the unit simplex. It is interesting to note that the probability that samples a and b from two multinomials with parameters θa and θb are equal is just
p(a = b \mid \theta_a, \theta_b) = \sum_x p(a = x \mid \theta_a)\, p(b = x \mid \theta_b) = \sum_i \theta_{ai} \theta_{bi}
Similarly, if θa and θb are Dirichlet distributed with parameters αama and αbmb respectively, then
p(a = b \mid \alpha_a, m_a, \alpha_b, m_b) = \int p(a = b \mid \theta_a, \theta_b)\, p(\theta_a \mid \alpha_a, m_a)\, p(\theta_b \mid \alpha_b, m_b)\, d\theta_a\, d\theta_b = \sum_i m_{ai} m_{bi}
The fact that these probabilities can be expressed in the form of sums of products makes much of the machinery in a practical LDA or DCA system very similar to that of an LSA-based system. Moreover, LDA and DCA can be viewed as alternative methods for approximating the same posterior probabilistic representation, and the differences between these methods have primarily to do with how the conditional probabilities are estimated. The LDA work highlights the variational approach of Jordan and others in variational techniques while the DCA work makes use of Gibbs sampling. See, for example, Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul, “An introduction to variational methods for graphical models,” in Machine Learning, 37(2):183-233, 1999; and R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo methods,” in Technical Report CRG-TR-93-1, University of Toronto, 1993. Jakulin and Buntine have provided a comprehensive framework that makes the relationships between these methods clear; see Buntine and Jakulin.
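A quick numeric check of the sum-of-products form of the collision probability, using arbitrary example parameters (the Dirichlet-averaged form has the same shape, with m in place of θ):

```python
# Check that p(a = b | theta_a, theta_b) = sum_i theta_ai * theta_bi by
# brute-force enumeration of the joint outcomes.
theta_a = [0.5, 0.3, 0.2]
theta_b = [0.1, 0.6, 0.3]

closed_form = sum(pa * pb for pa, pb in zip(theta_a, theta_b))
brute_force = sum(theta_a[i] * theta_b[j]
                  for i in range(3) for j in range(3) if i == j)
# Both equal 0.5*0.1 + 0.3*0.6 + 0.2*0.3 = 0.29
print(closed_form, brute_force)
```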
Nearest-Neighbor Recommendations
Nearest-neighbor techniques use various methods to find rows from the training data R(xj)={y|(xj, y)∈T} that are similar to z. Frequently occurring elements of these neighboring rowsets are then used to form the recommendation set. Just as the matrix-based techniques require a suitable matrix product, nearest-neighbor techniques require a suitable similarity measure.
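A minimal sketch of this approach, assuming Jaccard similarity over item sets as the similarity measure (one reasonable choice among many; all names and data are illustrative):

```python
# Sketch of nearest-neighbor recommendation: find the training rowsets
# most similar to the query history z, then recommend their frequent items.
from collections import Counter

def jaccard(a, b):
    """One possible similarity measure over item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def nn_recommend(rowsets, z, n_neighbors=2, m=2):
    neighbors = sorted(rowsets.values(), key=lambda r: jaccard(r, z),
                       reverse=True)[:n_neighbors]
    counts = Counter(y for r in neighbors for y in r if y not in z)
    return [y for y, _ in counts.most_common(m)]

rowsets = {"u1": ["a", "b", "c"], "u2": ["a", "b", "d"], "u3": ["e", "f"]}
print(nn_recommend(rowsets, z=["a", "b"]))
```

Note that the outer loop is over users, which is exactly why the technique scales poorly as the user population grows.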
Nearest-neighbor recommendations thus produce item recommendations for a user by using the user's history to find similar users and then examining those users to determine which items they interacted with anomalously often. Early examples of nearest-neighbor techniques include the Ringo system for music recommendation and the GroupLens system. See, for example, Upendra Shardanand and Patti Maes, “Social information filtering: Algorithms for automating ‘word of mouth,’” in Proceedings of ACM CHI'95 Conference on Human Factors in Computing Systems, volume 1, pages 210-217, 1995; and P. Resnick, N. Iacovou, M. Suchak, P. Bergstorm, and J. Riedl, “GroupLens: An Open Architecture for Collaborative Filtering of Netnews,” in Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, pages 175-186, Chapel Hill, N.C., ACM, 1994.
Nearest-neighbor recommendation systems can be difficult to implement in real-time settings since the set of users tends to grow faster than the set of items. These difficulties were the motivation for groups adapting market basket analysis to produce item-based recommendation systems. Item-based recommendation systems are described below.
Item Based Recommendations
The item-based recommendation system as described by Linden et al. uses a pre-processing step to derive item-to-item recommendations. Then, when recommending items for a user, the item-to-item recommendation lists for the items in the history vector z are merged to produce a recommendation list for the user.
Item-based recommendation systems are, in fact, simply a special case of market basket analysis, as described in Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, 1st edition, 1997. ISBN-10: 0471179809. The key feature of all market basket analyses is the creation of item sets based on co-occurrence within the set of items that users have interacted with. In market basket analysis, the interaction is a purchase, but in other applications, the interaction might involve viewing a video or clicking on a search result. These item sets are combined by merging to produce a recommendation.
The methods used to produce item-sets generally reduce to the examination of the co-occurrence matrix T′T. This matrix is examined (in a sparse representation) to find elements that do not appear to be due to chance, usually by performing a statistical test on the elements. One suitable test of this sort makes use of the G2 statistic, as described in Ted E. Dunning, “Accurate methods for the statistics of surprise and coincidence,” in Computational Linguistics, 19(1):61-74, 1993. Examination of the weighted co-occurrence matrix A′A can provide similar results if the weighting factors are chosen appropriately. Those items in each row of the co-occurrence matrix that are deemed most important are retained. The result is a sparse matrix S ~ T′T that can be used to produce an item recommendation r=Sz.
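A toy sketch of this pipeline, with a simple top-k filter standing in for the statistical test on the co-occurrence counts (the data and cutoff are illustrative):

```python
# Sketch of item-based recommendation: form T'T, sparsify each row to its
# largest off-diagonal entries, and score r = S z.
import numpy as np

T = np.array([[1, 1, 0, 0],            # rows: users, cols: items
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
cooc = T.T @ T
np.fill_diagonal(cooc, 0)              # self co-occurrence is uninformative

k = 2                                  # entries retained per row
S = np.zeros_like(cooc)
for i, row in enumerate(cooc):
    keep = np.argsort(row)[-k:]        # stand-in for a G2-style test
    S[i, keep] = row[keep]

z = np.array([1., 0., 0., 0.])         # history: item 0 only
r = S @ z                              # item scores
print(r)
```

Because S is built offline and indexed by item, the per-user work at recommendation time is just the sparse product Sz, which is the scalability advantage over nearest-neighbor methods.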
The Problem: p-adic and Multitransitive Learning
In practice, the restriction to dyadic learning limits the capabilities of a recommendation system. As a concrete example, consider a situation where publishers add content items to a content management system. Viewers then may view these content items. Viewers may also forward content items to other viewers. Viewers may also add tags to content items. Exactly what kind of content we are dealing with in this example is not important; videos, pictures, music, books or other items are all plausible.
This example can be formalized by using the following two binary relations and two ternary relations:

    publish(p, c)
    view(v, c)
    forward(v1, c, v2)
    tag(v, c, t)
where we use p, c, v, and t to represent members of the corresponding sets P, C, V and T of publishers, content items, viewers and tags respectively. We might also extend these relations to be integer-valued functions if we wish to record the number of times an event has happened over a particular interval of time. In either case, we can represent these relations as the two and three dimensional matrices Ppc, Vvc, Fv1cv2, and Tvct. A three-dimensional matrix is a data structure analogous to a conventional matrix, but with three indices rather than the more conventional two. Three dimensional matrices are sometimes referred to as data cubes.
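A small sketch of such a three-dimensional count matrix for the ternary tag relation (the entity names and events are invented for illustration):

```python
# Sketch of representing tag(v, c, t) as a three-dimensional count matrix
# (a data cube) with one axis per entity class.
import numpy as np

viewers, content, tags = ["v1", "v2"], ["c1", "c2", "c3"], ["t1", "t2"]
T = np.zeros((len(viewers), len(content), len(tags)), dtype=int)

events = [("v1", "c1", "t1"), ("v1", "c1", "t1"), ("v2", "c3", "t2")]
for v, c, t in events:                 # integer-valued: count repeats
    T[viewers.index(v), content.index(c), tags.index(t)] += 1

print(T[0, 0, 0])                      # v1 tagged c1 with t1 twice -> 2
print(T.sum(axis=0))                   # content x tag counts over all viewers
```

Summing over one axis recovers an ordinary two-dimensional matrix, but only by discarding one of the three entity classes, which is exactly the information a dyadic method cannot keep.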
As with dyadic learning, we consider the data in these matrices to be sampled from a distribution that we would like to estimate. Note, however, that because we have more than two sets of objects (we have four, in fact), a normal matrix cannot be used to represent the training data, nor can it reasonably be expected to be useful in estimating the distribution. Moreover, even if we restrict ourselves to considering a single relation, the forward and tag relations are not even binary, and thus, even in isolation, they cannot be well modeled by a normal matrix. In mathematical terminology, the learning problem is no longer dyadic, but is now tetradic (normally written 4-adic, or p-adic in the general case) because we have four classes of objects. In linguistic terms, the forward and tag relations represent bitransitive verbs, as opposed to the transitive verbs represented by publish and view. It is exactly this bitransitivity that makes normal matrix representations, and all of the techniques based on them, infeasible.
Even if this system were composed only of binary relations, however, it would only be possible to represent the training data using adjoined matrices. For instance, the publish and view relations above could be represented by two adjoined matrices,
T = \begin{pmatrix} T_{\text{publish}} & 0 \\ 0 & T_{\text{view}} \end{pmatrix}
This would allow recommendation methods based on decomposition to be applied to the adjoined training data,
T = \begin{pmatrix} U_{\text{publish}} \Sigma_{\text{publish}} V'_{\text{publish}} & 0 \\ 0 & U_{\text{view}} \Sigma_{\text{view}} V'_{\text{view}} \end{pmatrix} = \begin{pmatrix} U_{\text{publish}} & 0 \\ 0 & U_{\text{view}} \end{pmatrix} \begin{pmatrix} \Sigma_{\text{publish}} & 0 \\ 0 & \Sigma_{\text{view}} \end{pmatrix} \begin{pmatrix} V'_{\text{publish}} & 0 \\ 0 & V'_{\text{view}} \end{pmatrix}
The co-occurrence matrix is also straightforward to compute with this system,
T'T = \begin{pmatrix} T'_{\text{publish}} T_{\text{publish}} & 0 \\ 0 & T'_{\text{view}} T_{\text{view}} \end{pmatrix}
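The block-diagonal construction and its co-occurrence can be sketched directly (the relation matrices here are toy data):

```python
# Sketch of adjoining the publish and view relations block-diagonally and
# computing the co-occurrence, which stays block-diagonal.
import numpy as np

T_publish = np.array([[1., 0.], [1., 1.]])   # publishers x content
T_view = np.array([[0., 1.], [1., 1.]])      # viewers x content

T = np.zeros((4, 4))
T[:2, :2] = T_publish                        # adjoin along the diagonal
T[2:, 2:] = T_view

cooc = T.T @ T
# The off-diagonal blocks are identically zero: item vectors derived from
# publishing carry no information about item vectors derived from viewing.
print(cooc)
```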
Unfortunately, this block-diagonal strategy of adjoining two dimensional matrices cannot represent certain relations. In particular, the item vectors that are derived by analyzing the publishing patterns have no relationship to the item vectors that are derived by analyzing the viewing patterns. Even worse, this strategy cannot be extended to the ternary relations such as tagging at all. This problem is inherent in the fact that two dimensional matrices have only two indices and thus are not at all suited to the representation of ternary relations.
What is needed is a system and method that avoids these limitations by providing for p-adic learning paradigms and thereby accounting for different types of relationships among entities. What is further needed is a recommendation system and method that provides improved scalability over prior art systems, without oversimplifying the data set.