The present application relates to systems and methods for classifying documents. Learning preferences among a set of objects (e.g., documents) given another object as a query is a central task of information retrieval and text mining. One of the most natural frameworks for this task is pairwise preference learning, which expresses that one document is preferred over another given the query. Most existing methods learn the preference or relevance function by assigning a real-valued score to a feature vector describing a (query, object) pair. This feature vector normally includes a small number of hand-crafted features, such as the BM25 scores for the title or the whole text, instead of the more natural raw features. A drawback of hand-crafted features is that they are often expensive to produce and specific to a dataset, and they require domain knowledge in preprocessing. In contrast, raw features (such as word features in text mining) are easily available and carry strong semantic information.
Polynomial models (which use combinations of features as new features) on raw features are powerful and easy to acquire in many preference learning problems. However, there are usually a very large number of features, which makes storing and learning the model difficult. For example, consider a basic model that uses the raw word features under the supervised pairwise preference learning framework and takes feature relationships into account. In this model, let $D$ be the dictionary size, i.e., the size of the query and document feature set. Given a query $q \in \mathbb{R}^D$ and a document $d \in \mathbb{R}^D$, the relevance score between $q$ and $d$ is modeled as:
$$f(q, d) = q^T W d = \sum_{i,j} W_{ij}\, \Phi(q_i, d_j), \tag{1}$$
where $\Phi(q_i, d_j) = q_i \cdot d_j$ and $W_{ij}$ models the relationship/correlation between the $i$-th query feature $q_i$ and the $j$-th document feature $d_j$. This is essentially a linear model with pairwise features $\Phi(\cdot,\cdot)$, and the parameter matrix $W \in \mathbb{R}^{D \times D}$ is learned from labeled data. Compared to most existing models, the capacity of this model is very large because of the $D^2$ free parameters, which can carefully model the relationship between each pair of words. From a semantic point of view, a notable advantage of this model is that it can capture synonymy and polysemy, as it looks at all possible cross terms, and it can be tuned directly for the task of interest.
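As an illustration only, Eq. (1) can be sketched in plain Python. The three-word dictionary, the weight values, and the function names below are hypothetical and are not taken from the application; the sketch merely shows that the bilinear form and the sum over pairwise features give the same score, and how an off-diagonal entry of W can reward a synonym match.

```python
def bilinear_score(q, W, d):
    """Return q^T W d for a query q, matrix W, and document d (Python lists)."""
    # Compute the vector W d first, then its inner product with q.
    Wd = [sum(w_ij * d_j for w_ij, d_j in zip(row, d)) for row in W]
    return sum(q_i * wd_i for q_i, wd_i in zip(q, Wd))


def pairwise_feature_score(q, W, d):
    """Return the same score as a linear model over features q_i * d_j."""
    total = 0.0
    for i, q_i in enumerate(q):
        for j, d_j in enumerate(d):
            total += W[i][j] * (q_i * d_j)  # W_ij * Phi(q_i, d_j)
    return total


# Hypothetical dictionary of D = 3 words: 0 "car", 1 "automobile", 2 "banana".
# The off-diagonal 0.8 entries are made-up weights linking the two synonyms.
W = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]          # the query contains only "car"
doc_synonym = [0.0, 1.0, 0.0]    # the document contains only "automobile"
doc_unrelated = [0.0, 0.0, 1.0]  # the document contains only "banana"

score_synonym = bilinear_score(query, W, doc_synonym)
score_unrelated = bilinear_score(query, W, doc_unrelated)
print(score_synonym, score_unrelated)
```

With these assumed weights, the document containing the synonym receives a positive score even though it shares no word with the query, while the unrelated document scores zero. A purely diagonal W would score both documents as zero.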
Although it is very powerful, the basic model in Eq. (1) suffers from the following weaknesses, which hinder its wide application:
1. Memory requirement: Given a dictionary size $D$, the model requires a large amount of memory to store the matrix $W$, whose size is quadratic in $D$. When $D = 10{,}000$, storing $W$ needs nearly 1 GB of RAM (assuming double-precision entries); when $D = 30{,}000$, storing $W$ requires more than 7 GB of RAM.
2. Generalization ability: With $D^2$ free parameters (the entries of $W$), the model can easily overfit when the number of training samples is limited. For a dictionary of size $D = 10{,}000$, there are $D^2 = 10^8$ free parameters to estimate, which is far too many for small corpora.
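The arithmetic behind these two weaknesses can be checked with a short sketch. It assumes a dense matrix with 8-byte double-precision entries and decimal gigabytes (10^9 bytes); the function names are illustrative only.

```python
BYTES_PER_DOUBLE = 8  # assumption: 64-bit floating-point entries


def dense_w_parameters(dictionary_size):
    """Number of free parameters in a dense D x D matrix W."""
    return dictionary_size ** 2


def dense_w_bytes(dictionary_size, bytes_per_entry=BYTES_PER_DOUBLE):
    """Bytes needed to store every entry of the dense matrix W."""
    return dense_w_parameters(dictionary_size) * bytes_per_entry


for size in (10_000, 30_000):
    n_params = dense_w_parameters(size)
    gigabytes = dense_w_bytes(size) / 1e9  # decimal GB
    print(f"D={size}: {n_params:.1e} parameters, {gigabytes:.1f} GB")
```

Under these assumptions the sketch reports about 0.8 GB for D = 10,000 and about 7.2 GB for D = 30,000, the same order of magnitude as the figures stated above.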
Recently, researchers have found that raw features (e.g., words for text retrieval) and their pairwise features, which describe relationships between two raw features (e.g., word synonymy or polysemy), can greatly improve retrieval precision. However, most existing methods cannot scale up to problems with many raw features (e.g., the English vocabulary), due to the prohibitive computational cost of learning and the memory required to store a quadratic number of parameters.
Since such models are not practical, present systems often create a smaller feature space using dimensionality reduction techniques such as principal component analysis (PCA). When raw features are used, polynomial models are avoided. When polynomial models are used, various approaches can be applied, including:
1. Sparse model: remove the parameters that are less important. However, empirical studies on very large sparse models are lacking.
2. Low rank approximation: decompose the relationship matrix.
3. Hashing: map the large number of parameters into a smaller number of bins.
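Two of the approaches listed above, low rank approximation and hashing, can be sketched as follows. The text does not specify either construction, so the factorization W ≈ UᵀV (with U and V of size k × D), the modulo-based bin assignment, and all names and values below are assumptions chosen for illustration.

```python
def low_rank_score(q, U, V, d):
    """Score q^T (U^T V) d, where U and V are k x D lists of lists."""
    # Project the query and the document into the k-dimensional space.
    q_latent = [sum(u * q_i for u, q_i in zip(row, q)) for row in U]
    d_latent = [sum(v * d_j for v, d_j in zip(row, d)) for row in V]
    return sum(a * b for a, b in zip(q_latent, d_latent))


def hashed_score(q, bins, d):
    """Score q^T W d where each entry W_ij is read from a shared bin."""
    size = len(d)
    total = 0.0
    for i, q_i in enumerate(q):
        for j, d_j in enumerate(d):
            # Hypothetical hash: flatten (i, j) and wrap it into the bins.
            total += bins[(i * size + j) % len(bins)] * q_i * d_j
    return total


def parameter_counts(dictionary_size, rank, n_bins):
    """Number of stored parameters for each representation."""
    return {
        "dense": dictionary_size ** 2,
        "low_rank": 2 * rank * dictionary_size,
        "hashed": n_bins,
    }


# Toy values with D = 4 and k = 2 (made up for the example).
U = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
V = [[0.5, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 3.0]]
query = [1.0, 0.0, 0.0, 0.0]
doc = [0.0, 0.0, 1.0, 0.0]
bins = [0.5, -0.25, 1.0]

print(low_rank_score(query, U, V, doc))
print(hashed_score(query, bins, doc))
print(parameter_counts(10_000, 100, 1_000_000))
```

In this sketch the low rank form stores 2kD values instead of D², and the hashed form stores only as many values as there are bins, at the cost of forcing unrelated word pairs to share a parameter.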
In a related trend, unsupervised dimensionality reduction methods such as Latent Semantic Analysis (LSA) have been widely used in the field of text mining for hidden topic detection. The key idea of LSA is to learn a projection matrix that maps the high-dimensional vector space representations of documents to a lower-dimensional latent space, the so-called latent topic space. However, LSA cannot provide a clear and compact topic-word relationship, because it represents each topic as a weighted combination of all words in the vocabulary.
Two existing models are closely related to LSA and have been used to find compact topic-word relationships in text data. Latent Dirichlet Allocation (LDA) provides a generative probabilistic model, from a Bayesian perspective, to search for topics. LDA can provide the distribution of words given a topic and hence rank the words for a topic. However, LDA can only handle a small number of hidden topics. Sparse coding, another unsupervised learning algorithm, learns basis functions that capture higher-level features in the data and has been successfully applied in image processing and speech recognition. Sparse coding can provide a compact representation of the document-to-topic relationship, but not of the topic-to-word relationship, since the topics are learned basis functions associated with all words.