1. Field of the Invention
The present disclosure relates to computerized analysis of documents and, in particular, to the efficient and compact construction and representation of the levels of similarity among the documents in a set of documents. The disclosure further relates to using the compact representation of similarity in training a model for analyzing document relevance.
2. Background Information
Many modern applications involving the analysis or manipulation of free-text information objects, such as documents, depend on constructing and using an abstraction of the contents of the information objects. Applications such as document classification or filtering, for example, may use a representation of the class or desired topic that is based on a set (or vector) of terms extracted from a set of documents that exemplify the class or topic. Many techniques apply machine learning and statistical methods to the problem of learning the characteristic features of a set of examples representative of a class or topic, often referred to as a “training set.” Such techniques often proceed in part by constructing a data structure known in the art as a “similarity matrix” or “kernel matrix.” A similarity matrix is a table of values reflecting the levels of similarity between pairs of documents for all documents in the training set.
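By way of illustration only, the following minimal sketch shows one way such a similarity matrix might be built for a small training set. It assumes a simple whitespace-tokenized term-count vector for each document and cosine similarity as the similarity measure; neither choice is prescribed by the foregoing description, and the names term_vector, cosine, and similarity_matrix are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Return term counts for one document (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Compute every pair-wise similarity value for the given documents."""
    vectors = [term_vector(d) for d in docs]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Illustrative three-document training set (invented for this sketch).
training_set = [
    "stock market prices rise",
    "market prices fall sharply",
    "local team wins match",
]
K = similarity_matrix(training_set)
for row in K:
    print(["%.2f" % value for value in row])
```

In this sketch, the entry in row i and column j holds the similarity between document i and document j, so the table has one row and one column per training document.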
Some advanced techniques for the creation of classifiers or filters model both the positive exemplars and the negative exemplars of a topic, using a sample of the “true” (on-topic) and “false” (not-on-topic) documents to create a training set. One technique, called “support vector machines” (SVMs), models or characterizes the margin of separation between the positive and negative examples in a training set as a function of the combinations of the term vectors of each document. The optimal margin is discovered in a series of steps that are specific to each SVM algorithm. In order to facilitate the calculation of a margin, a similarity matrix (kernel matrix) of all the documents in the training set is constructed and used repeatedly.
A similarity matrix is conventionally created by computing all the respective pair-wise similarity values for the entire set of example documents in the training set used by a given learning algorithm. After the similarity matrix has been constructed, the entries of the matrix must be stored for further use, either on disk or in memory, with memory being used especially where quick access is needed, for instance, during the learning procedure. For large sets of training examples, both the storage (e.g., the amount of random access memory necessary to hold the matrix) and the computation process (e.g., the CPU cycles) require significant resources. The minimization of such resources represents an important and challenging problem.
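The scale of these resource requirements can be illustrated with a short calculation. The sketch below assumes that each matrix entry is stored as an 8-byte floating-point value and that the similarity measure is symmetric, so that only the entries on or above the diagonal need to be computed; both are assumptions made for illustration, and the function names are hypothetical.

```python
def dense_matrix_bytes(n_docs, bytes_per_entry=8):
    """Storage for a full n-by-n matrix (assumed: 8-byte floating-point entries)."""
    return n_docs * n_docs * bytes_per_entry


def pairwise_evaluations(n_docs):
    """Similarity computations needed if the measure is symmetric (assumed):
    one per unordered pair of documents, plus one per diagonal entry."""
    return n_docs * (n_docs + 1) // 2


for n in (1_000, 10_000, 100_000):
    gib = dense_matrix_bytes(n) / 2**30
    print(f"{n:>7} documents: {pairwise_evaluations(n):>13,} evaluations, "
          f"{gib:,.2f} GiB for the full matrix")
```

Under these assumptions, both quantities grow with the square of the number of training documents, so a tenfold increase in the size of the training set increases the storage and the computation roughly a hundredfold.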