Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated methods for analyzing relationships among highly formatted data, such as numerical data or data with a relatively small fixed number of possible values. However, a vast amount of information is expressed as text, including many database fields, reports, memos, e-mail, web sites, and external news articles of interest to managers, market analysts, and researchers.
Text mining extends the general notion of data mining to free or semi-structured text. In contrast to data mining, text data analysis (also referred to as “text mining,” or simply “text analysis”) refers to the analysis of text, and may involve such functions as text summarization, information visualization, document classification, document clustering, document summarization, and document cross-referencing. Thus, text data analysis may help a knowledge worker find relationships between individual unstructured or semi-structured text documents, as well as semantic patterns across large collections of such documents.
Research in the area of text mining has its roots in information retrieval, which began around 1960, when researchers started to systematically explore methods to match user queries to documents in a database. However, recent advances in computer storage capacity and processing power, coupled with massive increases in the amount of text available on-line, have resulted in a new emphasis on applying techniques learned from information retrieval to a wider range of text mining problems. Generally speaking, text mining requires the ability to automatically assess and characterize the similarity between two or more sources of text.
In general, text mining depends on the twin concepts of “document” and “term.” As used in this disclosure, a “document” refers to any body of free or semi-structured text. The text can include the entire content of a document in its general sense, such as a book, an article, a paper, or the like, or a portion of a traditional document, such as an abstract, a paragraph, a sentence, or a phrase (for example, a title). Ideally, a “document” describes a coherent topic. In addition, a “document” can encompass text generated from an image or other graphics, as well as text recovered from audio or video formats.
A document, in turn, can be represented as a collection of “terms,” each of which can appear in multiple documents. In some cases, a “term” can consist of an individual word used in the text. However, a “term” can also include multiple words that are commonly used together, for example, “landing gear.” Furthermore, the “terms” used to represent a portion of text need not explicitly appear in the text. Thus, the “terms” selected to represent a document may be obtained, for example, by applying acronym or abbreviation expansion, word stemming, spelling normalization, thesaurus-based substitutions, or many other techniques. Selecting the best set of “terms” to represent a given document generally will depend upon the particular document, or a collection to which the document belongs, as well as the specific goals of the text mining activity.
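By way of illustration only, term selection of this kind might be sketched as follows. The lookup tables `ABBREVIATIONS` and `PHRASES` and the function `extract_terms` are hypothetical names invented for this sketch; a practical system would likely use far larger tables and additional steps such as stemming or spelling normalization.

```python
import re

# Hypothetical lookup tables, invented for this illustration only.
ABBREVIATIONS = {"a/c": "aircraft"}               # abbreviation -> expansion
PHRASES = {("landing", "gear"): "landing_gear"}   # word pair -> single term


def extract_terms(text):
    """Return the list of terms representing one document."""
    # Lower-case the text and split it into word-like tokens.
    words = re.findall(r"[a-z0-9/]+", text.lower())
    # Abbreviation expansion: the resulting term need not appear in the text.
    words = [ABBREVIATIONS.get(w, w) for w in words]
    terms = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # Two words commonly used together become one multi-word term.
            terms.append(PHRASES[pair])
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms
```

In this sketch, the text "The A/C landing gear failed." yields the term "aircraft", which does not literally appear in the text, and the single multi-word term "landing_gear".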
Once a suitable set of terms has been defined for a document collection, the collection can be represented as a vector space. With respect to traditional vector space methods, individual documents are treated as vectors in a high-dimensional vector space in which each dimension corresponds to some feature of a document. A collection of documents can thus be represented by a two-dimensional matrix D(t,d) of features and documents. In the typical case, the features correspond to document terms, and the value of each term is the frequency of that term in the specified document. For example, if term t1 occurs four times in document d1, then D(1,1) is set to 4. Similarly, if term t2 does not occur in d1, then D(2,1) is set to 0. More complex types of vector space methods, such as latent semantic indexing (LSI), involve methods of transforming D, e.g. singular value decomposition (SVD) or semi-discrete decomposition (SDD), which typically attempt to provide a more sophisticated set of features and a better measure of the importance of each feature in a document.
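As a minimal sketch of the basic representation described above, a term-by-document frequency matrix can be built from documents that have already been reduced to lists of terms. The function name `term_document_matrix` and the alphabetical ordering of the vocabulary are assumptions made for this example.

```python
import numpy as np


def term_document_matrix(docs):
    """Build D, where D[i, j] is the count of term i in document j.

    `docs` is a list of documents, each given as a list of terms.
    Rows follow the sorted vocabulary; columns follow the order of `docs`.
    """
    vocab = sorted({term for doc in docs for term in doc})
    row_of = {term: i for i, term in enumerate(vocab)}
    D = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, doc in enumerate(docs):
        for term in doc:
            D[row_of[term], j] += 1
    return vocab, D
```

A term that occurs four times in the first document produces a 4 in the corresponding row of the first column, and a term absent from a document produces a 0, consistent with the example in the preceding paragraph.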
By representing documents as vectors in a feature space, similarities between documents can be evaluated by computing the distance between the vectors representing the documents. A cosine measure is commonly used for this purpose, but other distance measures can be used.
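The cosine measure mentioned above can be sketched as follows. The function name `cosine_similarity` and the convention of returning 0.0 when either vector is all zeros are assumptions of this example, not requirements of the method.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return float(np.dot(u, v) / denom)
```

Because the measure depends only on the angle between the vectors, two documents with the same relative term frequencies score close to 1 regardless of their lengths, while documents sharing no terms score 0.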
The advantages of the vector space method are that it provides a simple and uniform representation of documents and queries, can accommodate many variations appropriate to different document collections, and has been shown to perform relatively well in information retrieval applications. In addition, representing documents as vectors could be useful for all other text mining operations. However, the performance of the basic vector space method is severely limited by the size of D. In actual document collections, both the number of documents and the number of terms are typically quite large, resulting in a large D, and making the necessary distance calculations prohibitively slow. It is possible to alleviate this problem by preselecting a subset of all possible terms to use in the matrix, but this can degrade information retrieval performance and limit text mining capability. Finally, while the traditional vector space method provides a way of assessing the similarities between documents, it alone does not provide a good way to visualize these relationships or summarize documents.
As disclosed in U.S. Pat. No. 6,611,825, an improved vector space method has been developed that allows the user to efficiently perform a variety of text mining operations including information retrieval, term and document visualization, term and document clustering, term and document classification, summarization of individual documents and groups of documents, and document cross-referencing. In this technique, the document collection is represented using a subspace transformation based on the distribution of the occurrence of terms in the documents of the document collection. In particular, a term-by-document frequency matrix D is initially constructed that catalogs the frequencies of the various terms for each of the documents.
Next, the term-by-document matrix can be preprocessed to define a working matrix A by normalizing the columns of the term-by-document matrix D to have a unit sum, stabilizing the variance of the term frequencies via a nonlinear function, and then centering the term frequencies with respect to the mean vector of the columns. This preprocessing is denoted as $A = f(D) - ce^{T}$, in which c is the mean of the columns of $f(D)$ and e is a d-vector whose components are all equal to one, so that the average of the columns of A is now zero. The (i, j)th entry of A is therefore a score indicating the relative occurrence of the ith term in the jth document. Traditionally, $f$ is defined as a two-sided weighting function, i.e., $f(D) = (W_t D) W_d$, wherein $W_t$ and $W_d$ are two diagonal scaling matrices for weighting terms and documents, respectively, as known to those skilled in the art.
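The preprocessing steps can be illustrated with the following sketch. The text does not name the nonlinear variance-stabilizing function, so the square root used here is purely an assumption for illustration; the function name `preprocess` and the handling of empty documents are likewise assumptions.

```python
import numpy as np


def preprocess(D):
    """Return (A, c), where A = f(D) - c e^T and c is the column mean of f(D)."""
    D = np.asarray(D, dtype=float)
    # Step 1: normalize each column (document) to unit sum.
    col_sums = D.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0  # assumed: empty documents stay all-zero
    # Step 2: stabilize variance with a nonlinear function.
    # ASSUMPTION: square root; the source does not specify the function.
    F = np.sqrt(D / col_sums)
    # Step 3: center with respect to the mean of the columns.
    c = F.mean(axis=1, keepdims=True)  # one mean value per term
    A = F - c                          # broadcasting is equivalent to c e^T
    return A, c
```

After this step each row of A averages to zero across the documents, which is the same as saying that the average of the columns of A is the zero vector.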
To capture some of the semantics latent in the documents, i.e., to capture similarity of content despite variations in word usage such as the use of synonyms, the working matrix A is orthogonally decomposed to obtain a rank-k matrix $A_k$ that approximates A. In this regard, the orthogonal decomposition of the working matrix A can be performed with a number of decompositional techniques, such as a two-sided orthogonal decomposition.
By way of example, one typical two-sided orthogonal decomposition is a truncated URV (TURV) decomposition. For a given dimensionality k, the TURV computes bases of subspaces with high information content (matrices $U_k$ and $V_k$ with orthonormal columns) satisfying the equation $A V_k = U_k R_k$, wherein $R_k$ is a triangular matrix of order k. An approximate term-document matrix $A_k$ is then defined as $A_k = U_k R_k V_k^{T}$.
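The relationships among these factors can be checked numerically with a short sketch. NumPy does not provide a TURV routine, so this example substitutes a truncated singular value decomposition, which is a special case of a two-sided orthogonal decomposition in which the triangular factor happens to be diagonal. It is a stand-in for illustration, not the TURV algorithm itself, and the function name `rank_k_factors` is hypothetical.

```python
import numpy as np


def rank_k_factors(A, k):
    """Return (Uk, Rk, Vk) with A @ Vk == Uk @ Rk, using a truncated SVD.

    ASSUMPTION: the SVD stands in for a TURV decomposition. Rk is
    diagonal here, which is a special case of a triangular matrix.
    """
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    Uk = U[:, :k]        # orthonormal basis on the term side
    Rk = np.diag(s[:k])  # k-by-k middle factor
    Vk = Vt[:k, :].T     # orthonormal basis on the document side
    return Uk, Rk, Vk


def rank_k_approximation(A, k):
    """Form Ak = Uk Rk Vk^T from the factors above."""
    Uk, Rk, Vk = rank_k_factors(A, k)
    return Uk @ Rk @ Vk.T
```

The returned matrix has the same shape as A, so its rows still correspond to terms and its columns to documents.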
For the approximation $A_k$, as well as for A, each row corresponds to a term and each column corresponds to a document. The (i, j)th entry of $A_k$ therefore provides a relative occurrence of the ith term in the jth document, but this relative occurrence has now been filtered by the approximation which captures semantics latent in the documents. More specifically, the factor $U_k$ captures variations in vocabulary, while the factor $V_k^{T}$ brings out latent structure in the corpus of documents.
Accordingly, it is desirable to provide a method and apparatus that can be used in conjunction with a multidimensional subspace representation of an existing set of documents to analyze newly arriving streaming text data with regard to the content of the existing set of documents.