A vast amount of information exists in text form, such as free, unstructured, or semi-structured text, including many database fields, reports, memos, email, web sites, and news articles. This information can be of interest to a variety of individuals and entities, such as business managers, market analysts, researchers, private companies, public companies and government agencies. As a result, methods have been developed to analyze textual resources.
Text analysis, or text data analysis, can involve such functions as document classification, document clustering, information visualization, text or document summarization, and document cross-referencing. Text data analysis has adopted certain techniques from the more general field of data analysis, including sophisticated methods for analyzing relationships among highly formatted data, such as numerical data or data with a relatively small fixed number of possible values. Such techniques can expose patterns and trends in the data. Text data analysis can identify relationships between individual free, unstructured or semi-structured text documents, as well as more general semantic patterns across large collections of such documents.
Some methods of text data analysis depend on the twin concepts of “document” and “term.” As used in this disclosure, a “document” refers to any body of free, unstructured or semi-structured text. The text can include the entire content of a document in its general sense, such as a book, an article, a paper, or the like, or a portion of a traditional document, such as an abstract, a paragraph, a sentence, or a phrase (for example, a title). Ideally, a “document” describes a coherent topic. In addition, a “document” can encompass text generated from an image or other graphics, as well as text recovered from audio or video formats.
On the other hand, a document can be represented as a collection of “terms,” each of which can appear in multiple documents. In some cases, a “term” can consist of an individual word used in the text. However, a “term” can also include multiple words that are commonly used together, for example, “landing gear.” Furthermore, the “terms” used to represent a portion of text need not explicitly appear in the text. Thus, the “terms” selected to represent a document may be obtained, for example, by applying acronym or abbreviation expansion, word stemming, spelling normalization, thesaurus-based substitutions, or many other techniques. Selecting the best set of “terms” to represent a given document generally will depend upon the particular document, or a collection to which the document belongs, as well as the specific goals of the text analysis activity.
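As a minimal sketch of how terms need not appear literally in the text, the following Python fragment expands abbreviations and merges words that are commonly used together. The abbreviation table, the phrase table, and the function name extract_terms are hypothetical and chosen only for illustration; a real system would draw on far larger resources and could add stemming or thesaurus substitution.

```python
# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"ldg": "landing"}
PHRASES = {("landing", "gear"): "landing gear"}


def extract_terms(text):
    """Map raw text to a list of terms (assumes whitespace tokenization)."""
    # Lowercase, split on whitespace, and expand known abbreviations.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            # Two adjacent tokens form a known multi-word term.
            terms.append(PHRASES[pair])
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms


example_terms = extract_terms("Ldg gear failed")
```

Here the term "landing gear" represents the text even though the word "landing" never appears in it explicitly.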
In vector-based methods of text data analysis, after a suitable set of terms has been defined for a document collection, the collection can be represented as a vector space. In traditional vector space methods, individual documents are treated as vectors in a high-dimensional vector space in which each dimension corresponds to some feature of a document. A collection of documents can thus be represented by a two-dimensional matrix D(t,d) of features and documents. In the typical case, the features correspond to document terms, and the value of each entry is the number of occurrences of that term in the specified document. For example, if term t1 occurs four times in document d1, then D(1,1) is set to 4. Similarly, if term t2 does not occur in d1, then D(2,1) is set to 0. More complex types of vector space methods, such as latent semantic indexing (LSI), involve methods of transforming D, e.g., singular value decomposition (SVD) or semi-discrete decomposition (SDD), which typically attempt to provide a more sophisticated set of features and a better measure of the importance of each feature in a document.
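The construction of D can be sketched in a few lines of Python. This is an illustration only: it assumes that terms are lowercase whitespace-separated tokens, and the function name term_document_matrix and the sample documents are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, D), where D[i][j] counts term i in document j."""
    # Assumption: a term is a lowercase whitespace-separated token.
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # One row per term, one column per document; absent terms count as 0.
    D = [[c[t] for c in counts] for t in terms]
    return terms, D


docs = ["gear gear gear gear wing", "wing engine"]
terms, D = term_document_matrix(docs)
# The row for "gear" is [4, 0]: four occurrences in the first document,
# none in the second.
```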
By representing documents as vectors in a feature space, similarities between documents can be evaluated by computing the distance between the vectors representing the documents. A cosine measure is commonly used for this purpose, but other distance measures can be used.
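A minimal sketch of the cosine measure follows, using plain Python lists as document vectors. The function name and the sample vectors are illustrative assumptions, as is the choice to return 0.0 for an all-zero vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty document is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


d1 = [0, 4, 1]
d2 = [1, 0, 1]
d3 = [0, 8, 2]  # same term proportions as d1, twice the length

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
```

Because the measure depends only on direction, d1 and d3 score as essentially identical even though d3 is twice as long, while d1 and d2, which share only one term, score much lower.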
The advantages of the vector space method are that it provides a simple and uniform representation of documents and queries, can accommodate many variations appropriate to different document collections, and has been shown to perform relatively well in information retrieval applications. In addition, representing documents as vectors could be useful for other text analysis operations. However, the performance of the basic vector space method is severely limited by the size of D. In actual document collections, both the number of documents and the number of terms are typically quite large, resulting in a large D, and making the necessary distance calculations prohibitively slow. It is possible to alleviate this problem by preselecting a subset of all possible terms to use in the matrix, but this can degrade information retrieval performance and limit text analysis capability. Finally, while the traditional vector space method provides a way of assessing the similarities between documents, it alone does not provide a good way to visualize these relationships or summarize documents.
As disclosed in U.S. Pat. No. 6,611,825, an improved vector space method has been developed that allows the user to efficiently perform a variety of text analysis operations including information retrieval, term and document visualization, term and document clustering, term and document classification, summarization of individual documents and groups of documents, and document cross-referencing. In this technique, the document collection is represented using a subspace transformation based on the distribution and/or number of occurrences of terms in the documents of the document collection. In particular, a term-by-document frequency matrix D is initially constructed that catalogs the frequencies of the various terms for each of the documents.
In addition, the term-by-document matrix can be preprocessed to define a working matrix A by normalizing the columns of the term-by-document matrix D to have a unit sum, stabilizing the variance of the term frequencies via a nonlinear function, and then centering the term frequencies with respect to the mean vector of the columns. This preprocessing is denoted as A = ƒ(D) − ce^T, in which c is the mean of the columns of ƒ(D) and e is a d-vector whose components are all equal to one, so that the average of the columns of A is now zero. Each ijth entry in A is therefore a score indicating the relative occurrence of the ith term in the jth document. Traditionally, ƒ is defined as a two-sided weighting function, i.e., ƒ(D) = (Wt D) Wd, wherein Wt and Wd are two diagonal scaling matrices for weighting terms and documents, respectively, as known to those skilled in the art.
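The three preprocessing steps can be sketched with NumPy as follows. The text does not name the nonlinear variance-stabilizing function, so the square root used here is purely an assumption, and the function name preprocess is hypothetical. This sketch is not the two-sided weighting form of ƒ mentioned above.

```python
import numpy as np


def preprocess(D):
    """Sketch of A = f(D) - c e^T for a term-by-document count matrix D."""
    D = np.asarray(D, dtype=float)
    # Step 1: normalize each column (document) to unit sum.
    col_sums = D.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0  # leave empty documents as zero columns
    # Step 2: stabilize variance. ASSUMPTION: square root; the source only
    # says "a nonlinear function".
    F = np.sqrt(D / col_sums)
    # Step 3: subtract c, the mean of the columns of f(D). Broadcasting the
    # column vector c across all documents is the product c e^T.
    c = F.mean(axis=1, keepdims=True)
    return F - c


D = np.array([[0, 1],
              [4, 0],
              [1, 1]])
A = preprocess(D)
```

After centering, the average of the columns of A is the zero vector, which can be checked by confirming that every row of A has a mean of zero.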
To capture some of the semantics latent in the documents, i.e., to capture similarity of content despite variations in word usage such as the use of synonyms, the working matrix A is orthogonally decomposed to obtain a rank-k matrix Ak that approximates A. In this regard, the orthogonal decomposition of the working matrix A can be performed with a number of decompositional techniques, such as a two-sided orthogonal decomposition.
By way of example, one typical two-sided orthogonal decomposition is a truncated URV (TURV) decomposition. For a given dimensionality k, the TURV computes bases of subspaces with high information content (matrices Uk and Vk with orthonormal columns) satisfying the equation A Vk = Uk Rk, wherein Rk is a triangular matrix of order k. An approximate term-document matrix Ak is then defined as Ak = Uk Rk Vk^T.
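The following NumPy sketch produces factors with the stated properties: orthonormal Uk and Vk, a triangular Rk, and the relation A Vk = Uk Rk. It is not the TURV algorithm itself. As a simplifying assumption it takes Vk from the leading right singular vectors of A and then obtains Uk and Rk by a QR factorization of A Vk, so the resulting Ak coincides with the truncated SVD approximation. The function name and the random test matrix are hypothetical.

```python
import numpy as np


def truncated_two_sided(A, k):
    """Return Uk, Rk, Vk, Ak with A @ Vk == Uk @ Rk and Rk upper triangular."""
    # ASSUMPTION: choose Vk as the k leading right singular vectors of A.
    # A true TURV computes its bases differently, typically at lower cost.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                      # d-by-k, orthonormal columns
    # QR of the projected matrix gives orthonormal Uk and triangular Rk.
    Uk, Rk = np.linalg.qr(A @ Vk)      # Uk is t-by-k, Rk is k-by-k
    Ak = Uk @ Rk @ Vk.T                # rank-k approximation of A
    return Uk, Rk, Vk, Ak


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # stand-in for a working matrix
Uk, Rk, Vk, Ak = truncated_two_sided(A, k=2)
```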
For the approximation Ak, as well as for A, each row corresponds to a term and each column corresponds to a document. The ijth entry of Ak therefore provides a relative occurrence of the ith term in the jth document, but this relative occurrence has now been filtered by the approximation which captures semantics latent in the documents. More specifically, the factor Uk captures variations in vocabulary, while the factor Vk^T brings out latent structure in the corpus of documents.
As disclosed in U.S. Pat. No. 6,611,825, one can summarize a document by returning all the words with a weight above some threshold in the column corresponding to that document (or, if the document is not in the original collection, it can be projected into the subspace to obtain a vector analogous to the column vectors of Ak). However, some of these words will not be in the document itself. A person viewing a document will often want to see which words in the document best capture the intrinsic topics of the document or which words in the document are responsible for it being returned by a query. In the case of a keyword query, this is trivial, since the keywords used in the query must occur in the document for it to be returned. However, if one has used the subspace method of document retrieval, then some or all of the words of the query may not occur in the document returned.
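The threshold-based summary, and the way it can return words absent from the document, can be illustrated with a small NumPy sketch. The entries of Ak below are invented for illustration rather than computed from any real collection, and the function name, term list, and threshold are likewise hypothetical.

```python
import numpy as np


def summarize_column(Ak, terms, j, threshold):
    """Terms whose weight in column j of Ak exceeds threshold, heaviest first."""
    column = Ak[:, j]
    order = np.argsort(-column)  # indices sorted by descending weight
    return [terms[i] for i in order if column[i] > threshold]


terms = ["engine", "gear", "wing"]
# Raw counts: document 0 contains "gear" and "wing" but never "engine".
D = np.array([[0, 1],
              [4, 0],
              [1, 1]])
# ILLUSTRATIVE values only; in practice Ak comes from the decomposition.
Ak = np.array([[0.30, 0.05],
               [0.60, -0.10],
               [0.10, 0.40]])

summary = summarize_column(Ak, terms, j=0, threshold=0.2)
# Summary terms that do not literally occur in document 0.
absent = [t for t in summary if D[terms.index(t), 0] == 0]
```

With these made-up weights the summary of document 0 includes "engine", a word the document never uses, which is the limitation described above.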
Accordingly, it is desirable to provide a method and apparatus that can be used in conjunction with a multidimensional subspace representation of an existing set of documents to analyze an individual document, identify relevant terms and summarize the individual document contents in relation to the content of the existing set of documents or in relation to a specific query.