The invention relates generally to text mining, and more specifically to using multidimensional subspaces to represent semantic relationships that exist in a set of documents.
Text mining is an extension of the general notion of data mining in the area of free or semi-structured text. Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated methods for analyzing relationships among highly formatted data, i.e., numerical data or data with a relatively small fixed number of possible values. However, much of the knowledge associated with an enterprise consists of textually-expressed information, including free text fields in databases, reports and other documents generated in the company, memos, e-mail, Web sites, and external news articles used by managers, market analysts, and researchers. This data is inaccessible to traditional data mining techniques, because these techniques cannot handle the unstructured or semi-structured nature of free text. Similarly, the analysis task is beyond the capabilities of traditional document management systems and databases. Text mining is a developing field devoted to helping knowledge workers find relationships between individual unstructured or semi-structured text documents and semantic patterns across large collections of such documents.
Research in text mining has its roots in information retrieval. Initial information retrieval work began around 1960, when researchers started to systematically explore methods to match users' queries to documents in a database. However, recent advances in computer storage capacity and processing power coupled with massive increases in the amount of text available on-line have resulted in a new emphasis on applying techniques learned from information retrieval to a wider range of text mining problems. Concurrently, text mining has grown from its origins in simple information retrieval systems to encompass additional operations including: information visualization; document classification and clustering; routing and filtering; document summarization; and document cross-referencing. All of the text mining operations listed above share the common need to automatically assess and characterize the similarity between two or more pieces of text. This need is most obvious in information retrieval.
All information retrieval methods depend upon the twin concepts of document and term. A document refers to any body of free or semi-structured text that a user is interested in getting information about in his or her text mining application. This text can be the entire content of a physical or electronic document, an abstract, a paragraph, or even a title. The notion of a document also encompasses text generated from images and graphics or text recovered from audio and video objects. Ideally, a document describes a coherent topic. All documents are represented as collections of terms, and individual terms can appear in multiple documents. Typically, a term is a single word that is used in the text. However, a term can also refer to several words that are commonly used together, for example, "landing gear." In addition, the terms that represent a piece of text may not appear explicitly in the text; a document's terms may be obtained by applying acronym and abbreviation expansion, word stemming, spelling normalization, thesaurus-based substitutions, or many other techniques. Obtaining the best set of terms for a given document is dependent upon the document or the collection to which the document belongs and the particular goal of the text mining activity.
Once a suitable set of documents and terms has been defined for a text collection, various information retrieval techniques can be applied to the collection. These techniques can be grouped into four broad categories: keyword search methods, natural language understanding methods, probabilistic methods, and vector space methods. Each of these categories is discussed below.
Keyword search methods are currently the most basic and widely used technique in commercial information retrieval systems. The simplest keyword search method retrieves all documents that contain the exact words that are present in the user's query. More advanced keyword search methods trade this cognitive simplicity for more powerful query specification languages, including the ability to specify Boolean operators, proximity operators, fuzzy matching operators, synonym lists, phonetic spellouts, and other term equivalence classes. The advantage of keyword search methods is that they are efficient to implement, very effective for a certain class of queries, and cognitively straightforward. However, keyword search methods have several disadvantages. First, because the technique relies solely on the matching of words, performance is highly dependent on the exact formulation of the query. Second, keyword search methods furnish only crude ways to determine the relevance of a document to the query. The most relevant document may be at the bottom of an enormous list. Third, keyword searching has problems due to the highly ambiguous nature of natural language. Relative to an ideal query response, keyword-based information retrieval systems can both overgenerate and undergenerate. Overgeneration occurs if the query terms have multiple meanings in the document set. Undergeneration occurs if relevant documents in the set happen to use synonyms for the query terms. Finally, keyword searching has limited applicability in general text mining. Keyword searching methods are typically effective only for information retrieval applications, and either do not address or only crudely apply to many of the other text mining operations listed above.
Natural language understanding methods are knowledge-intensive techniques that attempt to parse each sentence in the document collection and generate a semantic interpretation. The individual sentential interpretations can then be composed to yield highly detailed semantic characterizations of documents that can be used to support further text mining operations. Natural language understanding techniques typically require several types of detailed knowledge bases: lexical (information about the grammatical classes and meanings of individual words); syntactic (information about the grammar of expressions); semantic (information about how the expressions refer to the world); and pragmatic (information about the likely intentions and goals of the document author). Given sufficient knowledge bases, the natural language understanding approach to text mining can potentially find levels of detail that cannot be found by other methods. However, the difficulty of constructing and maintaining the required knowledge bases is a major drawback of this method. Knowledge bases of sufficient detail typically require a great deal of time and prior knowledge about the document collection and must be continually updated and checked for consistency. Further, the information in these knowledge bases ranges from fairly general world knowledge to completely domain specific facts and expert knowledge, and can easily involve probabilities, uncertainties, contradictions, and other difficult issues. In addition, even when provided with sufficient knowledge bases, natural language understanding algorithms are extremely slow compared to other text mining approaches. Therefore, natural language understanding methods are typically best suited for text mining applications requiring very detailed information from a specified document or small number of documents, and are not well suited for the analysis of large document collections.
The overall goal of probabilistic information retrieval methods is to estimate the probability that a document is relevant to a query, and to use this estimate to construct a relevance-ranked list of query matches. Probabilistic methods differ in how they model the relationship between documents and queries. The most frequently discussed probabilistic model is the binary independence model, although it is rarely used in practice. Difficulties in parameter estimation due to lack of training data plague most probabilistic methods, including those based on the binary independence model. Recently, Bayesian probabilistic models have been proposed that can overcome this difficulty while retaining the advantage of a solid theoretical grounding. Rather than requiring a single point estimate for each of the model parameters, Bayesian models allow the system designer to specify prior distributions for these parameters. According to Bayesian statistical theory, these distributions should embody any prior knowledge about these quantities, and can incorporate knowledge about other queries, terms in the query, and term frequency and distribution in the document set. Further, as the system operates and more knowledge is gained, the parameter distributions can be systematically updated and revised to obtain posterior distributions. In this way, probabilistic information retrieval methods based on Bayesian models can naturally incorporate relevance feedback data. The primary advantage of probabilistic methods for text mining is that they are grounded in statistical theory; Bayesian probabilistic models, in particular, are promising for certain information retrieval tasks. However, because these methods generally focus only on the information retrieval, routing and filtering, and classification aspects of text mining, they have somewhat limited applicability to other text mining operations, such as visualization.
Traditional vector space models treat individual documents as vectors in a high-dimensional vector space in which each dimension corresponds to some feature of a document. A collection of documents can therefore be represented by a two-dimensional matrix A (of dimension f×d) of features and documents. In the typical case, the features correspond to document terms, and the value of each feature is the frequency of that term in the specified document. For example, if term t1 occurs four times in document d1, then A(1,1) is set to 4. Similarly, if term t2 does not occur in d1, then A(2,1) is set to 0. More complex types of vector space methods, such as latent semantic indexing (LSI), involve different methods to compute the elements in A, e.g., singular value decomposition (SVD) or semi-discrete decomposition (SDD), typically attempting to provide a more sophisticated set of features and a better measure of the importance of each feature in a document.
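By way of illustration only, the construction of the matrix A described above can be sketched in Python as follows. The function name term_frequency_matrix, the sample documents, and the use of lowercase whitespace splitting as a tokenizer are assumptions made for this sketch and do not come from the text. Rows correspond to terms and columns to documents.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Return (terms, A), where A[i][j] counts occurrences of terms[i] in docs[j]."""
    # Assumption: lowercase whitespace splitting stands in for a real tokenizer.
    counts = [Counter(doc.lower().split()) for doc in docs]
    # One row per distinct term, in alphabetical order.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for a term that is absent from a document.
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

docs = ["gear gear gear gear landing", "landing strip"]
terms, A = term_frequency_matrix(docs)
# "gear" occurs four times in the first document and never in the second,
# so its row holds a 4 followed by a 0.
```

Note that most entries of such a matrix are zero for realistic collections, so a sparse representation would normally be preferred over nested lists.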
By representing documents as vectors in a feature space, similarity between documents can be evaluated by computing the distance between the vectors representing the documents. The cosine measure is commonly used for this purpose, but other distance measures can be used. To use the vector space model as a method for information retrieval, a user's query is represented as a vector in the same space as the document vectors. The distance between the query vector and each of the document vectors is computed, and the documents that are closest to the query are retrieved.
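As a minimal sketch of the retrieval step just described, the following Python code ranks document vectors by their cosine measure against a query vector. The function names cosine and rank and the sample vectors are hypothetical and chosen for this example only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, doc_vectors):
    """Return document indices ordered from most to least similar to the query."""
    scored = [(cosine(query, d), j) for j, d in enumerate(doc_vectors)]
    return [j for _, j in sorted(scored, key=lambda pair: -pair[0])]

# Each document vector holds term counts over the same three terms.
doc_vectors = [[4, 1, 0], [0, 1, 1]]
ranking = rank([1, 0, 0], doc_vectors)  # the query uses only the first term
```

Because the cosine measure normalizes by vector length, a long document is not favored merely for containing more words.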
The advantages of the vector space method are that it provides a simple and uniform representation of documents and queries, can accommodate many variations appropriate to different document collections, and it has been shown to perform well in information retrieval applications. In addition, representing documents as vectors could be useful for all other text mining operations. However, the performance of the basic vector space model is severely limited by the size of A. In actual document collections, both the number of documents and the number of terms are typically quite large, resulting in a large A, and making the necessary distance calculations prohibitively slow. It is possible to alleviate this problem by preselecting a subset of all possible terms to use in the matrix, but this can degrade information retrieval performance and limits text mining capability. Finally, while the vector space model provides a way of assessing the similarities between pieces of text, it alone does not provide a good way to visualize these relationships or summarize documents.
Accordingly, there exists a need in the art of text mining for a method that addresses all text mining operations in a reasonably fast manner over a large data set without requiring prior knowledge of the data. The method should be easily extensible to changing document collections. Additionally, the method should provide a mechanism for labeling the data with meaningful terms that characterize the semantic dimensions of the document collection, as well as supporting visual exploration of the data.
The present invention is directed to a method, a system, and computer readable medium for representing a document collection using a subspace projection based on the distribution of the occurrence of terms in the documents of the document collection by: constructing a term frequency matrix; statistically transforming the term frequencies in the term frequency matrix using a two-sided orthogonal decomposition and identifying the significant features to determine a term subspace and a document subspace; and performing a projection into a lower dimensional subspace. The projection can either be a projection of documents into the term subspace or a projection of terms into the document subspace.
In accordance with other aspects of the invention, the term frequency matrix is constructed by defining the term frequency matrix with each row representing a term occurring in one or more of the documents in the document collection, and each column representing one of the documents in the document collection. Each entry in the term frequency matrix represents the number of occurrences of the term for the document.
In accordance with yet other aspects of the invention, the terms used in the term frequency matrix are determined by a tokenizing policy that tokenizes the document and removes stopwords from the list of tokens. Additional steps can optionally be performed including: removing low frequency words, acronym expansion, abbreviation expansion, other term normalization, and stemming.
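By way of example, a tokenizing policy of the kind described above might be sketched as follows. The stopword list, the regular expression, the min_count threshold, and the function names are all assumptions for illustration; the text does not specify any of them. Acronym expansion, abbreviation expansion, and stemming are omitted.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

def drop_low_frequency(token_lists, min_count=2):
    """Optional step: remove tokens seen fewer than min_count times in the collection."""
    totals = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if totals[t] >= min_count] for tokens in token_lists]

tokens = tokenize("The landing gear of the aircraft")
filtered = drop_low_frequency([["landing", "gear"], ["gear", "strip"]])
```

Whatever policy is chosen, the same policy would have to be applied to later documents and queries so that their terms line up with the original term list.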
In accordance with further aspects of the invention, when the document collection is a dynamically changing document collection, the subspace is updated as the document collection changes. The subspace is updated by identifying the new documents and new terms in the new documents, constructing a new term frequency matrix, statistically transforming the term frequencies in the same way that the initial document collection was transformed, projecting the term frequency matrix on the term subspace, computing the residual, augmenting the term subspace with the normalized residual, expanding the document subspace, and re-identifying significant features in the subspace.
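The projection, residual, and augmentation steps of the update described above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the new documents are assumed to be already transformed and expressed over the same term rows as the existing basis, the function name augment_term_subspace and the tolerance tol are hypothetical, and the expansion of the document subspace and the re-identification of significant features are not shown.

```python
import numpy as np

def augment_term_subspace(term_basis, new_docs, tol=1e-10):
    """Extend an orthonormal term basis with the normalized residuals of new documents.

    term_basis: (n_terms, k) array with orthonormal columns.
    new_docs:   (n_terms, m) array, one column per new document.
    """
    basis = term_basis.copy()
    for col in new_docs.T:
        # Project the document on the current subspace and keep what is left over.
        residual = col - basis @ (basis.T @ col)
        norm = np.linalg.norm(residual)
        # Only a document with a component outside the subspace adds a direction.
        if norm > tol:
            basis = np.column_stack([basis, residual / norm])
    return basis

term_basis = np.array([[1.0], [0.0], [0.0]])
new_docs = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 0.0]])
updated = augment_term_subspace(term_basis, new_docs)
```

Processing one document at a time keeps the basis orthonormal after every step, at the cost of being slower than a blocked update for large batches.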
In accordance with yet further aspects of the invention, identifying the new terms in the documents of an updated collection uses the same policies to determine the terms as were used for determining the terms of the initial document collection. This includes tokenizing the document, removing stopwords, removing low frequency words, performing acronym expansion, performing abbreviation expansion, performing other term normalization, and performing stemming according to the policies used during the determination of the initial document collection.
In accordance with still further aspects of the invention, statistically transforming term frequencies includes: determining a document length; determining a relative term frequency for the occurrence by adjusting the raw frequency of occurrence of each term by the document length; performing an operation on the relative term frequencies to make high frequency and low frequency terms more comparable; and, as an optional step, centering the data around the origin by subtracting the row average from each term frequency.
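One possible reading of these transformation steps is sketched below. Two choices are assumptions made for this example and are not stated in the text: the document length is taken to be the total number of term occurrences in the document, and a square root is used as the operation that makes high frequency and low frequency terms more comparable. The function name transform is likewise hypothetical.

```python
import math

def transform(A, center=True):
    """Transform a term-by-document count matrix given as a list of rows."""
    n_terms, n_docs = len(A), len(A[0])
    # Assumption: document length = total term occurrences in that column.
    lengths = [sum(A[i][j] for i in range(n_terms)) for j in range(n_docs)]
    # Relative term frequency: raw count divided by document length.
    rel = [[A[i][j] / lengths[j] if lengths[j] else 0.0 for j in range(n_docs)]
           for i in range(n_terms)]
    # Assumption: a square root compresses the gap between frequent and rare terms.
    damped = [[math.sqrt(x) for x in row] for row in rel]
    if center:
        # Optional step: subtract each row's average so every term row sums to zero.
        damped = [[x - sum(row) / n_docs for x in row] for row in damped]
    return damped

A = [[4, 0], [0, 1]]
T = transform(A)
T_uncentered = transform(A, center=False)
```

Centering destroys the sparsity of the matrix, which is presumably one reason the text treats it as optional.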
In accordance with yet other aspects of the invention, projecting the documents or terms into a lower dimension subspace is accomplished by determining the number of dimensions to use; performing a truncated two-sided orthogonal decomposition of the term frequency matrix according to the determined number of dimensions; and projecting the documents or terms into the lower dimensional subspace. Preferably, the truncated two-sided orthogonal decomposition is a truncated URV decomposition.
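The following sketch illustrates the truncation and projection steps. It substitutes a truncated singular value decomposition for the truncated URV decomposition that the text prefers, because NumPy offers no URV routine; the SVD is itself a two-sided orthogonal decomposition, but it is a stand-in here and not the preferred method. The function names and the sample matrix are hypothetical.

```python
import numpy as np

def truncated_subspaces(A, k):
    """Return orthonormal bases for a k-dimensional term and document subspace.

    Stand-in: uses the SVD A = U diag(s) V^T and keeps the k leading directions.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], Vt[:k, :].T

def project_documents(A, term_basis):
    """Coordinates of each document (column of A) in the term subspace: k x n_docs."""
    return term_basis.T @ A

def project_terms(A, doc_basis):
    """Coordinates of each term (row of A) in the document subspace: n_terms x k."""
    return A @ doc_basis

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.1]])
term_basis, doc_basis = truncated_subspaces(A, k=2)
doc_coords = project_documents(A, term_basis)
term_coords = project_terms(A, doc_basis)
```

With k = 2 the weakest direction of the sample matrix is discarded, so each document is described by two coordinates instead of three.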
In accordance with still other aspects of this invention, verbal semantics for a dimension, i.e., words or terms describing the meaning of that dimension, can be provided by: identifying the column in the term frequency that corresponds to the dimension; identifying terms in the column with the largest absolute values for both positive and negative entries; and returning the largest positive and negative entries as a contrast set describing the meaning of the position along the dimension.
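A contrast set of this kind can be sketched as follows, given the column of term weights for one dimension. The function name contrast_set, the parameter n, and the sample terms and weights are invented for illustration.

```python
def contrast_set(terms, column, n=2):
    """Return (positive_terms, negative_terms) with the n largest-magnitude weights."""
    ranked = sorted(zip(column, terms))  # ascending by weight
    # Most negative weights come first in the ascending order.
    negative = [t for w, t in ranked[:n] if w < 0]
    # Most positive weights come last; reverse so the largest is listed first.
    positive = [t for w, t in reversed(ranked[-n:]) if w > 0]
    return positive, negative

terms = ["engine", "wing", "invoice", "payment", "the"]
column = [0.7, 0.5, -0.6, -0.4, 0.01]
positive, negative = contrast_set(terms, column)
```

In this made-up example the dimension would be read as contrasting aircraft vocabulary at its positive end with billing vocabulary at its negative end.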
In accordance with further aspects of the invention, a summarization of an individual document, i.e., a few words or terms summarizing the contents of the document or indicating its topical affiliations, can be provided by: projecting the document into a subspace; projecting the document projection back into a term space; identifying a plurality of terms with the largest entries; and returning the identified terms as the document summarization.
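The project-and-project-back procedure for a single document can be sketched as below. The term basis used here is a hand-built orthonormal basis chosen only to make the example easy to follow; in practice it would come from the decomposition of the collection. The function name summarize and the sample data are hypothetical.

```python
import numpy as np

def summarize(doc_vector, term_basis, terms, n=3):
    """Return the n terms with the largest entries after a round trip through the subspace."""
    coords = term_basis.T @ doc_vector   # project the document into the subspace
    back = term_basis @ coords           # project the projection back into term space
    top = np.argsort(back)[::-1][:n]     # indices of the largest entries
    return [terms[i] for i in top]

terms = ["engine", "wing", "invoice", "payment"]
# Assumption: two orthonormal directions, one per topic, built by hand.
term_basis = np.array([[1, 0], [1, 0], [0, 1], [0, 1]]) / np.sqrt(2)
doc = np.array([3.0, 0.0, 0.0, 1.0])  # "engine" three times, "payment" once
summary = summarize(doc, term_basis, terms, n=2)
```

The round trip can surface a term such as "wing" that never occurs in the document, because the subspace associates it with "engine".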
In accordance with yet further aspects of the invention, a summarization of a group of documents, i.e., a few words or terms summarizing the contents of the group of documents, can be provided by: projecting the group of documents into a subspace; finding the centroid of the group of documents in the subspace; projecting the centroid back into a term space; identifying terms with the largest entries; and returning the identified terms as a summarization of the group of documents.
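For a group of documents, the centroid-based variant might look like the sketch below. As before, the orthonormal term basis is constructed by hand for illustration, and the function name summarize_group and the sample data are assumptions.

```python
import numpy as np

def summarize_group(doc_matrix, term_basis, terms, n=3):
    """Summarize a group of documents (columns of doc_matrix) through their centroid."""
    projected = term_basis.T @ doc_matrix   # k x m subspace coordinates
    centroid = projected.mean(axis=1)       # centroid of the group in the subspace
    back = term_basis @ centroid            # centroid expressed in term space
    top = np.argsort(back)[::-1][:n]
    return [terms[i] for i in top]

terms = ["engine", "wing", "invoice", "payment"]
# Assumption: hand-built orthonormal basis with one direction per topic.
term_basis = np.array([[1, 0], [1, 0], [0, 1], [0, 1]]) / np.sqrt(2)
group = np.array([[2.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
group_summary = summarize_group(group, term_basis, terms, n=2)
```

Two of the three sample documents concern aircraft, so the centroid leans toward that topic and the summary reflects it.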
In accordance with yet further aspects of the invention, information visualization can be performed. Information visualization is performed in response to a user requesting information visualization. Information visualization is performed by computing the requested dimensions, determining default dimensions if the user has requested fewer than three dimensions (or whatever number of dimensions are being used for visualization), ensuring that the dimensions are orthogonalized if the user so desires, projecting documents onto the resulting dimensions, generating legends for the dimensions, and displaying the documents on labeled axes corresponding to the dimensions.
In accordance with yet further aspects of the invention, information retrieval can be performed upon user request. Information retrieval is performed by constructing a term frequency query vector, statistically transforming the term frequencies in the same way that the terms in the original document collection were transformed, projecting the vector into the subspace, determining similarity by measuring the distance between the query and the documents, and returning ranked matching documents. The term frequency vector is constructed by tokenizing using the same policy as was used for the original document collection, performing any optional functions using the same policy as was used for the original document collection, and identifying the terms in the original term frequency matrix representing the original document collection.
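The retrieval steps above can be sketched as follows. For brevity the statistical transformation of the query is omitted, although the text requires the same transformation that was applied to the collection; the whitespace tokenizer, the hand-built term basis, and the function names query_vector and retrieve are assumptions for this example. Query words that are absent from the original term list are ignored.

```python
import numpy as np

def query_vector(query, terms):
    """Term frequency vector of the query over the original term list."""
    index = {t: i for i, t in enumerate(terms)}
    q = np.zeros(len(terms))
    for token in query.lower().split():  # assumption: whitespace tokenizer
        if token in index:               # unknown words are skipped
            q[index[token]] += 1
    return q

def retrieve(query, terms, term_basis, doc_coords):
    """Return document indices ranked by cosine similarity in the subspace."""
    q = term_basis.T @ query_vector(query, terms)
    denom = np.linalg.norm(doc_coords, axis=0) * np.linalg.norm(q)
    scores = np.zeros(doc_coords.shape[1])
    ok = denom > 0
    scores[ok] = (doc_coords.T @ q)[ok] / denom[ok]
    return [int(j) for j in np.argsort(-scores)]

terms = ["engine", "wing", "invoice", "payment"]
term_basis = np.array([[1, 0], [1, 0], [0, 1], [0, 1]]) / np.sqrt(2)
A = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
doc_coords = term_basis.T @ A
ranked = retrieve("wing", terms, term_basis, doc_coords)
```

Comparing the query and the documents in the reduced subspace means each similarity costs work proportional to the number of retained dimensions, not to the size of the vocabulary.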
In accordance with still other aspects of the invention, document cross-referencing can be performed by segmenting the documents of the document collection into a plurality of suitably sized units, constructing a subspace representation for the units from all or a subset of the documents, projecting all the units into that subspace, determining similarity by measuring the distance between the projections of the units from one or more documents and the projections of the units in one or more target documents, and returning ranked matching units as potential cross-reference sections.
In accordance with still other aspects of the present invention, other text mining operations, including document clustering, term clustering, term classification, and document classification, can be performed by using the subspace projections of documents or terms as the input to a variety of clustering or classification algorithms.