The present invention relates in general to text mining and, in particular, to a system and method for efficiently generating cluster groupings in a multi-dimensional concept space.
Document warehousing extends data warehousing to content mining and retrieval. Document warehousing attempts to extract semantic information from collections of unstructured documents to provide conceptual information with a high degree of precision and recall. Documents in a document warehouse share several properties. First, the documents lack a common structure or shared type. Second, semantically-related documents are integrated through text mining. Third, essential document features are extracted and explicitly stored as part of the document warehouse. Finally, documents are often retrieved from multiple and disparate sources, such as over the Internet or as electronic messages.
Document warehouses are built in stages to deal with a wide range of information sources. First, document sources are identified and documents are retrieved into a repository. For example, the document sources could be electronic messaging folders or Web content retrieved over the Internet. Once retrieved, the documents are pre-processed to regularize the information into a consistent format. Next, during text analysis, text mining is performed to extract semantic content, including identifying dominant themes, extracting key features and summarizing the content. Finally, metadata is compiled from the semantic context to explicate essential attributes. Preferably, the metadata is provided in a format amenable to normalized queries, such as through database management tools. Document warehousing is described in D. Sullivan, "Document Warehousing and Text Mining, Techniques for Improving Business Operations, Marketing, and Sales," Chs. 1-3, Wiley Computer Publishing (2001), the disclosure of which is incorporated by reference.
Text mining is at the core of the document warehousing process. Text mining involves the compiling, organizing and analyzing of document collections to support the delivery of targeted types of information and to discover relationships between relevant facts. However, identifying relevant content can be difficult. First, extracting relevant content requires a high degree of precision and recall. Precision is the measure of how well the documents returned in response to a query actually address the query criteria. Recall is the measure of how many of the documents that should have been returned by the query actually were returned. Typically, the broader and less structured the documents, the lower the degree of precision and recall. Second, analyzing an unstructured document collection without the benefit of a priori knowledge in the form of keywords and indices can present a potentially intractable problem space. Finally, synonymy and polysemy can cloud and confuse extracted content. Synonymy refers to multiple words having the same meaning and polysemy refers to a single word with multiple meanings. Fine-grained text mining must reconcile synonymy and polysemy to yield meaningful results.
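By way of illustration only, precision and recall for a single query can be sketched as follows. The function name precision_recall and the document identifiers are hypothetical and do not appear in the source; the sketch assumes the sets of returned and relevant documents are already known.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the documents the query actually returned
    relevant: ids of the documents that should have been returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # documents both returned and relevant
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three were relevant,
# and two of the returned documents are among the relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
```

In this example, half of the returned documents address the query (precision) and two of the three relevant documents were found (recall).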
In particular, the transition from syntactic to semantic content analysis requires a shift in focus from the grammatical level to the meta level. At a syntactic level, documents are viewed structurally as sentences comprising individual terms and phrases. In contrast, at a semantic level, documents are viewed in terms of meaning. Terms and phrases are grouped into clusters representing individual concepts and themes.
Data clustering allows the concepts and themes to be developed more fully based on the extracted syntactic information. A balanced set of clusters reflects terms and phrases from every document in a document set. Each document may be included in one or more clusters. Conversely, concepts and themes are preferably distributed over a meaningful range of clusters.
Creating an initial set of clusters from a document set is crucial to properly visualizing the semantic content. Generally, a priori knowledge of semantic content is unavailable when forming clusters from unstructured documents. The difficulty of creating an initial cluster set is compounded when evaluating different types of documents, such as electronic mail (email) and word processing documents, particularly when included in the same document set.
In the prior art, several data clustering techniques are known. Exhaustive matching techniques fit each document into one of a pre-defined and fixed number of clusters using a closest-fit approach. However, this approach forces an arbitrary number of clusters onto a document set and can skew the meaning of the semantic content mined from the document set.
A related prior art clustering technique performs gap analysis in lieu of exhaustive matching. Gaps in the fit of points of data between successive passes are merged if necessary to form groups of documents into clusters. However, gap analysis is computationally inefficient, as multiple passes through a data set are necessary to effectively find a settled set of clusters.
Therefore, there is a need for an approach to forming clusters of concepts and themes into groupings of classes with shared semantic meanings. Such an approach would preferably categorize concepts mined from a document set into clusters defined within a pre-specified range of variance. Moreover, such an approach would not require a priori knowledge of the data content.
The present invention provides a system and method for generating logical clusters of documents in a multi-dimensional concept space for modeling semantic meaning. Each document in a set of unstructured documents is first analyzed for syntactic content by extracting literal terms and phrases. The semantic content is then determined by modeling the extracted terms and phrases in multiple dimensions. Histograms of the frequency of occurrences of the terms and phrases in each document and over the entire document set are generated. Related documents are identified by finding highly correlated term and phrase pairings. These pairings are then used to calculate Euclidean distances between individual documents. Those documents corresponding to concepts separated by a Euclidean distance falling within a predetermined variance are grouped into clusters by k-means clustering. The remaining documents are grouped into new clusters. The clusters can be used to visualize the semantic content.
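As a rough, non-limiting sketch of the histogram and distance steps summarized above, the following counts term occurrences per document and computes a Euclidean distance between two documents. It is a simplification: it compares raw term frequencies directly rather than the highly correlated term and phrase pairings described above, and the names term_histogram and euclidean_distance, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_histogram(text):
    """Count occurrences of each whitespace-separated term (lowercased)."""
    return Counter(text.lower().split())


def euclidean_distance(hist_a, hist_b):
    """Euclidean distance between two term-frequency histograms.

    A Counter returns 0 for a missing term, so terms that occur in only
    one document still contribute to the distance.
    """
    terms = set(hist_a) | set(hist_b)
    return math.sqrt(sum((hist_a[t] - hist_b[t]) ** 2 for t in terms))


# Hypothetical three-document set.
docs = {
    "a": "cluster concept cluster",
    "b": "cluster concept",
    "c": "email folder",
}
histograms = {doc_id: term_histogram(text) for doc_id, text in docs.items()}

# Histogram over the entire document set.
corpus_histogram = sum(histograms.values(), Counter())

dist_ab = euclidean_distance(histograms["a"], histograms["b"])
dist_ac = euclidean_distance(histograms["a"], histograms["c"])
```

Documents a and b share their vocabulary and differ by a single occurrence, so they lie closer together than a and c, which share no terms.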
An embodiment of the present invention is a system and a method for building a multi-dimensional semantic concept space over a stored document collection. A plurality of documents within a stored document collection containing substantially correlated terms reflecting syntactic content are identified. A vector reflecting semantic similarities between substantially correlated documents at an angle θ from a common axis in a concept space is generated. One or more clusters are formed at an angle σ from the common axis in the concept space. Each cluster includes documents having such an angle θ falling within a predefined variance of the angle σ for the cluster. A new cluster is constructed at an angle σ from the common axis in the concept space. Each new cluster includes documents having such an angle θ falling outside the predefined variance of the angle σ for the remaining clusters.
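For illustration, one conventional way to obtain the angle between a vector and a common axis is from the normalized dot product. The source does not specify how the angle is computed, so the function angle_from_axis, the choice of degrees, and the sample vectors below are assumptions rather than the claimed method.

```python
import math


def angle_from_axis(vector, axis):
    """Angle in degrees between a vector and a reference axis."""
    dot = sum(v * a for v, a in zip(vector, axis))
    norm_v = math.sqrt(sum(v * v for v in vector))
    norm_a = math.sqrt(sum(a * a for a in axis))
    if norm_v == 0 or norm_a == 0:
        raise ValueError("angle is undefined for a zero-length vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_v * norm_a)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical two-dimensional concept space with the x-axis as the common axis.
common_axis = [1.0, 0.0]
theta_on_axis = angle_from_axis([2.0, 0.0], common_axis)
theta_diagonal = angle_from_axis([1.0, 1.0], common_axis)
theta_orthogonal = angle_from_axis([0.0, 3.0], common_axis)
```

A vector lying along the common axis yields an angle of zero, while a vector sharing nothing with the axis yields a right angle.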
A further embodiment is a system and method for efficiently generating cluster groupings in a multi-dimensional concept space. A plurality of terms are extracted from each document in a collection of stored unstructured documents. A concept space is built over the document collection. Terms substantially correlated between a plurality of documents within the document collection are identified. Each correlated term is expressed as a vector mapped along an angle θ originating from a common axis in the concept space. A difference between the angle θ for each document and an angle σ for each cluster within the concept space is determined. Each such cluster is populated with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance. A new cluster is created within the concept space for those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
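The populate-or-create logic described above can be sketched in a single pass, as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each document is reduced to a single angle in degrees, a new cluster takes the angle of the document that seeded it, each document joins only the first cluster it fits, and the name assign_clusters and the sample angles are hypothetical.

```python
def assign_clusters(doc_angles, variance):
    """Group documents by angle in one pass over the document set.

    doc_angles: mapping of document id to its angle (theta) from the common axis
    variance:   maximum allowed difference between theta and a cluster angle (sigma)
    """
    clusters = []  # each cluster: {"sigma": angle, "docs": [document ids]}
    for doc_id, theta in doc_angles.items():
        for cluster in clusters:
            # Populate an existing cluster when the difference is within the variance.
            if abs(theta - cluster["sigma"]) <= variance:
                cluster["docs"].append(doc_id)
                break
        else:
            # No existing cluster fits, so create a new one seeded at this angle.
            clusters.append({"sigma": theta, "docs": [doc_id]})
    return clusters


# Hypothetical document angles in degrees.
angles = {"a": 10.0, "b": 12.0, "c": 40.0, "d": 43.0, "e": 80.0}
clusters = assign_clusters(angles, variance=5.0)
```

With a variance of 5 degrees, the five sample documents settle into three clusters without the number of clusters being fixed in advance.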
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.