Document warehousing extends data warehousing to content mining and retrieval. Document warehousing attempts to extract semantic information from collections of unstructured documents to provide conceptual information with a high degree of precision and recall. Documents in a document warehouse share several properties. First, the documents lack a common structure or shared type. Second, semantically related documents are integrated through text mining. Third, essential document features are extracted and explicitly stored as part of the document warehouse. Finally, documents are often retrieved from multiple and disparate sources, such as over the Internet or as electronic messages.
Document warehouses are built in stages to deal with a wide range of information sources. First, document sources are identified and documents are retrieved into a repository. For example, the document sources could be electronic messaging folders or Web content retrieved over the Internet. Once retrieved, the documents are pre-processed to format and regularize the information in a consistent manner. Next, during text analysis, text mining is performed to extract semantic content, including identifying dominant themes, extracting key features and summarizing the content. Finally, metadata is compiled from the semantic context to explicate essential attributes. Preferably, the metadata is provided in a format amenable to normalized queries, such as those issued through database management tools. Document warehousing is described in D. Sullivan, “Document Warehousing and Text Mining, Techniques for Improving Business Operations, Marketing, and Sales,” Chs. 1-3, Wiley Computer Publishing (2001), the disclosure of which is incorporated by reference.
Text mining is at the core of the document warehousing process. Text mining involves the compiling, organizing and analyzing of document collections to support the delivery of targeted types of information and to discover relationships between relevant facts. However, identifying relevant content can be difficult. First, extracting relevant content requires a high degree of precision and recall. Precision is the measure of how well the documents returned in response to a query actually address the query criteria. Recall is the measure of how many of the documents that should have been returned by the query actually were returned. Typically, the broader and less structured the documents, the lower the degree of precision and recall. Second, analyzing an unstructured document collection without the benefit of a priori knowledge in the form of keywords and indices can present a potentially intractable problem space. Finally, synonymy and polysemy can cloud and confuse extracted content. Synonymy refers to multiple words having the same meaning and polysemy refers to a single word with multiple meanings. Fine-grained text mining must reconcile synonymy and polysemy to yield meaningful results.
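By way of illustration only, precision and recall for a single query can be computed from the set of documents returned and the set of documents that should have been returned. The following is a minimal sketch; the function name precision_recall and the document identifiers are hypothetical and do not come from the cited reference.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of documents the query returned
    relevant:  identifiers of documents that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # returned documents that are actually relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

In this example, two of the four returned documents are relevant, giving a precision of 0.5, and two of the three relevant documents were returned, giving a recall of two thirds.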
In particular, the transition from syntactic to semantic content analysis requires a shift in focus from the grammatical level to the meta level. At a syntactic level, documents are viewed structurally as sentences comprising individual terms and phrases. In contrast, at a semantic level, documents are viewed in terms of meaning. Terms and phrases are grouped into clusters representing individual concepts and themes.
Data clustering allows the concepts and themes to be developed more fully based on the extracted syntactic information. A balanced set of clusters reflects terms and phrases from every document in a document set. Each document may be included in one or more clusters. Conversely, concepts and themes are preferably distributed over a meaningful range of clusters.
Creating an initial set of clusters from a document set is crucial to properly visualizing the semantic content. Generally, a priori knowledge of semantic content is unavailable when forming clusters from unstructured documents. The difficulty of creating an initial cluster set is compounded when evaluating different types of documents, such as electronic mail (email) and word processing documents, particularly when included in the same document set.
In the prior art, several data clustering techniques are known. Exhaustive matching techniques fit each document into one of a pre-defined and fixed number of clusters using a closest-fit approach. However, this approach forces an arbitrary number of clusters onto a document set and can skew the meaning of the semantic content mined from the document set.
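The closest-fit behavior described above can be sketched as follows, purely as an illustration. The sketch assumes that each document has already been reduced to a numeric term vector and that cosine similarity is used as the measure of fit; the cluster centers, vectors, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_fit(doc_vectors, centers):
    """Assign every document to the most similar of a fixed set of centers."""
    return [
        max(range(len(centers)), key=lambda i: cosine(v, centers[i]))
        for v in doc_vectors
    ]


# Two pre-defined clusters (assumed), but the third document is about
# a topic that neither cluster represents.
centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0], [0.0, 0.1, 0.9]]
assignments = closest_fit(docs, centers)
print(assignments)
```

The third document is dominated by a term that neither center covers, yet it is still forced into the second cluster because the number of clusters is fixed in advance. This is the kind of skew noted above.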
A related prior art clustering technique performs gap analysis in lieu of exhaustive matching. Gaps in the fit of points of data between successive passes are merged, if necessary, to group documents into clusters. However, gap analysis is computationally inefficient, as multiple passes through a data set are necessary to effectively find a settled set of clusters.
Therefore, there is a need for an approach to forming clusters of concepts and themes into groupings of classes with shared semantic meanings. Such an approach would preferably categorize concepts mined from a document set into clusters defined within a pre-specified range of variance. Moreover, such an approach would not require a priori knowledge of the data content.
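For illustration only, one simple way to form clusters without fixing their number in advance is a single-pass, threshold-based scheme (sometimes called leader clustering). The sketch below is not the approach of this disclosure; it substitutes a minimum cosine similarity for the pre-specified range of variance, and all names, vectors, and the threshold value are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_clusters(vectors, min_similarity):
    """Single pass: join the best-fitting cluster if it is similar enough,
    otherwise start a new cluster seeded by the current vector."""
    centers, members = [], []
    for idx, v in enumerate(vectors):
        best, best_sim = None, -1.0
        for c, center in enumerate(centers):
            sim = cosine(v, center)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= min_similarity:
            members[best].append(idx)
        else:
            centers.append(list(v))  # the first member serves as the center
            members.append([idx])
    return members


docs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.0],
    [0.0, 0.1, 0.9],
    [0.8, 0.2, 0.0],
]
clusters = threshold_clusters(docs, min_similarity=0.8)
print(clusters)
```

Here the number of clusters emerges from the data and the threshold rather than being chosen beforehand: the first and fourth vectors share a cluster, while the second and third each start their own.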