Large collections of documents have become increasingly available in electronically stored form due, in part, to the widespread adoption of computer-automated information and decision support systems. At the same time, electronically stored document collections have increasingly complemented and often supplanted traditional forms of printed communications. Electronically stored documents present several significant advantages over traditional printed formats, including efficient storage, rapid searchability, and immediate communication and publication over networks, including the Internet.
From a pragmatic standpoint, the availability of electronically stored document collections has presented both a treasure and a curse to those seeking information discovery and retrieval. These types of document collections have expanded to include various classes of information, such as word processing documents, electronic mail, World Wide Web (or simply “Web”) pages, spreadsheets, databases, and the like. And although now available in a highly searchable format, information embedded in documents stored in an electronic format must generally still be “mined” at a semantic level to discover and retrieve the data contained within. Mining out the semantic content of a document collection is essential to certain fields of endeavor, such as during the discovery phase of litigation. However, efficiently discovering and extracting such embedded semantic information can be an intractable problem, particularly when the size of the collection of documents is large.
Text mining is at the core of the information discovery process, and is described in D. Sullivan, “Document Warehousing and Text Mining, Techniques for Improving Business Operations, Marketing, and Sales,” Chs. 1-3, Wiley Computer Publishing (2001), the disclosure of which is incorporated by reference. Text mining involves the compiling, organizing and analyzing of document collections to support identification of types of information contained in the documents and to discover relationships between relevant facts. However, identifying relevant information can be difficult. First, extracting relevant content requires a high degree of precision and recall. Precision is the measure of how many of the documents returned in response to a query actually address the query criteria. Recall is the measure of how many of the documents that should have been returned by the query actually were returned. Typically, the broader and less structured the documents, the lower the degree of precision and recall. Second, analyzing an unstructured document collection without the benefit of a priori knowledge in the form of keywords and indices can present a potentially intractable problem space. Finally, synonymy and polysemy can cloud and confuse extracted content. Synonymy refers to multiple words having the same meaning, and polysemy refers to a single word having multiple meanings. Fine-grained text mining must reconcile synonymy and polysemy to yield meaningful results.
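By way of illustration only, precision and recall for a single query can be computed as set ratios. The following is a minimal sketch; the function name `precision_recall` and the document identifiers are hypothetical and do not appear in the cited reference.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the documents the query returned
    relevant: ids of the documents that actually address the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # returned documents that are also relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents returned, three truly relevant.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example, two of the four returned documents are relevant and two of the three relevant documents are returned, so a query can score reasonably on one measure while scoring poorly on the other.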
Text mining is a significant first step in the overall process of discovering semantic meanings within a document collection. A further problem involves classifying the documents within a collection with respect to ad hoc categories of interest. For instance, during the discovery phase of litigation, documents must often be categorized into distinct groups, such as “relevant,” “non-relevant,” and “privileged.” Generally, the various documents falling into each group share certain characteristics, which can often be expressed as concepts and terms.
Similarly, categorizing the documents themselves into groups of related documents may be necessary as an aid to post-text mining document analysis. Text mining creates a multi-dimensional problem space that can be difficult to comprehend intuitively because the concepts and terms present within the document collection overlap to varying degrees. Data visualization tools are available to display groups or “clusters” of documents, such as described in commonly-assigned U.S. Pat. No. 6,838,548, issued May 3, 2005, U.S. Pat. No. 6,778,995, issued Aug. 17, 2004, and U.S. patent application Ser. No. 10/084,401, filed Feb. 25, 2002, pending, the disclosures of which are incorporated by reference. Data visualization tools enable a user to rapidly comprehend and pare down the potential search field within a document collection, based on extracted concepts and terms.
In the prior art, text mining is performed in two ways. First, syntactic searching provides a brute force approach to analyzing and extracting content based on literal textual attributes found in each document. Syntactic searching includes keyword and proximate keyword searching as well as rule-based searching through Boolean relationships. Syntactic searching relies on predefined indices of keywords and stop words to locate relevant information. However, there are several ways to express any given concept. Accordingly, syntactic searching can fail to yield satisfactory results due to incomplete indices and poorly structured search criteria.
A more advanced prior art approach uses a vector space model to search for underlying meanings in a document collection. The vector space model employs a geometric representation of documents using word vectors. Individual keywords are mapped into vectors in multi-dimensional space along axes representative of query search terms. Significant terms are assigned a relative weight and semantic content is extracted based on threshold filters. Although substantially overcoming the shortcomings of syntactic searching, the multivariate and multidimensional nature of the vector space model can lead to a computationally intractable problem space. In addition, the vector space model fails to resolve the problems of synonymy and polysemy.
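A minimal sketch of a vector space search follows. It assumes TF-IDF as the relative term weighting, cosine similarity as the geometric comparison, and a fixed similarity threshold as the filter; the description above does not specify these choices, and the function names and sample documents are hypothetical. For simplicity, the sketch uses one axis per term in the collection rather than only the query search terms.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one sparse weighted term vector per document (assumed TF-IDF)."""
    tokenized = {d: tokenize(t) for d, t in docs.items()}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {d: {term: tf * idf[term] for term, tf in Counter(toks).items()}
               for d, toks in tokenized.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, threshold=0.1):
    """Return (doc_id, score) pairs at or above the threshold, best first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items()
         if t in idf}
    scored = ((d, cosine(q, v)) for d, v in vectors.items())
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: p[1], reverse=True)


docs = {
    "d1": "attorney reviewed contract",
    "d2": "lawyer reviewed agreement",
    "d3": "bank loan contract",
}
results = search("attorney contract", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Unlike a Boolean search, the model ranks `d3` as a partial match. However, `d2` shares no literal term with the query and is filtered out, which illustrates how synonymy remains unresolved. The number of axes also grows with the vocabulary, which hints at the computational cost for large collections.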
Therefore, there is a need for an approach to identifying semantic information within a document collection based on extracted concepts and terms. Preferably, such an approach would assign a score to each concept and term based on the inherent characteristics of each document and the overall document set.
There is a further need for an approach to clustering documents within a document collection with respect to similarities reflected by the scores assigned to the concepts and terms. Preferably, such an approach would accept a set of candidate seed documents for evaluation and initial clustering.