The present invention relates to data processing and more particularly, but not exclusively, relates to text analysis techniques.
Recent technological advancements have led to the collection of a vast amount of electronic data. These collections are sometimes arranged into corpora, each comprising millions of text documents. Unfortunately, the ability to quickly identify patterns or relationships which exist within such collections, and/or the ability to readily perceive underlying concepts within documents of a given corpus, remain highly limited. Common text analysis applications include information retrieval, document clustering, and document classification (or document filtering). Typically, such operations are preceded by feature extraction, document representation, and signature creation, in which the textual data is transformed to numeric data in a form suitable for analysis. In some text analysis systems, the feature extraction, document representation, and signature creation are the same for all applications. The Battelle SPIRE system provides an example in which each document is represented by a numeric vector called the SPIRE ‘signature’; all SPIRE applications then work directly with this signature vector.
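As a rough illustration of what transforming textual data to numeric data can look like, the following Python sketch builds a simple term-count vector for each document over a shared vocabulary. It is not the SPIRE signature, whose construction is not described here; the function names `tokenize` and `build_signatures` and the sample sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_signatures(documents):
    """Return (vocabulary, vectors) with one term-count vector per document.

    Assumption: a plain bag-of-words count stands in for whatever
    signature a real system would compute.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Shared, alphabetically ordered vocabulary across all documents.
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for terms a document does not contain.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


docs = [
    "Engine failure during climb",
    "Climb clearance and engine check",
]
vocabulary, vectors = build_signatures(docs)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Once every document is reduced to a vector of the same length, downstream operations such as retrieval, clustering, or classification can work on the vectors rather than on the raw text.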
In other text analysis systems (e.g., IBM's Intelligent Miner for Text), approaches for feature extraction, document representation or signature creation vary with the application. Desired features often differ for document clustering and document classification applications. In classification, a ‘training’ set of documents with known class labels is used to ‘learn’ rules for classifying future documents; features can be extracted that show large variation or differences between known classes. In clustering, documents are organized into groups with no prior knowledge of class labels; features can be extracted that show large variation or clumping between documents; however, because ‘true’ class labels are unknown, they cannot be exploited for feature extraction.
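The contrast between the two settings can be sketched in Python under simplifying assumptions. Here a feature is scored for classification by the spread of its per-class means, which requires labels, and for clustering by its variance across all documents, which does not. Both scoring rules, the function names, and the toy data are hypothetical illustrations and are not taken from any of the systems named above.

```python
from statistics import mean, pvariance


def supervised_scores(vectors, labels):
    """Score each feature by the spread of its per-class means (uses labels)."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(vectors[0])):
        class_means = [
            mean(v[j] for v, y in zip(vectors, labels) if y == c)
            for c in classes
        ]
        scores.append(max(class_means) - min(class_means))
    return scores


def unsupervised_scores(vectors):
    """Score each feature by its variance across all documents (no labels)."""
    return [pvariance([v[j] for v in vectors]) for j in range(len(vectors[0]))]


# Four toy documents described by two numeric features.
vectors = [[3, 1], [3, 0], [0, 1], [0, 0]]

# Two hypothetical labelings of the same documents.
labels_a = ["x", "x", "y", "y"]  # feature 0 separates the classes
labels_b = ["x", "y", "x", "y"]  # feature 1 separates the classes

print(supervised_scores(vectors, labels_a))
print(supervised_scores(vectors, labels_b))
print(unsupervised_scores(vectors))
```

In this sketch the supervised ranking changes with the labeling, while the unsupervised ranking is fixed by the documents alone, which is why a feature set tuned for one application may be a poor fit for the other.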
While generic systems facilitate the layering of multiple applications once a generic ‘signature’ is obtained, they may not perform as well in a specific application as systems developed specifically for that application. In contrast, the disadvantage of specialized systems is that they require separate development of feature extraction, document representation, or signature creation algorithms for each application, which can be time consuming and impractical for small research groups.
Furthermore, current schemes tend to group documents according to a unitary measure of semantic similarity; however, documents can be similar in different ‘respects’. For example, in an assessment of retrieval of aviation safety incident reports related to documents describing the Cali accident (M. W. McGreevy and I. C. Statler, NASA/TM-1998-208749), analysts judged incident reports as related or not to the Cali accident (based on NTSB investigative reports of the Cali accident) according to six different ‘respects’ exemplified by the questions asked of the analysts: (1) in some ways, the context of this incident is similar to the context of the Cali accident; (2) some of the events of this incident are similar to some of the events of the Cali accident; (3) some of the problems of this incident are similar to some of the problems of the Cali accident; (4) some of the human factors of this incident are similar to some of the human factors of the Cali accident; (5) some of the causes of this incident are similar to some of the causes of the Cali accident; and (6) in some ways, this incident is relevant to the Cali accident. Many existing systems do not account for these different dimensions of similarity.
Moreover, typical systems do not account for the confidence in observed relationships, the potential for multiple levels of meaning, and/or the context of observed relationships. Thus, there is an ongoing need for further contributions in this area of technology.