Prior-art term extraction methods include structure-based methods and statistically based methods.
In structure-based methods, terms are extracted from documents based on the structures produced by finite-state automaton parsing (e.g., Grefenstette, 1994), full syntactic parsing (e.g., Bourigault, 1993; Jacquemin, 1994; Strzalkowski, 1995) or deep semantic theory and analysis (e.g., Pustejovsky et al., 1993). These methods have two difficulties: i) they render term extraction dependent on syntactic parsing or semantic analysis, which are generally more difficult problems; and ii) the extracted terms are limited by the selected structures.
In statistically based methods, the terms are extracted statistically from a set of documents (e.g., Frantzi and Ananiadou, 1995; Church and Hanks, 1989; Dunning, 1993). In general, a term may be a word or a word string. In the statistical approach, a term has two features: it is statistically significant in the documents and, if it consists of more than one word, its member words are strongly associated. Therefore, to determine whether a term candidate is a real term, both its significance in the documents and the association between its member words must be considered.
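The two features can be illustrated with a minimal sketch. It assumes relative frequency as the significance measure and pointwise mutual information as the association measure; both are choices made here for illustration only, and the cited methods use a variety of other statistics. The function name `bigram_scores` and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens, w1, w2):
    """Return (relative frequency, PMI) for the candidate term (w1, w2).

    Relative frequency stands in for significance and pointwise mutual
    information (PMI) stands in for association; both are illustrative
    assumptions, not the measures of any particular cited method.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        # The candidate never occurs, so it has no support at all.
        return 0.0, float("-inf")

    p_pair = pair_count / n_bi
    p_w1 = unigrams[w1] / n_uni
    p_w2 = unigrams[w2] / n_uni
    # PMI compares the observed co-occurrence with what independence predicts.
    pmi = math.log2(p_pair / (p_w1 * p_w2))
    return p_pair, pmi


# Hypothetical toy text.
tokens = "the neural network trains the neural network on the data".split()
freq, pmi = bigram_scores(tokens, "neural", "network")
freq_weak, pmi_weak = bigram_scores(tokens, "the", "data")
print(f"neural network: freq={freq:.3f} pmi={pmi:.2f}")
print(f"the data:       freq={freq_weak:.3f} pmi={pmi_weak:.2f}")
```

In this toy text the candidate "neural network" scores higher than "the data" on both measures, which is the pattern a statistical extractor looks for in a real term.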
As to the significance of a term candidate in the documents, existing methods generally treat the given documents as a single mixed document, and take the terms extracted from this mixed document as the final result (e.g., Chien et al., 1998; Schutze, 1998). This type of method implies a stringent requirement: the given documents must be similar enough that all the terms appear statistically significant when the documents are put together. However, the given documents may not be very similar and, moreover, very similar documents are not easy to acquire for most domains. If the given documents are not very similar, existing solutions will fail to identify terms that are statistically significant in only a few of the given documents but not in the document collection as a whole.
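The dilution effect described above can be shown with a small sketch. It assumes relative frequency as the significance measure and a fixed cutoff; the documents, the term "kernel", and the `THRESHOLD` value are all hypothetical and chosen only to make the effect visible.

```python
def relative_frequency(tokens, term):
    """Share of tokens equal to `term`; 0.0 for an empty document."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


# Hypothetical collection: the term is frequent in one document only.
docs = [
    ["kernel"] * 5 + ["text"] * 45,
    ["text"] * 50,
    ["text"] * 50,
    ["text"] * 50,
]
THRESHOLD = 0.05  # hypothetical significance cutoff

# Pooled view: all documents are treated as a single mixed document.
pooled = [tok for doc in docs for tok in doc]
pooled_freq = relative_frequency(pooled, "kernel")
found_pooled = pooled_freq >= THRESHOLD

# Per-document view: the same test applied to each document separately.
per_doc = [relative_frequency(doc, "kernel") for doc in docs]
found_in_docs = [i for i, f in enumerate(per_doc) if f >= THRESHOLD]

print("pooled frequency:", pooled_freq, "significant:", found_pooled)
print("documents where the term is significant:", found_in_docs)
```

The term clears the cutoff in the first document but falls below it once the four documents are merged, so a method that looks only at the mixed document would miss it.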
The deeper reason for this problem is that existing methods do not take into account the infrastructure of the given documents. If the prior-art methods could specify the infrastructure and identify the document clusters from it (where document clusters are subsets of the documents whose members are similar to some degree), then even if the given documents were not very similar, they would not miss the terms hidden in the document clusters.
Thus, prior-art statistical solutions are effective only when the given documents are very similar; if the given documents are not very similar, some terms may not be accessible to those solutions.
It would be desirable to have a method in which the given documents need not be very similar. It would further be desirable to obtain a hierarchical classification of the given documents while extracting terms. Also, it would be desirable to be able to access the terms hidden in document clusters within the given document collection.