The present invention relates to a technique for measuring the importance of words or word sequences in a group of documents, and is intended for use in supporting document retrieval and automatic construction of a word dictionary among other purposes.
FIG. 1 illustrates a document retrieval system having windows for displaying “topic words” in the retrieved documents, wherein the window on the right side selectively displays words in the documents displayed on the left side. Such a system is disclosed, for example, in Japanese Published Unexamined Patent Application No. Hei 10-74210, “Document Retrieval Supporting Method and Document Retrieving Service Using It” (Reference 1).
Kyo Kageura et al., “Methods of automatic term recognition: A review,” Terminology, 1996 (Reference 2), describes a method of calculating the importance of words. Methods to calculate the importance of words have long been studied with a view to automatic term extraction or to facilitating literature searching by weighting words characterizing a desired document.
Words may be weighted either to extract important words from a specific document or to extract important words from all documents. The best known technique in connection with the former is tf-idf. Here, idf is the logarithm of the quotient obtained by dividing the total number N of documents by the number N(w) of documents in which a given word w occurs, while tf is the frequency of occurrence f(w, d) of the word in a document d. tf-idf, as the product of these two factors, is represented by: f(w, d) × log2(N / N(w))
There are variations, including one that uses the square root of f(w, d): f(w, d)**0.5 × log2(N / N(w)). Although there are many other variations, tf-idf is, by its basic nature, designed to become “greater as the word occurs more frequently and concentrates in a smaller number of documents.”
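The two formulas above can be illustrated with a minimal sketch. The function name tf_idf, the sqrt_tf flag, and the toy document set are assumptions introduced here for illustration only; documents are assumed to be already split into lists of words.

```python
import math
from collections import Counter


def tf_idf(word, doc, docs, sqrt_tf=False):
    """Return f(w, d) * log2(N / N(w)) for one word in one document.

    With sqrt_tf=True, f(w, d)**0.5 is used in place of f(w, d).
    `doc` is a list of words; `docs` is the list of all documents.
    """
    tf = Counter(doc)[word]                      # f(w, d)
    n_docs = len(docs)                           # N
    n_w = sum(1 for d in docs if word in d)      # N(w)
    if tf == 0 or n_w == 0:
        return 0.0
    if sqrt_tf:
        tf = tf ** 0.5
    return tf * math.log2(n_docs / n_w)


# Hypothetical toy document set (illustrative only).
docs = [
    ["retrieval", "retrieval", "retrieval", "retrieval", "word", "index"],
    ["word", "dictionary"],
    ["word", "sequence"],
    ["word", "topic"],
]

print(tf_idf("retrieval", docs[0], docs))                # 4 * log2(4/1) = 8.0
print(tf_idf("retrieval", docs[0], docs, sqrt_tf=True))  # 2 * log2(4/1) = 4.0
print(tf_idf("word", docs[0], docs))                     # 1 * log2(4/4) = 0.0
```

In this toy set, “retrieval” is frequent and confined to a single document, so it scores high, whereas “word” occurs in every document and scores zero, which matches the basic nature of tf-idf quoted above.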
Though not stated in Reference 2, a natural way to extend this measure from the importance of a word in a specific document to the importance of the word in the set of all documents is to replace f(w, d) with f(w), which is the frequency of w in all documents.
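The extension to the set of all documents can be sketched in the same manner, by computing f(w) × log2(N / N(w)). The function name corpus_tf_idf and the toy document set are assumptions made for illustration; documents are assumed to be lists of words.

```python
import math


def corpus_tf_idf(word, docs):
    """Return f(w) * log2(N / N(w)), where f(w) is the frequency of the
    word summed over all documents."""
    f_w = sum(doc.count(word) for doc in docs)       # f(w)
    n_w = sum(1 for doc in docs if word in doc)      # N(w)
    if n_w == 0:
        return 0.0
    return f_w * math.log2(len(docs) / n_w)


# Hypothetical toy document set (illustrative only).
docs = [
    ["retrieval", "retrieval", "index"],
    ["retrieval", "word"],
    ["word", "index"],
    ["word", "topic"],
]

for w in ("retrieval", "word", "topic"):
    print(w, corpus_tf_idf(w, docs))
```

Here “retrieval” and “word” both occur three times in total, but “retrieval” is concentrated in two documents while “word” is spread over three, so “retrieval” receives the higher value.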
One of the methods to extract important words from all documents is to measure the accidentalness of differences in the frequency of occurrence of each word from one given document category to another, and to qualify as important words those whose differences have a higher degree of non-accidentalness. The accidentalness of differences can be measured by several measures, including the chi-square test. This method requires the document set to be categorized in advance.
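As one possible reading of the chi-square approach, the sketch below computes the Pearson chi-square statistic over a 2 × k table (the word versus all other words, across k categories). The table layout, the function name chi_square_for_word, and the counts are assumptions for illustration; the source does not specify how the test is set up.

```python
def chi_square_for_word(word_counts, category_totals):
    """Pearson chi-square statistic for one word across document categories.

    word_counts[c]     : occurrences of the word in category c
    category_totals[c] : total number of word occurrences in category c
    A larger value means the frequency differences between categories are
    less likely to be accidental.
    """
    total = sum(category_totals)
    word_total = sum(word_counts)
    stat = 0.0
    for observed_word, cat_total in zip(word_counts, category_totals):
        cells = (
            (observed_word, word_total),                      # the word itself
            (cat_total - observed_word, total - word_total),  # all other words
        )
        for observed, row_total in cells:
            expected = row_total * cat_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: two categories of 100 word occurrences each.
print(chi_square_for_word([10, 10], [100, 100]))  # evenly spread word
print(chi_square_for_word([20, 0], [100, 100]))   # word confined to one category
```

A word spread evenly over the categories yields a statistic of zero, while a word confined to one category yields a large value and would be qualified as important under this method.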
In a separate context from these studies, there has been a series of attempts, from the standpoint of natural language processing, to identify collections of words (or word sequences) which qualify as important words (or word sequences). In these studies, methods have been proposed by which the words (or word sequences) to be judged as important are restricted by the use of grammatical knowledge together with the intensity of the co-occurrence of adjoining words as assessed by various measures. Measures used for this purpose include (pointwise) mutual information, the log-likelihood ratio and so forth.
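As an illustration of one such measure, the sketch below computes pointwise mutual information, log2(P(x, y) / (P(x) P(y))), for adjoining word pairs. Estimating the probabilities from adjacent-pair counts, the function name adjacent_pmi, and the toy word sequence are assumptions made here; the grammatical restriction mentioned above is not modeled.

```python
import math
from collections import Counter


def adjacent_pmi(tokens):
    """Return {(x, y): PMI} for every adjoining word pair in `tokens`.

    P(x, y) is the share of adjacent pairs equal to (x, y); P(x) and P(y)
    are the shares of pairs with x in first and y in second position.
    """
    pairs = list(zip(tokens, tokens[1:]))
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(x for x, _ in pairs)
    second_counts = Counter(y for _, y in pairs)
    return {
        (x, y): math.log2(count * total / (first_counts[x] * second_counts[y]))
        for (x, y), count in pair_counts.items()
    }


# Hypothetical toy word sequence (illustrative only).
tokens = ["new", "york", "new", "york", "new", "jersey"]
for pair, score in sorted(adjacent_pmi(tokens).items()):
    print(pair, round(score, 3))
```

Pairs whose words occur together more often than their individual frequencies would suggest receive a higher value, which is the sense in which such measures assess the intensity of co-occurrence.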