1. Field of the Invention
The present invention relates to a document processing apparatus for processing various types of documents, a word extracting apparatus for extracting a word from a text item including plural words, a word extracting method used in the document processing apparatus, and a storage medium for storing a word extracting program, and in particular relates to a document processing apparatus for calculating the degree of association between words, a word extracting apparatus for extracting a word in accordance with the degree of association between words, a word extracting method used in the document processing apparatus for calculating the degree of association between words, and a storage medium for storing a word extracting program for extracting a word in accordance with the degree of association between words.
2. Discussion of the Related Art
In a retrieval system which deals with an enormous number of documents, a retrieval method using keywords is generally adopted. When an arbitrary keyword (retrieval word) is inputted to the retrieval system as a retrieval condition, all the documents including the keyword in their contents are obtained as a result of retrieval. The retrieval according to this method is called a full text search. Another widely used method adds one or more keywords for retrieval to each document in advance and regards a document as a result of retrieval if one of its keywords matches an inputted retrieval word.
However, the above-described retrieval systems can obtain only the documents that include a word completely matching the retrieval word inputted by the user, or the documents to which a keyword completely matching the inputted retrieval word has been added.
In such retrieval systems, accordingly, a complete match between the retrieval word and the keyword is required, and it is impossible to obtain all the documents sought by the user. Therefore, as proposed by Japanese Patent Application Laid-Open No. 2-297290 (1990), a method is adopted which presents associate words of the retrieval word to the user based on an associate word dictionary and recommends preparation of a retrieval expression closer to the purpose of retrieval, in order to prevent oversight in retrieval.
For example, if the retrieval word inputted by a user is "SGML", the words "HTML", "ODA", "structured document" and so forth are acquired as the associate words of "SGML" from the associate word dictionary and offered to the user. The associate words determined to be appropriate by the user are connected with "SGML" by OR to execute retrieval, and thereby the possibility of oversight in retrieval is reduced.
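The OR-connection described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `build_or_query` and `search`, the quoted-term query syntax, the case-insensitive substring matching, and the sample documents are all hypothetical and are not taken from the cited application.

```python
def build_or_query(retrieval_word, selected_associates):
    """Connect the retrieval word and the user-approved associate words by OR."""
    terms = [retrieval_word, *selected_associates]
    return " OR ".join(f'"{term}"' for term in terms)


def search(documents, terms):
    """Full text search: return ids of documents containing at least one term.

    `documents` maps a document id to its text (hypothetical layout).
    Matching is a plain case-insensitive substring test.
    """
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term.lower() in text.lower() for term in terms)
    )


# Made-up sample collection for illustration only.
documents = {
    1: "An introduction to SGML",
    2: "HTML authoring guide",
    3: "Cooking recipes",
}

query = build_or_query("SGML", ["HTML", "ODA"])
hits_without_expansion = search(documents, ["SGML"])
hits_with_expansion = search(documents, ["SGML", "HTML", "ODA"])
```

In this toy collection, document 2 is found only after the associate word "HTML" is added, which is the kind of oversight the presentation of associate words is meant to reduce.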
A great deal of manpower is required to prepare the associate word dictionary manually; consequently, a method has been suggested for automatically acquiring the associate words by calculation based on the contents of the documents to be the object of retrieval. In this method, a word associated with another word is acquired by statistical processing on the basis of frequency information of the words appearing in the retrieval object documents.
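For calculating the associate words, mutual information, Dice-coefficient and t-score are mainly used as statistical values. The mutual information (MI), Dice-coefficient (DC) and t-score (TS) between the words word1 and word2 are defined as follows.

$$\mathrm{MI}(\mathrm{word1}, \mathrm{word2}) = \log_2 \frac{\mathrm{prob}(\mathrm{word1}, \mathrm{word2})}{\mathrm{prob}(\mathrm{word1})\,\mathrm{prob}(\mathrm{word2})} \tag{1}$$

$$\mathrm{DC}(\mathrm{word1}, \mathrm{word2}) = \frac{2\,\mathrm{prob}(\mathrm{word1}, \mathrm{word2})}{\mathrm{prob}(\mathrm{word1}) + \mathrm{prob}(\mathrm{word2})} \tag{2}$$

$$\mathrm{TS}(\mathrm{word1}, \mathrm{word2}) = \frac{M\,[\mathrm{prob}(\mathrm{word1}, \mathrm{word2}) - \mathrm{prob}(\mathrm{word1})\,\mathrm{prob}(\mathrm{word2})]}{\mathrm{prob}(\mathrm{word1})\,\mathrm{prob}(\mathrm{word2})} \tag{3}$$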
Assuming that the number of all of the documents to be the object of retrieval is M, the number of documents including both word1 and word2 is a, the number of documents including only word1 is b and the number of documents including only word2 is c, prob(word1, word2), prob(word1) and prob(word2) are expressed as follows:

$$\mathrm{prob}(\mathrm{word1}, \mathrm{word2}) = \frac{a}{M} \tag{4}$$

$$\mathrm{prob}(\mathrm{word1}) = \frac{a+b}{M} \tag{5}$$

$$\mathrm{prob}(\mathrm{word2}) = \frac{a+c}{M} \tag{6}$$
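As a worked illustration of equations (1), (2) and (4) to (6), the following sketch computes the mutual information and the Dice-coefficient directly from the document counts a, b, c and M. The function names are hypothetical, the handling of a = 0 (returning negative infinity) is an assumption not stated in the text, and the t-score is left out of the sketch.

```python
import math


def probabilities(a, b, c, M):
    """Equations (4)-(6): prob(word1, word2), prob(word1), prob(word2)."""
    return a / M, (a + b) / M, (a + c) / M


def mutual_information(a, b, c, M):
    """Equation (1). Assumption: returns -inf when the words never co-occur."""
    p12, p1, p2 = probabilities(a, b, c, M)
    if p12 == 0:
        return float("-inf")
    return math.log2(p12 / (p1 * p2))


def dice_coefficient(a, b, c, M):
    """Equation (2). Assumption: returns 0.0 when neither word occurs."""
    p12, p1, p2 = probabilities(a, b, c, M)
    if p1 + p2 == 0:
        return 0.0
    return 2 * p12 / (p1 + p2)


# Made-up counts: 8 documents, 2 contain both words, 2 contain only word1.
mi = mutual_information(a=2, b=2, c=0, M=8)
dc = dice_coefficient(a=2, b=2, c=0, M=8)
```

With these counts, prob(word1, word2) = 0.25, prob(word1) = 0.5 and prob(word2) = 0.25, so MI = log2(0.25 / 0.125) = 1 and DC = 0.5 / 0.75 = 2/3.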
For each of MI(word1, word2), DC(word1, word2) and TS(word1, word2), a larger value indicates a higher degree of association between word1 and word2. For obtaining associate words by using these statistical values and preparing the associate word dictionary, the following art was disclosed by "Bilingual Text Alignment Using Statistical and Dictionary Information", Haruno and Yamazaki, Information Processing Society of Japan, SIG Notes, 96-NL-112, pp. 23-30, 1996, "Automated Formation of Bilingual Dictionary Using Statistical Information", Ohmori et al., Proceeding of the Second Annual Meeting of the Association for Natural Language Processing, pp. 49-52, 1996, and so forth.
As the first step, all words (independent words) included in the documents to be the object of retrieval are extracted using a technique such as morphological analysis. At the same time, a pointer to the identifier of each document including an extracted word is recorded for that word. That is, a structure is generated from which the documents including a given word can be designated based on the word.
Next, as the second step, the following first process is applied, with the two words taken as word1 and word2, to every pair of the words extracted in the first step.
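The structure generated in the first step can be sketched as a mapping from each word to the set of identifiers of the documents that include it. In this sketch, lower-casing and whitespace splitting stand in for morphological analysis, which is a simplifying assumption; the name `build_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each extracted word to the set of ids of documents including it.

    `documents` maps a document id to its text. Whitespace tokenization is a
    stand-in for morphological analysis (assumption for illustration only).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


# Made-up sample collection for illustration only.
documents = {
    1: "SGML HTML",
    2: "HTML ODA",
    3: "ODA UNCTAD",
}
index = build_index(documents)
```

Looking up `index["html"]` then yields the identifiers of the documents including that word, which is what the later steps need to count documents per word and per word pair.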
The first process is described as follows.
The number of the documents including word1 (=a+b), the number of the documents including word2 (=a+c), and the number of the documents including both word1 and word2 (=a) are obtained, and each of them is divided by the number of all documents (=M); thus prob(word1), prob(word2) and prob(word1, word2) are calculated. Based on these values, MI(word1, word2) (or DC(word1, word2) or TS(word1, word2)) is obtained according to equation (1) (or equation (2) or (3)).
As the third step, the following second process is applied, with the word taken as word3, to each of the words extracted in the first step to prepare the associate word dictionary.
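The counting part of the first process can be sketched as below, assuming the first step has produced a mapping from each word to the set of identifiers of the documents including it. The name `pair_probabilities`, the tuple layout of the result and the sample index are assumptions made for illustration.

```python
from itertools import combinations


def pair_probabilities(index, M):
    """For every pair of words, return (prob(word1), prob(word2), prob(word1, word2)).

    `index` maps a word to the set of ids of documents including it;
    `M` is the number of all documents.
    """
    result = {}
    for word1, word2 in combinations(sorted(index), 2):
        docs1, docs2 = index[word1], index[word2]
        a = len(docs1 & docs2)          # documents including both words
        result[(word1, word2)] = (
            len(docs1) / M,             # (a + b) / M
            len(docs2) / M,             # (a + c) / M
            a / M,                      # a / M
        )
    return result


# Made-up index over 4 documents for illustration only.
index = {"sgml": {1, 2}, "html": {2, 3}, "oda": {4}}
probs = pair_probabilities(index, M=4)
```

Because every pair of extracted words is visited, the amount of work grows with the square of the vocabulary size; the resulting probabilities can then be fed into equation (1), (2) or (3).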
The second process is described as follows:
The following third process is applied, with the word taken as word4, to every word except word3, and each word obtained as a return value is recorded as an associate word of word3.
The third process is as follows:
If the value of MI(word3, word4) (or DC(word3, word4) or TS(word3, word4)) is larger than the predetermined threshold value T, word4 is the return value. Otherwise, there is no return value.
By execution of the above processes, the associate words corresponding to all the words extracted in the first step are obtained and retained in the associate word dictionary. The associate words to be registered in the associate word dictionary are limited to those having a value such as the mutual information MI larger than the threshold value T, and therefore it may be considered that only words having a relatively high degree of association are registered in the associate word dictionary.
In general, what type of lexicon the associate words of a specific word constitute greatly depends on the field to be the object of retrieval. For example, in the field of information processing, the associate words of "ODA" are "SGML", "HTML", "structured document" and so on, but in the field of economics/sociology, they are "official development assistance", "UNCTAD", "OOF" and so on. In the above-described conventional art, the contents of the obtained associate word dictionary are appropriate to the field which is the object of retrieval because the calculation of associate words is executed based on the contents of the documents to be the object of retrieval.
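The second and third processes can be sketched as a pair of nested loops over the extracted words. The function name `build_associate_dictionary`, the `score` callable (which would return MI, DC or TS for two words) and the numeric values in the usage example are hypothetical and chosen only for illustration.

```python
def build_associate_dictionary(words, score, threshold):
    """Record, for each word3, every other word4 whose score exceeds the threshold.

    `score(word3, word4)` is assumed to return the chosen statistical value
    (for example the mutual information) for the two words.
    """
    dictionary = {}
    for word3 in words:                           # second process
        associates = []
        for word4 in words:
            if word4 == word3:
                continue
            if score(word3, word4) > threshold:   # third process
                associates.append(word4)
        dictionary[word3] = associates
    return dictionary


# Made-up association values for illustration only.
toy_scores = {
    frozenset({"sgml", "html"}): 2.0,
    frozenset({"sgml", "oda"}): 1.5,
    frozenset({"html", "oda"}): 0.2,
}


def toy_score(word3, word4):
    return toy_scores[frozenset((word3, word4))]


dictionary = build_associate_dictionary(["sgml", "html", "oda"], toy_score, threshold=1.0)
```

With the threshold T set to 1.0, "html" and "oda" are each registered only with "sgml", because their mutual value of 0.2 does not exceed T.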
In an interactive document retrieval system, narrowing down the documents is conducted as the retrieving process proceeds, and as a result, detection of the desired document becomes easy.
However, in the conventional art, if the documents are narrowed down in the process of retrieval, there occurs a problem that the associate words generated based on the contents of all documents to be the objects of retrieval differ from those necessary for the user.
For example, even if the documents are narrowed down to the set belonging to the field of economics/sociology based on the bibliographic items, "SGML", "HTML", "structured document" and so on are obtained, against the user's will and in addition to the proper words, as the associate words of "ODA" according to the associate word dictionary prepared in conformance with the contents of all the documents.
Even in the case where the associate words are displayed in descending order of degree of association, keywords ranked at higher positions are not always close to the user's purpose of retrieval if many keywords not reflecting the user's will are included in the associate words as described above. Accordingly, it is a burden for the user to select the proper keywords from the obtained associate words.
Since the retrieval is conducted by a person, there is a physical and mental limit, called a futility point, to the number of associate words whose appropriateness he/she can judge. If the number of associate words presented to him/her exceeds the limit, it is impossible for him/her to select all the words suitable for the purpose of retrieval.
As described so far, in the conventional interactive retrieval system, the ratio of improper keywords to the obtained associate words increases as the documents are narrowed down in the course of the retrieving process. Besides, in order to present the associate words so that the appropriate keywords are sufficiently included, it is necessary to increase the number of associate words to be presented, and accordingly there occurs a problem that the number of presented associate words immediately reaches the futility point. In other words, it is practically impossible to utilize the presentation of the associate words.
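The mismatch described above can be illustrated with a toy collection. In this sketch the Dice-coefficient is computed from document counts, the threshold of 0.5 is arbitrary, and the five documents with their field labels are made up; the recomputation over the narrowed set is included only to show which associate words that set would support, for comparison with the dictionary prepared from all documents.

```python
def dice(docs_a, docs_b):
    """Dice-coefficient from document sets: 2a / ((a + b) + (a + c))."""
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0


def associates(word, doc_words, threshold):
    """Associate words of `word` computed from the given document set only."""
    index = {}
    for doc_id, words in doc_words.items():
        for w in words:
            index.setdefault(w, set()).add(doc_id)
    target = index.get(word, set())
    return sorted(
        w for w, ids in index.items()
        if w != word and dice(target, ids) > threshold
    )


# Made-up documents: id -> (field, words). For illustration only.
collection = {
    1: ("information processing", {"oda", "sgml"}),
    2: ("information processing", {"oda", "sgml", "html"}),
    3: ("economics", {"oda", "unctad"}),
    4: ("economics", {"oda", "unctad", "oof"}),
    5: ("economics", {"unctad"}),
}

all_docs = {doc_id: words for doc_id, (_, words) in collection.items()}
economics_docs = {
    doc_id: words for doc_id, (field, words) in collection.items()
    if field == "economics"
}

from_all = associates("oda", all_docs, threshold=0.5)
from_narrowed = associates("oda", economics_docs, threshold=0.5)
```

Here the dictionary based on all documents offers "sgml" and "unctad" for "oda", whereas the economics documents alone would support "oof" and "unctad"; a user who has narrowed the search to economics is therefore shown "sgml" and is not shown "oof".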