1. Field of Invention
The present invention relates to a document retrieval device and more particularly to a document retrieval device that retrieves documents that have a tree structure, such as those where certain structural elements include other structural elements, and classifies and displays them for each structural element.
2. Description of Related Art
Generally, methods that extract key words from the content of a document interpret the key words broadly in order to retrieve documents, or portions of a document, having the characteristics of the key words. The key words are extracted from the document by comparing the document content to word lists prepared beforehand, using morphological analysis and the like.
Some systems extract key words from documents and perform document retrieval automatically. Using a technique referred to as the vector space model, the document and the document retrieval query (hereinafter, the query) are each expressed as a vector whose elements are the weights of the key words. The degree of similarity between each document vector and the query vector is then calculated, and the documents are output as retrieval results in order of their degree of similarity. In addition, documents whose vectors have a high degree of similarity are classified beforehand into sets, or categories, to make document retrieval more efficient.
In setting the weight corresponding to a key word, a positive value is set if the key word is used in the document, and 0 is set if it is not. The Term Frequency (TF) of the key word in the document and its Inverse Document Frequency (IDF) within the categories are also used in setting the weight.
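The weighting described above can be sketched as follows. This is a minimal illustration, not the weighting of any particular system: the function name `tf_idf_weights`, the token-list inputs, and the smoothed form of IDF, log(1 + N/df), are all assumptions chosen so that every key word used in the document receives a positive weight and every unused key word has an implicit weight of 0.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, collection):
    """Weight each key word of one document by TF * IDF.

    doc_tokens: list of key words appearing in the document.
    collection: list of token lists used to compute document frequency.
    Key words absent from the document get no entry (implicit weight 0).
    """
    n_docs = len(collection)
    term_frequency = Counter(doc_tokens)
    weights = {}
    for word, count in term_frequency.items():
        # Number of documents in the collection that contain the key word
        document_frequency = sum(1 for tokens in collection if word in tokens)
        # Smoothed IDF (an assumption): always positive, larger for rarer key words
        idf = math.log(1 + n_docs / max(document_frequency, 1))
        weights[word] = count * idf
    return weights

# Hypothetical two-document collection
collection = [["document", "vector", "vector"], ["document", "space"]]
weights = tf_idf_weights(collection[0], collection)
```

In this toy collection, "vector" appears twice in the first document and nowhere else, so it receives a larger weight than "document", which appears in both documents.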
Using the extracted key words, such an automatic document retrieval system can calculate the degree of similarity between the query and a document, the degree of similarity among documents, the degree of similarity between documents and categories, and the like. There are various methods of calculating the degree of similarity, ranging from a simple method in which the similarity is determined by the number of common key words to methods that calculate the degree of similarity between vectors weighted according to the frequency of appearance and the dispersion of the key words. In the methods that calculate the degree of similarity between vectors, the inner product of the vectors and the cosine coefficient are often used. Most automatic document retrieval systems retrieve documents as whole units.
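The three similarity measures named above (number of common key words, inner product, and cosine coefficient) can be sketched as follows, together with ranking by similarity. Representing a vector as a dictionary from key word to weight, and all function names, are assumptions made for illustration.

```python
import math

def common_keyword_count(u, v):
    """Simple similarity: the number of key words the two vectors share."""
    return len(set(u) & set(v))

def inner_product(u, v):
    """Sum of products of the weights of key words present in both vectors."""
    return sum(weight * v[word] for word, weight in u.items() if word in v)

def cosine(u, v):
    """Inner product normalized by vector lengths; 0.0 if either vector is empty."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)

def rank(query, documents):
    """Return document ids in descending order of cosine similarity to the query."""
    return sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)

# Hypothetical query and document vectors with binary weights
query = {"document": 1.0, "retrieval": 1.0}
documents = {
    "a": {"document": 1.0, "retrieval": 1.0, "vector": 1.0},
    "b": {"space": 1.0, "visualization": 1.0},
}
ranking = rank(query, documents)
```

The cosine coefficient, unlike the raw inner product, does not favor long documents merely because they contain more key words.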
However, when the unit of retrieval is the whole document, the retriever receives entire documents as retrieval results and, regardless of document size, cannot locate the necessary sections without looking over the entire body of each document. Conversely, when a document covers multiple topics, a query relating to one of those topics may fail to retrieve the document, because the degree of similarity between the whole body of the document and the query is low.
These problems are explained below using an example. FIG. 29 illustrates an example of the result of automatic document retrieval. Four documents, 301, 302, 303, and 304, are shown together with the key word sets extracted from their respective content. To simplify the explanation, the number of key words that appear in each document is smaller than in actual documents.
The four documents 301, 302, 303, 304 can be classified into two categories, 310 and 320, based on the degree to which they share key words. From the key words common to document 301 and document 302, including "information", "space", "visualization", "structure", "architecture", and "experiment", category 310 can be assumed to contain documents describing systems for the visualization of the information space. Similarly, from the key words common to document 303 and document 304, including "document", "similarity", "vector", "classification", "experiment", "evaluation", and "precision", category 320 can be assumed to contain documents describing systems that classify documents based on the degree of similarity.
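The classification into categories 310 and 320 can be illustrated with a small sketch. The grouping rule used here (a Jaccard coefficient compared against a threshold, with greedy assignment) is an assumption, since the text does not state how the categories were formed. The common key words are those quoted above; the one extra key word added to each document is hypothetical.

```python
def jaccard(a, b):
    """Shared key words divided by all key words of the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_by_shared_keywords(docs, threshold):
    """Greedy grouping: add each document to the first category whose
    first member shares enough key words with it, else open a new category."""
    categories = []
    for doc_id, keywords in docs.items():
        for category in categories:
            if jaccard(keywords, docs[category[0]]) >= threshold:
                category.append(doc_id)
                break
        else:
            categories.append([doc_id])
    return categories

common_310 = {"information", "space", "visualization",
              "structure", "architecture", "experiment"}
common_320 = {"document", "similarity", "vector", "classification",
              "experiment", "evaluation", "precision"}

# The extra key word in each set is hypothetical
docs = {
    "301": common_310 | {"browser"},
    "302": common_310 | {"retrieval"},
    "303": common_320 | {"retrieval"},
    "304": common_320 | {"cluster"},
}
categories = group_by_shared_keywords(docs, threshold=0.5)
```

With these sets, documents 302 and 303 fall into different categories even though both contain the key word "retrieval", because each shares far more key words with its category partner.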
FIG. 30 illustrates each document shown in FIG. 29 with its paragraph structure. Each document is separated into multiple paragraphs, and key words are extracted from each paragraph. The key word set corresponding to each document illustrated in FIG. 29 is the logical sum (union) of the key word sets of all the paragraphs of that document.
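The relationship between paragraph-level and document-level key word sets can be sketched as follows. The paragraph key word sets are hypothetical, since the actual sets of FIG. 30 are not reproduced in the text, and the function names are illustrative.

```python
# Hypothetical key-word sets per paragraph; not the actual sets of FIG. 30
documents = {
    "302": [
        {"information", "space", "visualization"},
        {"document", "retrieval", "structure"},
    ],
    "303": [
        {"document", "similarity", "vector"},
        {"document", "retrieval", "classification"},
    ],
}

def document_keywords(paragraphs):
    """Key-word set of a document: the union (logical sum) of its paragraphs' sets."""
    return set().union(*paragraphs)

def paragraphs_mentioning(documents, topic):
    """Return (document id, 1-based paragraph number) pairs whose
    key-word set contains every word of the topic."""
    return [
        (doc_id, number)
        for doc_id, paragraphs in documents.items()
        for number, keywords in enumerate(paragraphs, start=1)
        if topic <= keywords
    ]

hits = paragraphs_mentioning(documents, {"document", "retrieval"})
```

Searching at the paragraph level identifies the individual paragraphs that match a topic, which the document-level union alone cannot do.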
If the user is interested in "document retrieval", the parts of the documents illustrated in FIG. 30 that are thought to be related to "document retrieval" are the second paragraph of document 302 and the second paragraph of document 303.
However, document 302 and document 303 are classified into different categories, 310 and 320, respectively, in FIG. 29. Moreover, the subject of category 310 is "information space visualization" and the subject of category 320 is "classification of documents", and neither subject has any apparent relevance to "document retrieval". As a result, although documents that mention "document retrieval" are classified into categories 310 and 320, it is difficult for the user to guess that related documents are to be found in those categories.
Thus, when the unit of retrieval is the document, even if a document covers multiple topics, the secondary topics are buried under the main subject of the body of the document. As a result, a document cannot be retrieved even though it contains a topic that is clearly relevant.
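The burying effect can be illustrated numerically. The sketch below assumes binary key word weights, so that the cosine coefficient reduces to a formula on sets; the paragraph key word sets and the query are hypothetical.

```python
import math

def set_cosine(a, b):
    """Cosine coefficient for binary key-word vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical paragraph key-word sets for one multi-topic document
paragraphs = [
    {"information", "space", "visualization", "structure"},
    {"document", "retrieval", "query"},
    {"architecture", "experiment"},
]
whole_document = set().union(*paragraphs)
query = {"document", "retrieval"}

document_score = set_cosine(query, whole_document)
paragraph_scores = [set_cosine(query, p) for p in paragraphs]
```

Under these assumptions the second paragraph scores higher against the query than the document as a whole does, because the key words of the other paragraphs lengthen the document vector without adding any overlap with the query.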
When the documents are partitioned into logical structural elements, for example, chapters, sections, paragraphs, and the like, one method that resolves the above problems in an automatic document retrieval system is to retrieve the partitioned structural elements as units.
For example, chapter headings and paragraphs are extracted from the document, and the degree of similarity between the query and the chapter heading, as well as the degree of similarity between the query and the paragraphs, are calculated separately. The two degrees of similarity are added, and the sum is taken as the degree of similarity between the query and the whole body of the chapter. This method is disclosed in Japanese laid-open patent publication 4-84271, "document content retrieval device", which outputs chapter units as retrieval results in descending order of the degree of similarity. Using this method, chapters in which words relating to the query are included in both the chapter heading and the paragraphs can be retrieved and ranked higher than chapters that include the words in only one of the two.
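The additive scoring described above can be sketched as follows. The text does not say which similarity measure the cited publication uses, so the count of shared key words is assumed here purely for illustration; the function names and the chapter data are hypothetical.

```python
def overlap(query, keywords):
    """Assumed similarity measure: number of key words shared with the query."""
    return len(set(query) & set(keywords))

def chapter_score(query, heading_keywords, paragraph_keywords):
    """Similarity to the heading plus similarity to the paragraphs."""
    return overlap(query, heading_keywords) + overlap(query, paragraph_keywords)

def rank_chapters(query, chapters):
    """chapters maps a chapter name to (heading key words, paragraph key words).
    Returns chapter names in descending order of score."""
    return sorted(
        chapters,
        key=lambda name: chapter_score(query, *chapters[name]),
        reverse=True,
    )

# Hypothetical chapters
chapters = {
    "heading_only": ({"retrieval"}, {"layout"}),
    "both": ({"retrieval"}, {"retrieval", "index"}),
    "neither": ({"color"}, {"layout"}),
}
ranking = rank_chapters({"retrieval"}, chapters)
```

A chapter that matches the query in both its heading and its paragraphs accumulates two contributions and so ranks above a chapter that matches in only one place.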
However, this method assumes that documents have only chapters and paragraphs, and gives no consideration to documents having a more detailed structure. Also, each chapter is treated as independent information, and no consideration is given to the position that the content of a chapter occupies in the document, in other words, its context.
FIG. 31 is an example of documents having logical structural elements comprising titles, chapters, sections, paragraphs and the like.
In FIG. 31, Paragraph P2 has a context: it belongs, for example, to the section on "document structure analysis", within the chapter on "usage of natural language processing", in a document titled "technological advancements in information retrieval".
However, when this paragraph is retrieved by the above method, the only targets of the similarity calculation are the key words "sentence, meaning, role, retrieval", which were extracted from Paragraph P2, and the key words "document, structure, analysis", which were extracted from the heading of Chapter 3 Section 2; the above context is not considered at all. Therefore, Paragraph P2 is not recognized as describing technology that uses natural language processing, and as a result it cannot be retrieved by a query relating to natural language processing.
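The loss of context can be made concrete with the key words quoted above. In the sketch below, the key word sets for Paragraph P2 and for the heading of Chapter 3 Section 2 are taken from the text, while the tokenization of the chapter heading, the title, and the query is an assumption. Merging the ancestors' key words is shown only to indicate what information the prior method leaves out; it is not presented as the method of the invention.

```python
# Key words quoted in the text
p2_keywords = {"sentence", "meaning", "role", "retrieval"}
section_heading = {"document", "structure", "analysis"}

# Assumed tokenization of the enclosing chapter heading, the title, and the query
chapter_heading = {"usage", "natural", "language", "processing"}
title = {"technological", "advancements", "information", "retrieval"}
query = {"natural", "language", "processing"}

def overlap(query, *keyword_sets):
    """Number of query key words found in the union of the given sets."""
    combined = set().union(*keyword_sets)
    return len(query & combined)

# Prior method: only the paragraph and its immediate heading are compared
without_context = overlap(query, p2_keywords, section_heading)

# For contrast: the same comparison with the ancestors' key words included
with_context = overlap(query, p2_keywords, section_heading, chapter_heading, title)
```

Without the enclosing chapter heading, no query key word matches, so Paragraph P2 receives a similarity of zero for this query.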
Additionally, a similar problem occurs if classification into categories is performed in units of structural elements without considering the context. In FIG. 31, if Paragraph P2 is classified based only on its own content, it is not classified as "the paragraph relating to document structure analysis in natural language processing".
Thus, with these methods, each structural element of the document is treated as holding only the information that it directly contains.
The objective of the present invention is to provide a document retrieval device that makes it possible to retrieve documents in units of structural elements while taking into consideration the context of the entire body of the document.