1. Field of the Invention
The present invention relates to a method of and an apparatus for retrieving documents matching an indicated condition from a large number of documents.
2. Description of the Related Art
According to one conventional document retrieval process, documents that contain all or some of the entered keywords are retrieved from a large number of documents. This document retrieval process is provided as a service for retrieving various documents that are available on the Internet or through personal computer communication services, and also as software for retrieving documents stored on a hard disk. However, entering a keyword or keywords to indicate a retrieval condition is not effective enough to narrow a large number of documents down to only those documents which the user wants, and is disadvantageous in that the retrieved documents tend to include many documents which match the condition but do not meet the user's needs. Although some services for retrieving documents available on the Internet allow the user to add a keyword or keywords to further narrow down the retrieved documents, they fail to completely eliminate unwanted documents.
To solve the above problems, there have been proposed processes for classifying retrieved documents according to factors other than keywords and presenting the classified documents to the user. For example, Japanese laid-open patent publications Nos. 8-235160 and 9-231238 disclose processes for classifying retrieved documents.
Specifically, Japanese laid-open patent publication No. 8-235160 discloses a method of and an apparatus for retrieving documents. According to the disclosed method and apparatus, if the number of retrieved documents is greater than a preset value, the retrieved documents are classified according to attribute data such as document names, document registration dates, etc. assigned to the documents, and the classified documents are presented to the user.
Japanese laid-open patent publication No. 9-231238 discloses a method of and an apparatus for displaying retrieved texts. According to the disclosed method and apparatus, the subjects of retrieved texts are analyzed and divided into a plurality of groups, so that the texts are classified and displayed.
A process for classifying a plurality of documents, disclosed in Japanese laid-open patent publication No. 10-320411, extracts keywords with 5W1H attributes from documents, and classifies the documents into a two-dimensional matrix with the extracted keywords with 5W1H attributes.
However, the above document retrieving processes often fail to narrow documents down to suitable documents for the user or to provide suitably classified documents.
For example, it is assumed that a user who wishes to stay in "X hotel" tries to retrieve documents containing the keyword "X hotel" in order to obtain the information necessary to stay there. The information required by the user includes the contact information and the address of "X hotel", and the documents required by the user are the documents containing that information. However, the condition that the keyword "X hotel" be included in documents is not specific enough by itself to narrow a large number of documents down to only those documents which contain the contact information and the address of "X hotel". For example, documents retrieved under the above condition may include a news report stating that a new product has been presented in the X hotel and a diary-like Web document which states that someone enjoyed a dinner at a restaurant in the X hotel, though these documents are not required by the user. Since the condition that the contact information and the address be included in documents cannot be expressed by keywords, it is impossible to limit retrieved documents and exclude unwanted documents by adding a keyword or keywords.
With the method of and the apparatus for retrieving documents disclosed in Japanese laid-open patent publication No. 8-235160, retrieved documents can be classified according to attributes assigned to the documents. Therefore, attributes necessary to classify documents need to be assigned to the documents in advance. Unless the contact information and the address are recorded as attributes of documents, the retrieved documents cannot be classified into documents with the contact information and the address and documents without them. In particular, it is difficult for the disclosed system to deal with Web documents available on the Internet.
According to the disclosed method and apparatus of Japanese laid-open patent publication No. 9-231238, the retrieved texts are classified according to their subjects into those texts whose subjects contain the contact information and the address and those texts whose subjects do not. However, some texts whose subjects do not contain the contact information and the address may contain that information in their bodies. For example, a news report whose subject states that the X hotel has added a new annex may contain the contact information and the address in its body. Therefore, the disclosed classification principle may not necessarily be effective to classify retrieved documents into those required by the user and those not required by the user.
An apparatus for and a method of classifying documents and a recording medium which stores a program for classifying documents, as disclosed in Japanese laid-open patent publication No. 10-320411, are capable of classifying documents with keywords with 5W1H attributes extracted from the documents. However, the type of 5W1H as a key for classification needs to be indicated by the user each time documents are to be classified. Furthermore, since documents are classified according to the unit of 5W1H, they cannot be classified according to smaller units including address, nearby station, telephone number, and e-mail address.
It is therefore an object of the present invention to provide a method of and an apparatus for easily retrieving documents that are required by the user.
A document retrieval apparatus according to a first aspect of the present invention classifies retrieved documents based on whether documents contain attribute elements representing specific contents related to certain attributes (concepts), and classifies documents containing attribute elements related to the certain attributes according to types of the certain attributes. The attribute elements represent elements which specifically indicate the contents of certain attributes, such as address, telephone number, nearby station, price, date, time, e-mail address, URL, company name, product name, type number, in the documents. For example, an attribute element representing an attribute of address is "Chiyoda ward, Tokyo metropolis", and an attribute element representing an attribute of price is "12,000 yen".
Specifically, the document retrieval apparatus has a classification attribute storage storing only types of indicated attributes, among a plurality of types of attributes that can be used to classify documents, an attribute analyzing means for analyzing each of the retrieved documents to determine whether an attribute element belonging to the types of attributes stored in the classification attribute storage is contained in the document or not, and an attribute classifying means for classifying each of the retrieved documents such that documents containing the same type of attribute elements fall in the same category and documents containing no attribute elements fall in an independent category.
The attribute analyzing means analyzes each of the retrieved documents, and sends information indicating which one of the types of attributes stored in the classification attribute storage an attribute element contained in the document belongs to, to the attribute classifying means. Based on the sent information, the attribute classifying means decides whether or not each of the retrieved documents contains an attribute element belonging to any one of the types stored in the classification attribute storage. If the document contains an attribute element, then the attribute classifying means classifies the document into a category corresponding to the type of the attribute element contained therein. If the document does not contain an attribute element belonging to any one of the types stored in the classification attribute storage, then the attribute classifying means classifies the document into a category of documents containing no attribute elements.
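The flow of the first aspect can be sketched as follows. This is only a minimal illustration under stated assumptions: the regular expressions, the function names `analyze` and `classify_by_type`, and the sample documents are hypothetical and are not taken from the disclosure, which does not specify how attribute elements are recognized. The sketch also assumes that a document containing elements of several stored types is placed in each corresponding category.

```python
import re
from collections import defaultdict

# Hypothetical recognizers for a few attribute types (assumption: the
# disclosure does not state how attribute elements are detected).
ATTRIBUTE_PATTERNS = {
    "telephone number": re.compile(r"\b\d{2,4}-\d{2,4}-\d{4}\b"),
    "e-mail address": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "price": re.compile(r"\b\d{1,3}(?:,\d{3})* yen\b"),
}

NO_ATTRIBUTE = "(no attribute elements)"


def analyze(text, classification_attributes):
    """Attribute analyzing step: list the stored types found in the text."""
    return [t for t in classification_attributes
            if ATTRIBUTE_PATTERNS[t].search(text)]


def classify_by_type(documents, classification_attributes):
    """Attribute classifying step: group document ids by attribute type.

    Only the types held in `classification_attributes` (standing in for
    the classification attribute storage) are used as classification keys.
    """
    categories = defaultdict(list)
    for doc_id, text in documents.items():
        found = analyze(text, classification_attributes)
        if not found:
            categories[NO_ATTRIBUTE].append(doc_id)
        for attribute_type in found:
            categories[attribute_type].append(doc_id)
    return dict(categories)


# Illustrative documents retrieved for the keyword "X hotel".
documents = {
    "doc1": "X hotel, reservations: 03-1234-5678, rooms from 12,000 yen.",
    "doc2": "A new product was presented at the X hotel yesterday.",
    "doc3": "Contact the X hotel at info@xhotel.example for details.",
}

if __name__ == "__main__":
    stored = ["telephone number", "e-mail address"]
    for category, ids in classify_by_type(documents, stored).items():
        print(category, ids)
```

In this sketch the type "price" has a recognizer but is ignored unless it is placed in the stored list, which mirrors the idea that only the indicated types of attributes are used for classification.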
A document retrieval apparatus according to a second aspect of the present invention classifies retrieved documents based on whether documents contain the same attribute element of a certain type. Specifically, the document retrieval apparatus has a classification attribute storage storing only types of indicated attributes, among a plurality of types of attributes that can be used to classify documents, an attribute element extracting means for extracting an attribute element belonging to the type of an attribute indicated by a user who has made a retrieval request, among the types of attributes stored in the classification attribute storage, from each of the retrieved documents, and an attribute element classifying means for classifying each of the retrieved documents such that documents containing the same attribute element fall in the same category and documents containing no attribute elements fall in an independent category.
The attribute element extracting means extracts an attribute element belonging to the type of an attribute indicated by a user who has made a retrieval request, among the types of attributes stored in the classification attribute storage, from each of the retrieved documents, and sends information indicating which document contains which attribute element to the attribute element classifying means. Based on the sent information, the attribute element classifying means decides whether or not each of the retrieved documents contains an attribute element of the type indicated by the user. If the document contains an attribute element, then the attribute element classifying means classifies the document into a category corresponding to the attribute element contained therein. If the document does not contain an attribute element of the type indicated by the user, then the attribute element classifying means classifies the document into a category of documents containing no attribute elements.
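The second aspect differs from the first in that the classification key is the extracted attribute element itself rather than its type. A minimal sketch follows, assuming hypothetical regular-expression extractors and hypothetical names (`extract_elements`, `classify_by_element`); none of these come from the disclosure. It further assumes that a document containing several distinct elements of the indicated type is placed in each matching category.

```python
import re
from collections import defaultdict

# Hypothetical extractors (assumption: the extraction technique is not
# specified in the disclosure).
EXTRACTORS = {
    "price": re.compile(r"\b\d{1,3}(?:,\d{3})* yen\b"),
    "telephone number": re.compile(r"\b\d{2,4}-\d{2,4}-\d{4}\b"),
}

NO_ELEMENT = "(no attribute elements)"


def extract_elements(text, attribute_type):
    """Attribute element extracting step: distinct elements of one type."""
    return sorted(set(EXTRACTORS[attribute_type].findall(text)))


def classify_by_element(documents, stored_types, indicated_type):
    """Attribute element classifying step: group ids by extracted element."""
    if indicated_type not in stored_types:
        raise ValueError("indicated type is not in the classification attribute storage")
    categories = defaultdict(list)
    for doc_id, text in documents.items():
        elements = extract_elements(text, indicated_type)
        if not elements:
            categories[NO_ELEMENT].append(doc_id)
        for element in elements:
            categories[element].append(doc_id)
    return dict(categories)


# Illustrative documents retrieved for the keyword "X hotel".
sample_documents = {
    "a": "Single room at the X hotel: 12,000 yen per night.",
    "b": "The X hotel offers a single room for 12,000 yen and a twin for 18,000 yen.",
    "c": "Someone enjoyed a dinner at a restaurant in the X hotel.",
}

if __name__ == "__main__":
    result = classify_by_element(sample_documents, ["price", "telephone number"], "price")
    for category, ids in result.items():
        print(category, ids)
```

Here documents "a" and "b" share the category for the same price element, while document "c", which contains no price, falls into the independent category.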
A document retrieval apparatus according to a third aspect of the present invention classifies retrieved documents such that documents containing attribute elements of a certain type which have similar meanings fall in one category. Specifically, the document retrieval apparatus has a classification attribute storage storing only types of indicated attributes, among a plurality of types of attributes that can be used to classify documents, a thesaurus storage storing words as hyperonyms of words, an attribute element extracting means for extracting an attribute element belonging to the type of an attribute indicated by a user who has made a retrieval request, among the types of attributes stored in the classification attribute storage, from each of the retrieved documents, and an attribute element thesaurus classifying means for classifying each of the retrieved documents such that documents whose extracted attribute elements correspond to the same hyperonym at a level indicated by the user fall in one category.
The thesaurus storage contains words arranged as hyperonyms and hyponyms in a hierarchical structure, with absolute levels assigned to respective levels of the hierarchical structure. The attribute element extracting means extracts an attribute element belonging to the type of an attribute indicated by the user, among the types of attributes stored in the classification attribute storage, from each of the retrieved documents, and sends information indicating which document contains which attribute element to the attribute element thesaurus classifying means. Based on the sent information, the attribute element thesaurus classifying means decides whether or not each of the retrieved documents contains an attribute element of the type indicated by the user. If the document contains an attribute element, then the attribute element thesaurus classifying means looks up the thesaurus storage, determines the word that is the hyperonym of the attribute element at the level indicated by the user, and classifies the document into a category corresponding to that hyperonym. If the document does not contain an attribute element of the type indicated by the user, then the attribute element thesaurus classifying means classifies the document into a category of documents containing no attribute elements.
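The thesaurus lookup of the third aspect can be sketched as follows. The thesaurus contents, the level numbering (level 1 at the top), and the names `hypernym_at_level` and `classify_by_hypernym` are all hypothetical and chosen only for illustration. The sketch takes already extracted attribute elements as input, and it assumes that an element with no thesaurus entry at the requested level is treated as if no element had been found, a case the disclosure does not address.

```python
from collections import defaultdict

# Hypothetical thesaurus: word -> (absolute level, hyperonym one level up).
# Level 1 is the top of the hierarchy in this sketch (assumption).
THESAURUS = {
    "Japan": (1, None),
    "Tokyo metropolis": (2, "Japan"),
    "Osaka prefecture": (2, "Japan"),
    "Chiyoda ward": (3, "Tokyo metropolis"),
    "Minato ward": (3, "Tokyo metropolis"),
    "Naniwa ward": (3, "Osaka prefecture"),
}

NO_ELEMENT = "(no attribute elements)"


def hypernym_at_level(word, level):
    """Walk up the hierarchy and return the word found at `level`, or None."""
    current = word
    while current in THESAURUS:
        current_level, parent = THESAURUS[current]
        if current_level == level:
            return current
        if current_level < level or parent is None:
            return None  # the word already sits above the requested level
        current = parent
    return None  # unknown word


def classify_by_hypernym(extracted, level):
    """Group document ids by the hyperonym of their elements at `level`.

    `extracted` maps each document id to the attribute elements already
    extracted from that document for the type indicated by the user.
    """
    categories = defaultdict(list)
    for doc_id, elements in extracted.items():
        keys = {hypernym_at_level(e, level) for e in elements}
        keys.discard(None)
        if not keys:
            categories[NO_ELEMENT].append(doc_id)
        for key in sorted(keys):
            categories[key].append(doc_id)
    return dict(categories)


# Illustrative extraction results for the attribute type "address".
extracted_addresses = {
    "d1": ["Chiyoda ward"],
    "d2": ["Minato ward"],
    "d3": ["Naniwa ward"],
    "d4": [],
}

if __name__ == "__main__":
    for category, ids in classify_by_hypernym(extracted_addresses, 2).items():
        print(category, ids)
```

Choosing a level closer to the top of the hierarchy merges more documents into each category, while a deeper level yields finer categories.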
A first advantage of the present invention is that it is possible for the user who has made a retrieval request to easily select documents containing an attribute element of the type required from a number of retrieved documents.
The reason for the first advantage is that the types of attributes to be actually used for classifying retrieved documents are selected from types of attributes that can be used as classification keys, e.g., address, telephone number, nearby station, price, date, time, e-mail address, URL, company name, product name, type number, etc., and stored in the classification attribute storage, and the retrieved documents are classified using only the types of attributes stored in the classification attribute storage. Specifically, since the effective classification keys (classification factors) differ for each field to which documents to be retrieved belong, if documents are classified using a classification key fixed to 5W1H, then the documents may not be classified in a manner allowing the user to sort out the documents easily. According to the present invention, since the user can select, from many types of attributes, a type of attribute suited to the field to which the documents to be retrieved belong and use the selected type of attribute as a classification key, the documents can be classified in a manner allowing the user to sort out the documents easily.
A second advantage of the present invention is that the retrieved documents can be divided into documents containing an attribute element in question and documents containing no attribute element in question. If documents containing no attribute element in question are not required, then the unwanted documents can easily be excluded from the retrieved documents.
The reason for the second advantage is that the attribute analyzing means analyzes each of the retrieved documents to determine which type of attribute element stored in the classification attribute storage is contained in the document, and the attribute classifying means classifies documents which do not contain the attribute elements of the types stored in the classification attribute storage into an independent category.
A third advantage of the present invention is that retrieved documents can be classified according to an attribute element of a certain type in the documents. As a result, the user who needs documents containing an attribute element of a certain type can obtain retrieved documents that have been classified according to specific contents of the documents, i.e., contents corresponding to an item required by the user. As a consequence, the retrieved documents can further be narrowed down.
The reason for the third advantage is that the attribute element extracting means extracts an attribute element of the type indicated by the user, and the attribute element classifying means classifies the retrieved documents such that documents containing the same attribute element fall in the same category.
A fourth advantage of the present invention is that retrieved documents containing attribute elements which have similar meanings are classified into one category so that the categories into which the retrieved documents are classified will not be too detailed. When the user specifies a level for classification, the user can obtain documents classified at a desired level of detail.
The reason for the fourth advantage is that the thesaurus storage holds words as hyperonyms of words, and the attribute element thesaurus classifying means determines a word as a hyperonym at a level indicated by the user from attribute elements extracted from the documents, and classifies each of the retrieved documents such that documents whose determined words are the same as each other fall in one category.
A fifth advantage of the present invention is that it is possible to reduce the number of categories so that there will not be too many categories for classifying retrieved documents.
The reason for the fifth advantage is the same as the reason for the fourth advantage. Specifically, the thesaurus is looked up, and documents containing attribute elements which have similar meanings are classified into one category, thereby reducing the number of categories.