A large amount of information is now accessible through the Internet, intranets, and the like. However, as the amount of accessible information increases, it becomes more difficult to efficiently acquire only the necessary information. Therefore, there is a demand for a system which can adequately retrieve documents containing necessary information. Recently, an increasing number of companies use systems capable of retrieving intra-office documents (e.g., materials created with Microsoft Office products or the like) in order to share information within an organization.
A document retrieval system retrieves, from a database storing documents, the documents which fulfill the retrieval condition input by a searcher. Then, the document retrieval system displays the retrieval results in descending order of the relevancy to the retrieval condition, as judged by the document retrieval system. The retrieval condition is a key for document retrieval which is input to the document retrieval system, and is generally described by key words and a method of a logical operation on the key words. By designating the logical operation method, the searcher can designate the targets to be retrieved by the document retrieval system, such as a document containing all the key words, a document containing at least one of the key words, or a document which does not contain a specific key word. The process of arranging documents in descending order of the relevancy to the retrieval condition is called “ranking”, and the order obtained by ranking is called “rank”.
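The retrieval and ranking flow described above can be illustrated with a minimal sketch. The function names `matches` and `retrieve`, the parameters `all_of`, `any_of`, and `none_of`, and the use of a simple key word count as the relevancy are assumptions made only for illustration; an actual document retrieval system may judge relevancy differently.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if the text fulfills the logical operations on the key words."""
    words = set(text.lower().split())
    if any(k.lower() not in words for k in all_of):
        return False  # a required key word is missing
    if any_of and not any(k.lower() in words for k in any_of):
        return False  # none of the alternative key words is present
    if any(k.lower() in words for k in none_of):
        return False  # an excluded key word is present
    return True


def retrieve(documents, all_of=(), any_of=(), none_of=()):
    """Return (rank, doc_id) pairs, highest relevancy first.

    Relevancy is assumed here to be the number of key word occurrences.
    """
    keys = [k.lower() for k in (*all_of, *any_of)]
    hits = []
    for doc_id, text in documents.items():
        if matches(text, all_of, any_of, none_of):
            words = text.lower().split()
            score = sum(words.count(k) for k in keys)
            hits.append((score, doc_id))
    # Ranking: sort by descending score, breaking ties by document id.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [(rank, doc_id) for rank, (_, doc_id) in enumerate(hits, start=1)]


documents = {
    "a": "budget report budget plan",
    "b": "budget draft",
    "c": "travel plan",
}
ranked_any = retrieve(documents, any_of=["budget", "plan"])
ranked_filtered = retrieve(documents, all_of=["budget"], none_of=["draft"])
```

In this sketch, `ranked_any` lists all three documents with document "a" ranked first, while `ranked_filtered` contains only document "a", because document "b" contains the excluded key word and document "c" lacks the required one.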
Ranking is important in the document retrieval system. Searchers need considerable time and effort to view all the documents in the retrieval results. Therefore, most searchers view only the documents which are ranked high, and perform retrieval again under a different retrieval condition if the desired information is not found. That is, documents which are ranked low are significant to a searcher only in that they contribute to the number of documents found in the retrieval, and otherwise appear to be nonexistent. Therefore, there is a demand for a document rating calculating technique which ranks the documents needed by the searcher higher.
The quantification of the rating of a document is called “scoring”, and the quantified value obtained by scoring is called “score”. Scoring methods are roughly classified into three types: (1) a method which uses information in a document, (2) a method which uses information outside a document, and (3) a method which uses the operational history of a searcher. According to the method 1, the rating of a document is calculated based on the amount of inclusion of a character string given as the retrieval condition, the uniqueness of the character string in the retrieval condition, the co-occurrence relation between the character string and a character string in a document, the number of links to another document, and so forth. According to the method 2, the rating of a document is calculated based on the depth of the directory where a document is present, the date of creation of a document, the update date or the update frequency thereof, etc. According to the method 3, the rating of a document is calculated based on the number of references made by the searcher, the revisiting history and the like. In addition, the methods 1 to 3 may be combined. Since the method 1 involves scoring based on the contents of documents, it reflects the relevancy between a retrieval condition and a document in the ranking more easily than the methods 2 and 3. The following will describe techniques related to the method 1.
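As an illustration of the method 1, the following minimal sketch combines the amount of inclusion of each key word with its uniqueness across the stored documents. The weighting used here (occurrence count multiplied by a logarithmic rarity factor) is a common textbook form chosen as an assumption; it is not taken from any of the cited literature, and the function name `score` and the sample data are hypothetical.

```python
import math


def score(query_terms, doc_words, corpus):
    """Score one document from its own contents (method 1).

    doc_words: list of words in the document being scored.
    corpus:    list of word lists, one per stored document.
    """
    n_docs = len(corpus)
    total = 0.0
    for term in query_terms:
        tf = doc_words.count(term)                        # amount of inclusion
        df = sum(1 for words in corpus if term in words)  # documents containing the term
        if tf == 0 or df == 0:
            continue
        # Assumed uniqueness factor: rarer terms receive a larger weight.
        total += tf * math.log(1 + n_docs / df)
    return total


corpus = [
    ["patent", "search", "search"],
    ["patent", "ranking"],
    ["patent", "cooking"],
]
scores = [score(["patent", "search"], words, corpus) for words in corpus]
```

Here the first document scores highest because it contains "search" twice and no other document contains that word, whereas "patent" appears in every document and therefore contributes little.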
One example of the document rating calculating technique is described in Patent Literature 1. The “method and system for retrieving related information” described in Patent Literature 1 determines the rating of a document according to a plurality of criteria for sequencing. Even in a case where the rating of a document is low according to one criterion, if the document is ranked higher according to another criterion, it becomes easier for a searcher to find a necessary document. Further, the document size, the document update frequency, the number of links included, the ratio of key words contained, the number of related key words, the date of document creation, and the like are used as criteria for calculating the general rating of documents. Among these criteria, the document size, the document update frequency, the date of document creation, and the like correspond to the method 2.
One example of other document rating calculating techniques is described in Patent Literature 2. The “method for analyzing electronic document to be retrieved and electronic document registration system” described in Patent Literature 2 extracts table-of-contents information contained in a document, divides the body item by item in the table of contents, and registers the body segments. As a document is divided into items, the contents which fulfill a retrieval condition can be retrieved item by item. Although this system does not have a process of calculating the rating of a document, it is regarded as an existing technique for calculating document ratings item by item.
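The item-by-item registration and retrieval described above can be sketched as follows. The sketch assumes that each table-of-contents item appears in the body as a line of its own; the function names `split_by_contents` and `retrieve_items` and the sample data are hypothetical and are not taken from Patent Literature 2.

```python
def split_by_contents(toc_titles, body_lines):
    """Divide the body into segments, one per table-of-contents item."""
    titles = set(toc_titles)
    segments = {}
    current = None
    for line in body_lines:
        stripped = line.strip()
        if stripped in titles:
            # A heading line starts a new segment.
            current = stripped
            segments[current] = []
        elif current is not None:
            segments[current].append(line)
    return {title: "\n".join(lines) for title, lines in segments.items()}


def retrieve_items(segments, keyword):
    """Return the items whose body segment contains the key word."""
    return [t for t, text in segments.items() if keyword.lower() in text.lower()]


toc = ["Overview", "Installation"]
body = [
    "Overview",
    "This tool indexes documents.",
    "Installation",
    "Run the installer.",
]
segments = split_by_contents(toc, body)
found = retrieve_items(segments, "installer")
```

With this division, a search for "installer" returns only the "Installation" item rather than the whole document.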
Patent Literature 3 describes a character string retrieval device which performs fuzzy retrieval on a plurality of documents, each having sets of character strings hierarchized in one level or in two or more levels. Fuzzy retrieval includes, in the retrieval results, character strings in a document which do not exactly coincide with the specified character string. All character strings in a document are searched for the specified character string, and their degrees of coincidence are determined. The degree of coincidence means the degree to which a character string given as a retrieval condition coincides with a character string in the document. The degrees of coincidence of the character string sets in each level are totaled in order from the lowest level for each document, and the highest degree of coincidence is considered as the degree of coincidence for that level. Further, the degree of coincidence in the topmost level in each of the documents is considered as the degree of coincidence of that document. That is, the technique of Patent Literature 3 can be said to be a technique of specifying how much each item coincides with the retrieval condition.
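The level-by-level evaluation can be sketched as follows, under stated assumptions. The sketch represents a document as a nested dictionary with hypothetical keys `text` and `children`, measures the degree of coincidence with `difflib.SequenceMatcher` from the Python standard library, and interprets the aggregation as carrying the highest degree of coincidence upward from the lowest level. None of these choices is taken from Patent Literature 3 itself.

```python
from difflib import SequenceMatcher


def string_coincidence(query, text):
    """Best similarity (0 to 1) between the query and any same-length window of text."""
    if not query or not text:
        return 0.0
    width = len(query)
    last_start = max(len(text) - width, 0)
    return max(
        SequenceMatcher(None, query, text[i:i + width]).ratio()
        for i in range(last_start + 1)
    )


def level_coincidence(query, node):
    """Degree of coincidence of a level: the highest value found in it or below it."""
    own = string_coincidence(query, node.get("text", ""))
    children = [level_coincidence(query, c) for c in node.get("children", [])]
    return max([own, *children])


document = {
    "text": "Manual",
    "children": [
        {"text": "Installing the retreival engine"},  # misspelling is deliberate
        {"text": "License"},
    ],
}
degree = level_coincidence("retrieval", document)
```

Although no character string in the sample document exactly coincides with "retrieval", the misspelled item yields a high degree of coincidence, and that value becomes the degree of coincidence of the topmost level, that is, of the document.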
Patent Literature 4 defines, as the expected score of a document, the average result score of a partial set of other documents whose utilization information includes the utilization information of that document. Patent Literature 4 also describes that a document score is calculated by combining the result score and the expected score, each weighted with a significance degree which depends on the size of the partial set. The document ranking system of Patent Literature 4 ranks the document set whose retrieval is requested by a user by using the document scores of the individual documents in the document score database. Although the technique of Patent Literature 4 is basically equivalent to the method 3 of using the operational history, it is regarded as a technique of correcting the rating of a document with information outside the document.
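The weighted combination of the result score and the expected score can be sketched as follows. The weighting rule n / (n + k), under which a larger partial set gives the expected score a larger significance degree, and the constant `k` are assumptions chosen only for illustration; the actual significance degrees of Patent Literature 4 are not given here.

```python
def expected_score(subset_scores):
    """Average result score of the partial set of other documents."""
    return sum(subset_scores) / len(subset_scores) if subset_scores else 0.0


def document_score(result_score, subset_scores, k=5.0):
    """Blend the result score and the expected score.

    Assumption: the significance degree of the expected score grows with the
    size n of the partial set as n / (n + k); k is a hypothetical constant.
    """
    n = len(subset_scores)
    w_expected = n / (n + k)
    w_result = 1.0 - w_expected
    return w_result * result_score + w_expected * expected_score(subset_scores)


alone = document_score(0.2, [])            # no partial set: result score only
blended = document_score(0.2, [0.8] * 5)   # five similar documents: equal weights
```

With an empty partial set the document score equals the result score, and as the partial set grows the document score moves toward the expected score.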
Patent Literature 5 describes performing document retrieval, in combination with a retrieval device using index data, by determining an important word based on both the importance level of each word alone, which is extracted by a systematic scheme, and the importance level of the word in a specific context.
Patent Literature 6 describes that a scale expression word is extracted from an input text by referring to a set of scale expression words, which are words of an attribute that can have a quantitative value. Patent Literature 6 also describes that a word corresponding to any one of a word which is contiguous to the extracted scale expression word to form a compound word, a word modifying the extracted scale expression word, and a word which is modified by a phrase containing the extracted scale expression word is extracted as a scale expression related word. When a key word is weighted, a weight calculated based on a preset calculation method is imparted to the scale expression word or the scale expression related word.
Patent Literature 1: Unexamined Japanese Patent Application KOKAI Publication No. 2000-242647
Patent Literature 2: Unexamined Japanese Patent Application KOKAI Publication No. 2000-330979
Patent Literature 3: Unexamined Japanese Patent Application KOKAI Publication No. H06-301725
Patent Literature 4: Unexamined Japanese Patent Application KOKAI Publication No. 2002-342379
Patent Literature 5: Unexamined Japanese Patent Application KOKAI Publication No. 2003-271619
Patent Literature 6: Unexamined Japanese Patent Application KOKAI Publication No. 2005-301855