1. Field of the Invention
The present invention relates to an information processing apparatus, a full text retrieval method, and a computer-readable encoding medium on which a computer program therefor is recorded.
2. Description of the Related Art
In many full text retrieval systems using an inverted index, when a retrieval result list is displayed, the degree of relevance between an input keyword and each retrieved document is expressed as a numerical score, and the documents are listed in descending order of score (for example, refer to “Hiroko Mano, Hideo Itoh, Yasushi Ogawa, ‘Ranking Retrieval In Document Retrieval’, Ricoh Technical Report, No. 29, Dec. 12, 2003”). In general, the score denotes the degree of relevance of each retrieved document to the input keyword. This degree of relevance is based on the frequency with which the keyword appears in the document.
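As an illustration only, the following minimal sketch shows an inverted index in which the score of a document is simply the raw appearance frequency of the keyword. The function names `build_inverted_index` and `ranked_search`, the whitespace tokenization, and the tie-breaking by document identifier are assumptions made for this example and are not taken from the cited report.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: appearance frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization stands in for real text analysis.
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def ranked_search(index, keyword):
    """Return (doc_id, score) pairs in descending score order; ties are ordered by doc_id."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "a": "report report budget",
    "b": "report schedule",
    "c": "budget",
}
index = build_inverted_index(docs)
results = ranked_search(index, "report")
```

Here document "a" contains the keyword twice and document "b" once, so "a" is listed first, and document "c" does not appear in the result list at all.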
In Web search engines, PageRank, which uses link information among documents, is used to score documents with respect to a retrieval word. However, PageRank is not effective for data in which no link information exists. Therefore, when document data or the like is retrieved within a business organization, a probabilistic model is generally preferred for ranking retrieval of the documents.
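The text above does not name a specific probabilistic model. Purely as an example, the following sketch uses Okapi BM25, a widely known probabilistic ranking function, with a non-negative variant of the inverse document frequency. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the whitespace tokenization are assumptions made for this illustration.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25 (illustrative sketch).

    Assumes docs is a non-empty mapping of doc_id to text.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    query_terms = query.lower().split()
    # Number of documents containing each query term.
    doc_freq = {
        term: sum(1 for tokens in tokenized.values() if term in tokens)
        for term in set(query_terms)
    }
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            f = term_freq[term]
            if f == 0:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length with b.
            length_norm = k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * f * (k1 + 1.0) / (f + length_norm)
        scores[doc_id] = score
    return scores


docs = {
    "plan": "project plan project schedule",
    "memo": "project memo",
    "menu": "lunch menu",
}
scores = bm25_scores(docs, "project schedule")
ranking = sorted(scores, key=lambda doc_id: -scores[doc_id])
```

No link information is needed: the score depends only on term frequencies, document lengths, and how many documents contain each term.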
When document data or the like is retrieved within a business organization, the retrieval subject rarely consists of original documents alone. For example, the same document distributed to each department is registered redundantly in a database. Moreover, documents such as project documents and specifications, whose contents are similar to each other and which differ only in version, are also registered in the database. Thus, if the ranking retrieval is simply conducted on such a document set, the scores of identical documents and analogous documents take approximately the same values. Accordingly, when the documents are listed in score order, analogous documents are displayed in succession and are not easy to distinguish from one another. In addition, it is difficult to reach a target document.
Japanese Laid-open Patent Application No. 2006-31209 discloses a full text retrieval method which allows a user to customize a score in the ranking retrieval of documents. With this full text retrieval method, original documents alone can be listed with higher scores. However, since a business organization stores a huge amount of document data, customizing the scores of all original documents is time-consuming and impractical. It is also difficult to identify which documents are the original documents.
To address this problem, document clustering is well known as a technology for classifying identical or analogous documents (for example, refer to “Kazuaki Kishida, Techniques of Document Clustering: A Review, Library and Information Science No. 49, 2003”). With the document clustering technology, identical or analogous documents among the retrieval subjects are grouped together. Then, only a representative document of each group is displayed as a retrieval result. As a result, it is possible to avoid displaying multiple documents having similar contents in the retrieval result.
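The grouping described above can be sketched as follows. This is not one of the techniques surveyed in the cited review; it is a simple greedy stand-in that assigns each document to the first group whose representative is sufficiently similar, using Jaccard similarity over word sets. The function names `jaccard` and `representatives`, the threshold value, and the sample documents are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def representatives(ranked_docs, threshold=0.6):
    """Group identical or analogous documents and keep one representative per group.

    ranked_docs is a list of (doc_id, text) pairs already in descending score order,
    so the first member of each group (the highest-scored one) is its representative.
    """
    groups = []  # each entry: (representative id, representative token set, member ids)
    for doc_id, text in ranked_docs:
        tokens = set(text.lower().split())
        for _rep_id, rep_tokens, members in groups:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(doc_id)
                break
        else:
            # No sufficiently similar group exists, so this document starts a new one.
            groups.append((doc_id, tokens, [doc_id]))
    return [(rep_id, members) for rep_id, _, members in groups]


ranked = [
    ("spec_v2", "project specification version two final"),
    ("spec_v2_copy", "project specification version two final"),
    ("spec_v1", "project specification version one final"),
    ("memo", "lunch menu for friday"),
]
grouped = representatives(ranked)
displayed = [rep_id for rep_id, _ in grouped]
```

In this example the redundant copy and the earlier version are folded into the group represented by "spec_v2", so only "spec_v2" and "memo" would be displayed in the retrieval result.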
However, the document clustering technology described by Kazuaki Kishida requires a considerably large amount of calculation. When the number of retrieval subjects is huge, it is not practical to classify all of the retrieval subjects beforehand because of this large amount of calculation.