1. Field of the Invention
The present invention relates to an information retrieval (IR) apparatus, IR method, and a storage medium storing a program for realizing the process.
2. Description of the Related Art
Recently, it has become increasingly popular to process and store document information using electronic appliances and storage media, and it has become common to share document information among a number of users. Normally, documents are shared using a database, which is stored in an external storage device. As the storage capacity of external storage devices has grown year by year, the volume of documents stored in databases has become enormous.
Systems used for searching such a database include a Boolean IR system, a non-Boolean IR system, and a combination of the two IR systems (hereinafter referred to as a combination system).
In the Boolean IR system, a document (or a set of documents) containing a keyword is defined as ‘true’, and a document (or a set of documents) not containing a keyword is defined as ‘false’, and a document (or a set of documents) for which the logical expression inputted as a retrieval query is ‘true’ can be specified. The retrieval query can be a logical expression obtained by connecting a plurality of keywords using logical symbols such as AND, OR, NAND, etc.
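As an illustration of the Boolean evaluation described above, the following is a minimal sketch and not part of any disclosed system. The corpus `DOCS`, the tuple-based query form, and the function names `evaluate` and `boolean_search` are assumptions introduced only for this example.

```python
# Hypothetical mini-corpus; document ids and texts are illustrative only.
DOCS = {
    "d1": "modal logic for information retrieval",
    "d2": "boolean retrieval with keyword queries",
    "d3": "fuzzy logic and ranking",
}


def evaluate(query, words):
    """Return True if a document's word set satisfies the query.

    A query is either a keyword (str) or a tuple (op, left, right)
    with op in {"AND", "OR", "NAND"} (assumed representation).
    """
    if isinstance(query, str):
        # A keyword is 'true' for a document that contains it.
        return query in words
    op, left, right = query
    l, r = evaluate(left, words), evaluate(right, words)
    if op == "AND":
        return l and r
    if op == "OR":
        return l or r
    if op == "NAND":
        return not (l and r)
    raise ValueError(f"unknown operator: {op}")


def boolean_search(query, docs=DOCS):
    """Return the sorted ids of documents for which the query is 'true'."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if evaluate(query, set(text.split()))
    )
```

A document is either returned or not returned; no ordering among the returned documents follows from the logical expression itself.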
The non-Boolean IR system is a user-friendly IR system aiming at allowing a common user to easily retrieve necessary data. Various methods have been proposed for the non-Boolean IR system. For example, a method of retrieving data through a fuzzy IR system using a multi-value logic instead of a binary logic of ‘true’ or ‘false’ (for example, the invention disclosed by the Japanese Patent Laid-Open No. 06-162101 published by the Japanese Patent Office), a method of retrieving data in natural language text using an input device for receiving natural language text, rather than a logical expression, as a retrieval query (for example, the invention disclosed by the Japanese Patent Laid-Open No. 03-130873), and a similarity retrieval device for ranking retrieval results in the natural language text and displaying them (for example, the invention disclosed by the Japanese Patent Laid-Open No. 03-172966) have been proposed. Normal ranking retrieval is classified as a non-Boolean IR system.
As a combination system, a device for generating a logical expression for a Boolean IR system from natural language text (for example, the invention disclosed by the Japanese Patent Laid-Open No. 10-134078 published by the Japanese Patent Office) is proposed.
In addition, as a method for manipulating the ranking order in a ranking IR system, an IR system which assigns a hierarchical level to a ranked document (for example, the invention disclosed by the Japanese Patent Laid-Open No. 09-153066 published by the Japanese Patent Office) has been proposed. This system analyzes a syntax called ‘functional unit’ in a user-inputted sentence, and sets up a hierarchy for each functional unit.
However, the above described conventional IR systems and the ranking IR system have the following problems.
First, in the Boolean IR system, a retrieval query is evaluated by two values ‘true’ and ‘false’, thereby applying a strict retrieval condition to the retrieval query. Therefore, it is difficult for a user to appropriately generate a retrieval query specifying a desired document (or a set of documents). There also has been the problem that a user has to be well-trained in generating the retrieval query.
In addition, in the non-Boolean IR system and the combination system, the similarity between a retrieval query and a document is determined by a system, and a user cannot easily change the similarity. To solve the problem, an IR system (for example, the invention disclosed by the Japanese Patent Laid-Open No. 07-225772 published by the Japanese Patent Office) which is provided with a device through which a user can input the weight between keywords to reflect the intention of the user in the retrieval has also been proposed. However, the final weight of keywords is determined by the similarity computation mechanism in the IR system. As a result, there is the possibility that a retrieval result deviates from the intention of the user.
Furthermore, according to the invention disclosed by the Japanese Patent Laid-Open No. 09-153066 published by the Japanese Patent Office, there has been the problem that the functional unit of a user-inputted sentence does not always match the functional unit of a relevant document.
As described above, since a retrieval query is evaluated by two values ‘true’ and ‘false’ in the Boolean IR system, the retrieval condition is strict, and a user has to be well-trained to effectively use the IR system. In addition, to solve this problem with the Boolean IR system, the non-Boolean IR system and the combination system are designed so that each system determines the similarity between a retrieval query and a document, and the user cannot easily change the ranking order of documents. Furthermore, there is the problem in the non-Boolean IR system using a natural language that the current natural language processing technology is immature, and cannot sufficiently analyze the intention of a user from the information in a natural language alone.
The above described problems with the conventional technology can be summarized as follows.
1) Since a complicated retrieval query should be generated to appropriately perform a document retrieving process in the Boolean IR system, it takes a long time for the user to become skillful in using the system. In other words, a beginner user cannot sufficiently utilize the system, and only a skilled user can effectively use the system.
2) In a simple non-Boolean IR system, the occurrence number of a keyword determines the similarity. Therefore, there is the possibility that a document not requested by the user may distort the ranking order.
3) Furthermore, the non-Boolean IR system has the following problems with user-input.
1. With a retrieval query in a natural language, detailed instructions cannot be given to the similarity computation mechanism. Therefore, the retrieval cannot be performed with the intention of a user sufficiently reflected.
2. In the IR system in which the weight between keywords is specified, it is necessary for a user to fully understand a similarity computation method used in the IR system. Therefore, a common user cannot easily use the system.
It can be recognized that the system of adding the weight between keywords cannot reflect the intention of a user because the way the weight is applied does not match the intuition of a user. That is, in the conventional system, the influence of the weight specified by a user on the similarity depends on the designer of the IR system. When the concept of the designer is different from the recognition of the user, the user cannot specify the weight of a keyword which can sufficiently reflect the intention of the user.
In addition, in a normal similarity computation mechanism, the occurrence number of a keyword is an important factor for determining the similarity. However, the mechanism is not provided with a unit for determining which has the higher similarity, a document containing a larger number of types of keywords, or a document containing a frequently occurring keyword. It is the intention of a user that determines which of these two documents is prioritized. Which of the two is prioritized therefore depends on each retrieval query and each keyword, but no IR system has been designed in consideration of this point.
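The conflict between the two criteria can be made concrete with a small sketch. The two scoring functions, the keyword list, and the sample texts below are assumptions chosen for illustration; they do not represent the similarity computation of any particular IR system.

```python
from collections import Counter


def keyword_counts(text, keywords):
    """Occurrence number of each query keyword in the text."""
    counts = Counter(text.lower().split())
    return {k: counts[k] for k in keywords}


def total_occurrences(text, keywords):
    """Score that favors a frequently occurring keyword."""
    return sum(keyword_counts(text, keywords).values())


def distinct_keywords(text, keywords):
    """Score that favors a larger number of keyword types."""
    return sum(1 for c in keyword_counts(text, keywords).values() if c > 0)


# Hypothetical query keywords and documents.
KEYWORDS = ["modal", "logic", "retrieval"]
DOC_A = "modal logic retrieval"          # three keyword types, once each
DOC_B = "logic logic logic logic gates"  # one keyword type, four times
```

Under `total_occurrences`, `DOC_B` scores higher than `DOC_A`, while under `distinct_keywords` the order is reversed, so a fixed mechanism must silently pick one of the two orderings on behalf of the user.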
The present invention aims at providing an IR system capable of describing data as correctly as the Boolean IR system without obtaining the knowledge about a complicated logical expression or knowing the designing concept of the IR system, and of easily reflecting the intention of a user in a ranking result.
Described below is each aspect of the present invention. According to the present invention, the word ‘a set of documents’ refers to plural documents, and can refer to a single document. That is, an element of a set of documents is a document. A set of documents is a set of a single or a plurality of documents. A set can be empty, but an empty set of documents indicates that no documents corresponding to a retrieval query can be found in a document database when a document database is searched.
The information retrieval device according to the first aspect of the present invention is based on the information retrieval device for retrieving a document corresponding to a user-inputted retrieval query from a document database, and includes each of the following units.
An input unit is used to input a retrieval query represented by a proposition to which a modal operator used in a modal logic is added.
A document set gathering unit searches a document database, and gathers a set of documents having the proposition of the retrieval query as ‘true’.
A similarity computation unit computes the similarity of the gathered set of documents.
A retrieval result output unit hierarchically ranks the set of documents corresponding to the inputted retrieval query according to the result of the gathering by the document set gathering unit and the result of the computation by the similarity computation unit, and outputs the ranking result.
In the information retrieval device according to the first aspect of the present invention, a modal operator which is used in a modal logic is included in the description of a retrieval query to reflect the intention of a user in the retrieval query. In addition, a retrieval query can be more easily generated using a modal operator than a retrieval query in the Boolean IR system, and the load of the user required in generating a retrieval query can be successfully reduced. In addition, using a modal operator, the user can specify the weight for a keyword based on his or her own feeling. Therefore, the user can represent his or her intention in a retrieval query, and the system can obtain the intention of the user through the retrieval query.
Furthermore, a modal operator includes a necessity symbol for assignment of the necessity concept ‘true in the entire world’ to a proposition, and a possibility symbol for assignment of the possibility concept ‘true in a certain world’ to a proposition. The user limits a set of documents to be obtained as a retrieval result using a necessity symbol, and greatly affects the ranking order using a possibility symbol.
In addition, in response to the proposition in the inputted retrieval query, the retrieval result output unit determines the position of the set of documents in the hierarchy based on the numbers of true propositions and false propositions as an evaluation reference so that each set of documents obtained by the retrieval can be hierarchically ranked and presented to the user.
Furthermore, the retrieval result output unit ranks plural sets of documents positioned in the same hierarchy in order from higher in similarity so that the sets of documents can be further ranked in each hierarchy and presented to the user. In this case, the similarity corresponds to the occurrence number of a keyword gathered in a retrieval query.
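The behavior described for the first aspect can be sketched as follows. This is only one possible reading: the text above does not give the exact rule for placing a set of documents in the hierarchy, so the sketch assumes that the level is the number of possibility-marked keywords that are true, and that the similarity is the total occurrence number of the query keywords. The names `NEC`, `POS`, and `rank` are hypothetical.

```python
from collections import Counter

# Hypothetical names for the necessity and possibility symbols.
NEC, POS = "nec", "pos"


def rank(query, docs):
    """Rank documents hierarchically for a modal retrieval query.

    query: list of (operator, keyword) pairs.
    docs:  dict mapping document id to text.
    Returns a list of hierarchy levels, best first; each level is a
    list of document ids ordered from higher to lower similarity.
    """
    required = [k for op, k in query if op == NEC]
    optional = [k for op, k in query if op == POS]
    levels = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Necessity limits the result: drop documents missing the keyword.
        if any(counts[k] == 0 for k in required):
            continue
        # Possibility affects ranking: count the true propositions (assumed rule).
        n_true = sum(1 for k in optional if counts[k] > 0)
        # Similarity: occurrence number of the query keywords (assumed rule).
        similarity = sum(counts[k] for k in required + optional)
        levels.setdefault(n_true, []).append((similarity, doc_id))
    ranked = []
    for n_true in sorted(levels, reverse=True):
        # Within a level, higher similarity first; ties are broken by id.
        members = sorted(levels[n_true], key=lambda m: (-m[0], m[1]))
        ranked.append([doc_id for _, doc_id in members])
    return ranked
```

In this sketch a document that fails a necessity-marked keyword never appears, however often the other keywords occur in it, whereas a possibility-marked keyword only moves a document between levels.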
The information retrieval device according to the second aspect of the present invention further includes, in addition to each unit in the above described information retrieval device according to the first aspect of the present invention, a common keyword extraction unit for extracting a common keyword in each document in each set of documents ranked by the retrieval result output unit.
With the configuration, the information retrieval device according to the second aspect of the present invention can extract a keyword commonly contained in all documents obtained through the retrieval, a keyword commonly contained in all documents ranking high, a keyword commonly contained in all documents ranking low, etc. through the common keyword extraction unit regardless of the keyword contained in the user-inputted retrieval query.
In addition, the retrieval result output unit can add a necessity symbol to and output a keyword commonly contained in all documents, add a possibility symbol to and output a keyword commonly contained in all documents of ranking higher order, and add a possibility symbol and a negation operator to and output a keyword commonly contained in all documents of ranking lower order. Thus, the user can compare and check the user-inputted keyword with the keyword output from the system, and can select a candidate for a keyword to be next inputted.
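The extraction and labelling of common keywords might be sketched as below. The textual notation (`[]` for the necessity symbol, `<>` for the possibility symbol, `~` for negation), the order of the possibility symbol and the negation operator, the function names, and the choice to exclude keywords common to all documents from the high-rank and low-rank lists are all assumptions made for this example.

```python
def common_keywords(doc_ids, docs):
    """Keywords contained in every listed document (empty set if none listed)."""
    word_sets = [set(docs[d].lower().split()) for d in doc_ids]
    return set.intersection(*word_sets) if word_sets else set()


def suggest_keywords(high_ids, low_ids, docs):
    """Return labelled keyword candidates for the next retrieval query.

    high_ids / low_ids: lists of ids of documents ranking high / low.
    """
    all_common = common_keywords(high_ids + low_ids, docs)
    # Assumed design choice: do not repeat keywords already common to all.
    high_only = common_keywords(high_ids, docs) - all_common
    low_only = common_keywords(low_ids, docs) - all_common
    return (
        sorted(f"[]{w}" for w in all_common)     # common to all documents
        + sorted(f"<>{w}" for w in high_only)    # common to high-ranking documents
        + sorted(f"<>~{w}" for w in low_only)    # common to low-ranking documents
    )
```

The returned list can be compared with the keywords of the original query, so that a user can pick a candidate to add to the next query.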
The information retrieval device according to the third aspect of the present invention is based on the information retrieval device for retrieving a document database corresponding to a retrieval query inputted by the user from a plurality of document databases, and includes the following units.
An input unit is used to input a retrieval query represented by a proposition to which a modal operator used in a modal logic is added.
A document set gathering unit searches a document database, and gathers a set of documents having the proposition of the retrieval query as ‘true’.
A necessity/possibility condition discrimination unit discriminates a document database satisfying a condition prescribed by a modal operator added to the proposition based on the gathering result obtained from the document set gathering unit.
In the information retrieval device according to the third aspect of the present invention, a modal operator for use in the above described modal logic is introduced to the retrieval query for use in retrieving a plurality of document databases. Therefore, a document database for special use by the user, or a relevant document database not for special use can be discriminated. The database for special use can be a document database satisfying, for example, the necessity condition that the proposition is true in all stored documents. The relevant database not for special use can be a document database satisfying, for example, the possibility condition that the proposition is true in at least one of the stored documents.
The information retrieval device according to the fourth aspect of the present invention includes, in addition to the units contained in the information retrieval device according to the third aspect of the present invention, a retrieval result output unit for adding a modal operator to and outputting the name of the document database discriminated as satisfying the above described condition by the necessity/possibility condition discrimination unit.
Therefore, in the information retrieval device according to the fourth aspect of the present invention, a user can be informed of a database for special use by the user by adding a necessity symbol to and outputting the name of the document database. Furthermore, the user can be informed of a useful relevant database not for special use by the user by adding a possibility symbol to and outputting the name of the document database.
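The discrimination of document databases described for the third and fourth aspects can be sketched as follows, assuming a single-keyword proposition and the example conditions given above (true in all stored documents for necessity, true in at least one stored document for possibility). The function names and the `[]` / `<>` notation for the two symbols are hypothetical.

```python
def holds(keyword, text):
    """The proposition is 'true' for a document that contains the keyword."""
    return keyword in text.lower().split()


def discriminate(keyword, databases):
    """Label database names according to the condition they satisfy.

    databases: dict mapping a database name to a list of document texts.
    Returns names prefixed with "[]" (necessity) or "<>" (possibility);
    databases in which the proposition is never true are omitted.
    """
    labelled = []
    for name, texts in databases.items():
        hits = [holds(keyword, t) for t in texts]
        if hits and all(hits):
            # True in all stored documents: database for special use.
            labelled.append(f"[]{name}")
        elif any(hits):
            # True in at least one stored document: relevant database.
            labelled.append(f"<>{name}")
    return labelled
```

A database satisfying the necessity condition also satisfies the possibility condition; this sketch reports only the stronger label for it.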
An IR method according to the fifth aspect of the present invention includes the following steps (a) through (d) based on the IR method for retrieving a document corresponding to a user-inputted retrieval query from a document database.
(a) inputting a retrieval query represented by a proposition to which a modal operator for use in a modal logic is added;
(b) searching a document database, and gathering a set of documents containing the proposition of the retrieval query as ‘true’;
(c) computing the similarity of the gathered set of documents;
(d) hierarchically ranking and outputting the set of documents corresponding to the inputted retrieval query based on the gathering result of the document set and the computation result of the similarity.
The IR method according to the fifth aspect of the present invention has an operation and an effect similar to those of the information retrieval device according to the first aspect of the present invention.
The IR method according to the sixth aspect of the present invention is based on the IR method for retrieving a document corresponding to a user-inputted retrieval query from a document database, and includes, in addition to the above described steps (a) through (d) of the IR method according to the fifth aspect of the present invention, a step (e) of extracting a common keyword in each document of each set of documents ranked in the above described step (d).
The IR method according to the sixth aspect of the present invention has an operation and an effect similar to those of the information retrieval device according to the second aspect of the present invention.
The IR method according to the seventh aspect of the present invention is based on the IR method for retrieving a document database corresponding to a user inputted retrieval query, and includes the following steps (a) through (c).
(a) inputting a retrieval query represented by a proposition to which a modal operator for use in a modal logic is added;
(b) searching a plurality of document databases, and gathering a document database containing the proposition of the retrieval query as ‘true’; and
(c) discriminating a document database satisfying a condition prescribed by the modal operator added to the proposition based on the gathering result obtained in the step (b).
The IR method according to the seventh aspect of the present invention has an operation and an effect similar to those of the information retrieval device according to the third aspect of the present invention.
The IR method according to the eighth aspect of the present invention includes, in addition to the above described steps (a) through (c) of the IR method according to the seventh aspect of the present invention, a step (d) of adding a modal operator to and outputting the name of the document database discriminated as satisfying the above described condition in the step (c).
The IR method according to the eighth aspect of the present invention has an operation and an effect similar to those of the information retrieval device according to the fourth aspect of the present invention.
The computer-readable storage medium according to the ninth aspect of the present invention stores a program for directing a computer to perform the process including the steps of:
(a) inputting a retrieval query represented by a proposition to which a modal operator for use in a modal logic is added;
(b) searching a document database, and gathering a set of documents containing the proposition of the retrieval query as ‘true’;
(c) computing the similarity of the gathered set of documents;
(d) hierarchically ranking and outputting the set of documents corresponding to the inputted retrieval query based on the gathering result of the set of documents and the computation result of the similarity.
The storage medium according to the ninth aspect of the present invention stores a program for realizing by a computer an operation and an effect similar to those of the information retrieval device according to the first aspect of the present invention.
The storage medium according to the tenth aspect of the present invention stores a program for directing the computer to perform, in addition to the process containing the above described steps (a) through (d) of the program stored in the storage medium according to the ninth aspect of the present invention, the step (e) of extracting a keyword common in each document in each set of documents ranked in the above described step (d).
The storage medium according to the tenth aspect of the present invention stores a program for realizing by a computer an operation and an effect similar to those of the information retrieval device according to the second aspect of the present invention.
The storage medium according to the eleventh aspect of the present invention stores a program for directing a computer to perform the process including the steps of:
(a) inputting a retrieval query represented by a proposition to which a modal operator for use in a modal logic is added;
(b) searching a plurality of document databases, and gathering a document database containing the proposition of the retrieval query as ‘true’; and
(c) discriminating a document database satisfying a condition prescribed by the modal operator added to the proposition based on the gathering result obtained in the step (b).
The storage medium according to the eleventh aspect of the present invention stores a program for realizing by a computer an operation and an effect similar to those of the information retrieval device according to the third aspect of the present invention.
The storage medium according to the twelfth aspect of the present invention stores a program for directing the computer to perform, in addition to the process containing the above described steps (a) through (c) of the program stored in the storage medium according to the eleventh aspect of the present invention, the step (d) of adding a necessity symbol to and outputting the name of the document database discriminated as satisfying the above described condition in the step (c).
The storage medium according to the twelfth aspect of the present invention stores a program for realizing by a computer an operation and an effect similar to those of the information retrieval device according to the fourth aspect of the present invention.