1. Field of the Invention
The present invention relates to a system and method for supporting information retrieval in a database. In particular, the invention is concerned with a system and method which establish a new query suitable for database retrieval on the basis of a primary query, i.e., a preliminary retrieval expression inputted by the user who executes information retrieval in accordance with that user's idea and intention, and in which actual information retrieval is executed on the basis of the new query. The configuration of the present invention makes easy and accurate information retrieval possible. More specifically, according to the system and method of the present invention, the user inputs a provisional primary query comprising a key word in accordance with the user's intention, independently of the database configuration. On the basis of the primary query thus inputted, the system of the present invention presents to the user candidates for the query to be used as retrieval conditions suitable for the database space. The user establishes a query for retrieval from among the candidates thus presented, and retrieval is executed by the query thus established.
2. Description of the Related Art
Heretofore, studies of information retrieval have actively been conducted as part of natural language processing techniques. An information retrieval system is generally modeled as in FIG. 1. In this conventional model it is presumed that, according to a broad classification, the following three gaps are present in information retrieval.
(1) Gap between the user's retrieval intention and the query (retrieval expression) transcription in the system:
This gap is a difference which occurs when the user converts his or her retrieval intention (image) into a predetermined representation form and inputs it. Since the retrieval intention is often not clear, formulating the query itself is in many cases difficult for those who are new to retrieval.
(2) Gap between the representation of query and a representation present in database:
In the retrieval system, matching is performed between the information that can be expressed by a query and the representations present in the database, but there is generally a gap between the two as well.
(3) Gap in relevance feedback conducted on the basis of the result of retrieval obtained:
Making reference to the result of retrieval outputted from the system, the user performs relevance feedback in order to approach the desired information. However, it is difficult to judge whether or not the result of retrieval is in agreement with the user's intention; further, it is not until retrieval is actually executed that the influence of a change in the query becomes clear.
Problems involved in the existing retrieval systems will be enumerated below in a corresponding relation to the above description.
A. Full Text Retrieval Based on Boolean Expression as an Example
It is presumed that the full text retrieval method will solve the above-mentioned problem (2). More particularly, any word described in a sentence can be retrieved from the description of that word itself, and hence the gap between the representation of the query and the representation present in the database is minimized. However, since this is a word-level solution, the above point (1) remains a problem for users not accustomed to the query description language.
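By way of illustration only, word-level Boolean full text retrieval of the kind described above can be sketched as follows. The function names (`build_index`, `boolean_and`, `boolean_or`) and the three sample documents are assumptions introduced for this sketch and do not appear in any of the systems referred to.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, words):
    """Return the ids of documents that contain every given word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, words):
    """Return the ids of documents that contain at least one given word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.union(*sets) if sets else set()


# Hypothetical sample documents, for illustration only.
docs = {
    1: "retrieval of documents from a database",
    2: "database index design",
    3: "natural language retrieval interface",
}
index = build_index(docs)
print(boolean_and(index, ["retrieval", "database"]))
print(boolean_or(index, ["retrieval", "database"]))
```

Because matching is performed on the words as they are written in the documents, the gap (2) is small, but the user must already know which words to combine with AND and OR, which is the gap (1).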
B. Retrieval Based on Natural Language Interface
A natural language interface has been proposed to solve the problem described in A above. This is presumed to diminish the gap of the above (1) by allowing the user to input a phrase or sentence that comes to mind directly as a query. However, the representation held in the database is not always the same as the input phrase, so if matching is attempted between the two, the gap (2) is instead increased. Since it is difficult for the user to observe what matching is performed internally, relevance feedback also becomes more difficult, that is, the problem (3) also arises.
C. Relevance Feedback Support
On the basis of the result of retrieval, certain feedback support is performed for solving the above problem (3). It is also possible to combine the above A with B. The following are mentioned as examples, which, however, cannot be regarded as satisfactory solutions.
C-1. Showing a candidate list of restricted key words to the user, allowing the user to designate a word:
Using statistical information or the like between the query and the words present in the result of retrieval, candidates for restriction such as those in FIGS. 2 and 3 are shown. Both examples are taken from actual Internet search engines. FIG. 2 shows an example of English words in Altavista (http://altavista.digital.com), in which displayed English key words are added and retrieval is rerun, whereby restriction of data is effected. FIG. 3 shows an example of Japanese words in Excite Japan (http://www.excite.co.jp), wherein a key word is selected from the additional key words presented at the upper stage and is added, thereby executing retrieval and permitting restriction of data. In Japanese Published Unexamined Patent Application No. Hei 10-74210 entitled "Document Retrieval Supporting Method and Document Retrieval Service Using Same," characteristic words are extracted on the basis of, for example, the frequency of each word appearing in a document, and the user is allowed to select a word in accordance with the extent of the user's interest therein.
As is seen also from the examples shown in FIGS. 2 and 3, as long as the method is based on simple word-level frequency or co-occurrence, an increase in the number of analogous words or adjacent nouns is unavoidable and thus it becomes difficult to show appropriate candidates. This is a problem common to the conventional systems laid open so far. Moreover, since it is impossible for the user to judge in what manner the word concerned is used in the document, it is difficult to judge whether or not the word should be selected as a retrieval word. It is also difficult to judge how the selection will be reflected in retrieval. This is presumed to be because the entire retrieval relies only on information as small in size as words.
C-2. Allowing the user to designate a document close to the user's retrieval intention from among candidates:
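The word-level co-occurrence approach of C-1 can be illustrated, purely as a sketch, as follows. The function `cooccurrence_candidates`, the counting rule (number of result documents shared with the query word), and the sample result texts are assumptions made for this illustration; they are not taken from the search engines or the application cited above.

```python
from collections import Counter


def cooccurrence_candidates(query_word, result_docs, top_n=3):
    """Rank words by the number of result documents they share with query_word."""
    counts = Counter()
    for text in result_docs:
        words = set(text.lower().split())
        if query_word in words:
            counts.update(words - {query_word})
    # Sort by descending count, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _count in ranked[:top_n]]


# Hypothetical retrieval results, for illustration only.
results = [
    "jaguar car dealer price",
    "jaguar animal habitat",
    "jaguar car engine",
]
print(cooccurrence_candidates("jaguar", results, top_n=2))
```

The candidates returned are bare words. Nothing in the list tells the user how a candidate is related to the query word in the documents, which is the difficulty pointed out above.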
An example is shown in FIG. 4. According to this configuration, as shown in the same figure, a new retrieval is executed on the basis of a feature quantity in the document designated by the user. The example shown in FIG. 4 is taken from the catalog home page retrieval of InfoNavigator (http://infonavi.infoweb.ne.jp). This system is what is called a manual catalog type system like Yahoo!, for example. Since a summary is prepared manually, it may be possible even at the summary level to judge whether or not a document should be designated. In a robot type search engine, however, merely the head of a sentence is displayed in many cases. For WWW documents, the subject matter cannot be specified and the user is in many cases not a specialist, so it is difficult to judge whether or not the page concerned should be added to feedback unless the user sees the actual page contents. In fact, the search of the robot collection pages in the above search engine lacks this function.
In Japanese Published Unexamined Patent Application No. Hei 9-153051 entitled "Analogous Document Retrieving Method" there is shown an example of relevance feedback in ranking which uses n-grams (character strings of n continuous characters). However, it is difficult to grasp how a document selected for relevance feedback will be reflected in the result obtained. In addition, it is very troublesome for the user to check the contents of each document.
Thus, using a document as a unit of feedback results in too large an object size, giving rise to such problems as an increase in the user's burden caused by the user's reading of the document and the necessity of maintaining the reliability of a document such as a summary.
A retrieval method as a combination of the above various retrieval methods is disclosed in Japanese Published Unexamined Patent Application No. Hei 6-274538 entitled "Information Retrieval System," in which the contents of understanding of the system and the contents of generation of a retrieval expression are fed back to the user in the form of a natural language sentence. However, information based on a thesaurus or a related word dictionary in connection with the relation for use in reconfiguration is eventually handled at the word level of AND and OR, so if the user's intention is different from the configuration of the dictionary, it is difficult to effect feedback.
Reference is here made to Japanese Published Unexamined Patent Application No. Hei 8-129554 entitled "Relational Representation Extracting System and Retrieving System." According to these systems, a concept-based retrieval is conducted through a relational representation extracted from a natural language to solve the foregoing problems (1) and (2) to some extent. However, for a user familiar with Boolean retrieval and unfamiliar with concept-based retrieval, it is sometimes difficult to show a concept-to-concept relation explicitly in a query. According to the method in question, specifying the relation concretely does contribute to the improvement of the relevance rate, that is, the gap is diminished. However, in the case of a simple connection using a composite word or NO ("OF"), there is little difference from Boolean retrieval and thus this method becomes less effective. Further, no effective solution to the foregoing problem (3) has been made so far.
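For reference, ranking by character n-grams against a document selected for feedback can be sketched as below. This is a generic illustration only: the Jaccard overlap used as the similarity measure and the function names are assumptions of this sketch, and the ranking actually used in the cited application is not reproduced here.

```python
def char_ngrams(text, n=2):
    """Return the set of contiguous n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets (an assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_by_feedback(selected_doc, candidates, n=2):
    """Order candidates by n-gram similarity to the document the user selected."""
    return sorted(
        candidates,
        key=lambda doc: ngram_similarity(selected_doc, doc, n),
        reverse=True,
    )


# Hypothetical document titles, for illustration only.
ranked = rank_by_feedback(
    "document retrieval",
    ["image compression", "document retrieving"],
)
print(ranked)
```

The score is computed from character fragments that carry no meaning for the user, which illustrates why it is hard to foresee how a selected document will affect the ranking.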
Thus, it has been difficult for any of the retrieving methods disclosed in the above related art documents to satisfy all of the gap reduction requirements at the various levels in the information retrieval model shown in FIG. 1.
It is an object of the present invention to solve the above-mentioned problems involved in the foregoing related art techniques and provide a configuration for diminishing the gaps at various levels in the information retrieval model shown in FIG. 1.
First, the user, without requiring strict coincidence with his or her retrieval intention, executes a provisional retrieval request called a primary query by enumerating a word or a group of words that come to mind. By this primary query method there is attained a decrease of the foregoing "(1) Gap between the user's retrieval intention and the query transcription in the system."
Upon receipt of the primary query, the system temporarily holds, as a sample space, a part of the result of retrieving from the database for the word or group of words given as the primary query. Next, for the sample space as part of the retrieval result, the system estimates a relational representation (plural words and a relation thereof) which the word or group of words of the primary query can possess, and where the relational representation partially coincides with the sample space, the system expands the query to prepare a query candidate group categorized in accordance with a predetermined standard. This configuration for synthesizing a query group on the basis of the primary query permits synthesis of a query capable of executing retrieval for data actually held in the database. In feedback retrieval, the query thus synthesized can be given as it is, as a retrieval expression, to the system. Consequently, the foregoing "(2) Gap between the representation of query and a representation present in database" can be diminished.
A representation group of the expanded query candidates is presented to the user, and the user need merely choose a relational representation candidate meeting his or her intention. Since the unit of selection is categorized with, for example, the relation between concepts as a unit, it is easy for the user to grasp the conceptual space of the object of retrieval. Thus, the foregoing "(3) Gap in relevance feedback conducted on the basis of the result of retrieval obtained" can be diminished.
Until a query candidate meeting the retrieval intention of the user is presented, the user repeats the above operations, including returning to a halfway point, and using a combination of selected query candidates, the user prepares an actual query and then conducts retrieval. Thus, according to the construction of the present invention, unlike the foregoing conventional retrieval systems, it becomes possible to effect information retrieval through query candidates matching both the concept of the user's retrieval intention and the system data space which constitutes a database. The user can therefore easily operate a conceptual space matching the database space and execute information retrieval. In this way it becomes possible to effect information retrieval while diminishing the various gaps present in the conventional information retrieval model.
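The flow just described, namely holding a sample space and synthesizing categorized query candidates from it, may be pictured with the following minimal sketch. It rests on several assumptions that are not part of the description: relational representations are modeled as (word, relation, word) triples, the "predetermined standard" for categorization is taken to be the relation type, and the names `DATABASE`, `sample_space`, and `synthesize_candidates` are hypothetical.

```python
from collections import defaultdict

# Hypothetical database: document id -> relational representations in that document.
DATABASE = {
    1: [("document", "OF", "retrieval"), ("index", "FOR", "database")],
    2: [("retrieval", "BY", "keyword")],
    3: [("image", "OF", "compression")],
    4: [("retrieval", "OF", "information"), ("user", "OF", "system")],
}


def sample_space(primary_query, database, limit=100):
    """Hold part of the retrieval result for the primary-query words as a sample."""
    words = set(primary_query)
    hits = [
        doc_id
        for doc_id, triples in database.items()
        if any(words & {left, right} for left, _relation, right in triples)
    ]
    return {doc_id: database[doc_id] for doc_id in hits[:limit]}


def synthesize_candidates(primary_query, sample):
    """Collect representations partially coincident with the query, grouped by relation."""
    words = set(primary_query)
    groups = defaultdict(list)
    for triples in sample.values():
        for left, relation, right in triples:
            candidate = (left, relation, right)
            if words & {left, right} and candidate not in groups[relation]:
                groups[relation].append(candidate)
    return dict(groups)


sample = sample_space(["retrieval"], DATABASE)
candidates = synthesize_candidates(["retrieval"], sample)
for relation, group in candidates.items():
    print(relation, group)
```

Every candidate produced in this way is taken from data actually present in the sample, so a candidate chosen by the user can be passed back to the system as a retrieval expression without further translation.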
According to the present invention, which has been accomplished for achieving the foregoing object, there is provided a document retrieval system for the execution of document retrieval, comprising a primary query designating part that designates a primary query as a provisional retrieval expression, the primary query being constituted by enumeration of arbitrary words based on the intention of a user, a query candidate synthesizing part that, on the basis of the primary query designated by the primary query designating part, synthesizes a candidate group of a query capable of being designated as a document retrieval query, and a feedback indicating part that presents the query candidate group synthesized by the query candidate synthesizing part to the user and performs a relevance feedback for establishing a query selected from the thus-presented query candidate group as a query for the execution of document retrieval.
Preferably, the document retrieval system of the present invention further comprises a database which holds relational representation data included in documents and a relation expanding/reducing part that, on the basis of the primary query, extracts relational representation data corresponding to the primary query from the relational representation data held in the database, the relation expanding/reducing part being capable of expanding and reducing a relational representation range to be extracted, and on the basis of the relational representation data extracted by the relation expanding/reducing part, the query candidate synthesizing part synthesizes a candidate group of a query capable of being designated as a document retrieval expression.
Preferably, the extracted relational representation data includes a plurality of words and also includes data showing a correlation of the plural words.
Preferably, the relation expanding/reducing part used in the document retrieval system of the present invention comprises a relation estimating part that estimates a correlation of the words constituting the primary query, an expanding part that expands the constituent elements of the primary query into a relational representation on the basis of the correlation of the words estimated by the relation estimating part, and a partial coincidence retrieving part that, on the basis of the relational representation expanded by the expanding part, extracts from the database relational representation data partially coincident with the expanded relational representation.
Preferably, the relation expanding/reducing part further comprises a sample holding part that holds sample data obtained by sampling from the database, and the extraction of the relational representation data by the partial coincidence retrieving part is executed for the sample data held by the sample holding part.
Preferably, the expanding part classifies the constituent elements of the relational representation of the primary query estimated by the relation estimating part into one or more independent words (W) and relation data (R) showing a correlation of the independent words, and determines an independent word (W) for the retrieval to be executed by the partial coincidence retrieving part, or a combination of the independent word (W) with the relation data (R), and the partial coincidence retrieving part executes a partial coincidence retrieval on the basis of the independent word (W) or combination of the independent word (W) with relation data (R) determined by the expanding part.
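The classification into independent words (W) and relation data (R), followed by partial coincidence retrieval, might look like the following sketch. The relation vocabulary `RELATION_MARKERS`, the triple format, and the function names are assumptions introduced here for illustration; the input to `classify` stands in for the output of the relation estimating part, which is not modeled.

```python
from itertools import product

# Hypothetical relation vocabulary, assumed for this sketch only.
RELATION_MARKERS = {"of", "by", "in", "for"}


def classify(elements):
    """Split the constituent elements into independent words (W) and relation data (R)."""
    w = [e for e in elements if e.lower() not in RELATION_MARKERS]
    r = [e.lower() for e in elements if e.lower() in RELATION_MARKERS]
    return w, r


def retrieval_keys(w, r):
    """Keys for partial matching: each W alone, plus each (W, R) combination."""
    keys = [(word, None) for word in w]
    keys += list(product(w, r))
    return keys


def partial_match(key, triples):
    """Return the triples containing the word and, when one is given, the relation."""
    word, relation = key
    return [
        t for t in triples
        if word in (t[0], t[2]) and (relation is None or t[1] == relation)
    ]


# Hypothetical relational representation data, for illustration only.
triples = [
    ("document", "of", "retrieval"),
    ("retrieval", "by", "keyword"),
    ("image", "of", "compression"),
]
w, r = classify(["retrieval", "of", "document"])
for key in retrieval_keys(w, r):
    print(key, partial_match(key, triples))
```

Retrieving with a W alone gives the broader result and adding an R narrows it, which corresponds in a simple way to expanding and reducing the relational representation range to be extracted.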
Preferably, the relational representation expanded by the expanding part in the document retrieval system of the present invention is a representation corresponding to relational representation data cataloged beforehand as index in the database.
According to the present invention there also is provided a document retrieval method for the execution of document retrieval, comprising a primary query designating step of designating a primary query as a provisional retrieval expression which is constituted by enumeration of arbitrary words based on the intention of a user, a query candidate synthesizing step of synthesizing a candidate group of a query capable of being designated as a document retrieval query, on the basis of the primary query designated in the primary query designating step, and a feedback step of presenting to the user the query candidate group synthesized in the query candidate synthesizing step and establishing a query selected from the thus-presented query candidate group as a query for the execution of document retrieval.
The document retrieval method of the present invention further comprises a relational representation data extracting step of extracting relational representation data corresponding to the primary query from relational representation data held in a database which holds relational representation data included in documents, and the query candidate synthesizing step synthesizes a candidate group of a query capable of being designated as a document retrieval expression, on the basis of the extracted relational representation data.
Preferably, the relational representation data extracting step comprises a relation estimating step of estimating a correlation of the words which constitute the primary query, an expansion step of expanding the constituent elements of the primary query into a relational representation on the basis of the correlation of the words estimated in the relation estimating step, and a partial coincidence retrieving step which extracts from the database relational representation data partially coincident with the relational representation expanded in the expansion step, on the basis of the expanded relational representation.
Preferably, the extraction of the relational representation data in the partial coincidence retrieving step is executed for sample data held by a sample holding part that holds sample data obtained by sampling from the database.
Preferably, the expansion step comprises a step of classifying the constituent elements of the relational representation of the primary query estimated in the relation estimating step into one or more independent words (W) and relation data (R) showing a correlation of the independent words and then determining an independent word (W) for the retrieval to be executed by the partial coincidence retrieving part, or a combination of the independent word (W) with the relation data (R), and the partial coincidence retrieving step executes a partial coincidence retrieval on the basis of the independent word (W) or combination of the independent word (W) with relation data (R) determined in the expansion step.