1. Field of the Invention
The present invention relates to document retrieval techniques for retrieving a registered document in accordance with an input query expression and displaying information of the retrieved document.
2. Description of the Related Art
In recent years, the number of electronic documents created with word processors and the like has been increasing, and it is expected to keep increasing in the future. Databases used for document retrieval are also becoming large in scale. Therefore, the set of documents obtained as a search result by document retrieval is also becoming large, and it is difficult for a user to find the really desired document among them.
In order to solve this problem, there is a ranking technique as the related art. The ranking technique is specifically described in “Ranking Algorithms”, by Donna Harman, Information Retrieval, pp. 363-392. This technique is hereinafter called “Related Art 1”. Related Art 1 provides a technique of calculating a factor which shows the possibility of being similar to the contents of a query expression (sentence, document, or a sequence of words) designated by the user. An example of the contents will be described with reference to FIG. 2.
A retrieval (or search) is realized by a simple vector operation. Each element of the vector corresponds to one of the distinct words appearing in the database, with duplicates removed (stop words and the like are excluded). In the example shown in FIG. 2, the elements are (factors, information, help, human, operation, retrieval, systems). “1” is set at the corresponding position if the query expression contains the element, and “0” is set at the corresponding position if the query expression does not contain the element. In this manner, vector Q0 of the query expression can be formed. That is, vector Q0 (1, 1, 0, 1, 0, 1, 1) is formed for query expression “human factors in information retrieval systems”.
A document vector is similarly formed for each document in the database. Vector V1 (1, 1, 0, 1, 0, 1, 0) is formed for Document 1 containing “factors”, “information”, “human” and “retrieval”. Vector V2 (1, 0, 1, 1, 0, 0, 1) is formed for Document 2 containing “factors”, “help”, “human” and “systems”. Vector V3 (1, 0, 0, 0, 1, 0, 1) is formed for Document 3 containing “factors”, “operation” and “systems”.
A score used for ranking is calculated from vector operation Vi·Q0 between vector Q0 of the query expression and vector Vi (i=1, 2, 3) of each document. The calculation results are score “4” for Document 1, score “3” for Document 2, and score “2” for Document 3. Each score represents the similarity to the query expression judged by the system. A document having a higher score has a higher possibility of being similar to the contents of the query expression.
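The binary-vector scoring described for FIG. 2 can be illustrated with the following minimal sketch. The names `VOCAB`, `to_vector` and `score` are hypothetical and chosen only for this illustration; the vocabulary order, the query expression and the three documents are taken from the example above.

```python
# Vocabulary order follows the FIG. 2 example in the text.
VOCAB = ["factors", "information", "help", "human",
         "operation", "retrieval", "systems"]


def to_vector(words):
    """Return a 0/1 vector marking which vocabulary words are present."""
    present = set(words)
    return [1 if term in present else 0 for term in VOCAB]


def score(doc_vec, query_vec):
    """Inner product of a document vector and the query vector."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))


# The stop word "in" is not part of VOCAB, so it is ignored.
Q0 = to_vector("human factors in information retrieval systems".split())

DOCS = {
    "Document 1": to_vector(["factors", "information", "human", "retrieval"]),
    "Document 2": to_vector(["factors", "help", "human", "systems"]),
    "Document 3": to_vector(["factors", "operation", "systems"]),
}

SCORES = {name: score(vec, Q0) for name, vec in DOCS.items()}
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)

print(Q0)       # query vector
print(SCORES)   # score per document
print(RANKING)  # documents in descending order of score
```

Under these assumptions the sketch reproduces the scores 4, 3 and 2 stated above and ranks Document 1 first.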
Instead of expressing the element of the vector as “1” or “0”, the element may be expressed by the weight of the word (calculated from the location frequency of the word, the location deviation of the word in the document database, or the like). For example, if the weight of “factors” is “2”, the weight of “information” is “3”, the weight of “human” is “5” and the weight of “retrieval” is “3”, then vector V′1 (2, 3, 0, 5, 0, 3, 0) can be formed for Document 1. Similarly, if the weight of “factors” is “2”, the weight of “help” is “4”, the weight of “human” is “5” and the weight of “systems” is “1”, then vector V′2 (2, 0, 4, 5, 0, 0, 1) can be formed for Document 2. Furthermore, if the weight of “factors” is “2”, the weight of “operation” is “2” and the weight of “systems” is “1”, then vector V′3 (2, 0, 0, 0, 2, 0, 1) can be formed for Document 3.
The score of each document can be calculated from vector operation V′i·Q0 between vector V′i (i=1, 2, 3) and query expression vector Q0. The calculation results are score “13” for Document 1, score “8” for Document 2 and score “3” for Document 3. Each score represents the similarity to the query expression, which is judged by the system in consideration of the weight of each word, i.e., the importance degree of the word. A document having a higher score has a higher possibility of being similar to the contents of the query expression. That is, the search result shows that Document 1 has the highest possibility of being similar to the contents of the query expression.
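The weighted scoring can be checked with a short sketch. The weighted vectors and the query vector are the ones given in the text; the helper names `weighted_score` and `contributions` are hypothetical, and the per-word breakdown is added here only to show which words produce each score.

```python
# Vocabulary order follows the FIG. 2 example in the text.
VOCAB = ["factors", "information", "help", "human",
         "operation", "retrieval", "systems"]

# Query vector Q0 and weighted document vectors as given in the text.
Q0 = [1, 1, 0, 1, 0, 1, 1]
WEIGHTED_DOCS = {
    "Document 1": [2, 3, 0, 5, 0, 3, 0],
    "Document 2": [2, 0, 4, 5, 0, 0, 1],
    "Document 3": [2, 0, 0, 0, 2, 0, 1],
}


def weighted_score(doc_vec, query_vec):
    """Inner product of a weighted document vector and the query vector."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))


def contributions(doc_vec, query_vec):
    """Map each word to the amount it adds to the score (zero terms omitted)."""
    return {term: d * q
            for term, d, q in zip(VOCAB, doc_vec, query_vec)
            if d * q != 0}


WEIGHTED_SCORES = {name: weighted_score(vec, Q0)
                   for name, vec in WEIGHTED_DOCS.items()}
BEST = max(WEIGHTED_SCORES, key=WEIGHTED_SCORES.get)

print(WEIGHTED_SCORES)
print(BEST, contributions(WEIGHTED_DOCS[BEST], Q0))
```

The breakdown makes visible that words absent from the query expression, such as “help” in Document 2, add nothing to the score even when their weight is high.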
In Related Art 1, the factor which shows the possibility of being similar to the contents of the query expression is calculated. By browsing the documents in accordance with this factor, the desired document can be found at high speed in the large-scale document database. However, whether or not a document in the search result is really the desired document must be judged by the user by actually reading the contents of the document. As a technique for supporting a quick judgement of whether or not the document obtained as the search result is really the desired document, there is the document highlighting technology, which is hereinafter called “Related Art 2”.
In Related Art 2, when the contents of the document obtained as the search result are displayed, a portion containing a character string of the query expression designated by the user is displayed in a display format (hereinafter called “a highlight”) different from that of other character string portions. The display format includes color, size, font, style (bold or roman) and the like. By displaying the portion containing the character string of the query expression in a display format different from that of other character string portions, the user can recognize at once the position containing the word. As a result, whether or not the document is the desired document can be judged faster than by reading the document from its start.
A word is often used as the element of the vector used by the ranking technique of Related Art 1. In a language such as English, in which words are written separated by delimiters, all words except stop words (such as “in” and “the”) are used as the vector elements. In a language such as Japanese, in which words are not written separated by delimiters, character strings obtained by dividing the text at changes of character type, sequences of n consecutive characters (“n” is a predetermined integer of “1” or larger), words derived with reference to a dictionary or the like, and so forth are used as the vector elements. As a result, if a document or a long sentence is designated as the query expression to execute the retrieval and the document obtained as the search result is displayed in accordance with the highlighting technology of Related Art 2, the number of character strings to be highlighted becomes large. This causes the problem that the important portions become difficult to find.
This problem will be described with reference to FIG. 3 by taking a newspaper article database as an example. In this example, a newspaper article document regarding the stadium invitation for the football World Cup is designated as the query expression to execute the retrieval.
First, character strings used for the retrieval are extracted from document “Football match stadiums for W-Cup will be determined next month, selection right attributed to Association. The organizing arrangement committee for the 2002 football world cup under the joint auspices of Japan and Korea opened on 29th, a governor/mayor meeting is held by calling special directors from fifteen local self-governing bodies which are candidates for organizing the stadium. For the number of stadiums in Japan, Federation International de Football Association (FIFA) . . . ” which is designated as the query expression. In the example shown in FIG. 3, nouns, katakana characters and gerunds, which are identified by referring to a dictionary and the like, are extracted as the character strings used for the retrieval. As a result, “football, W-Cup, match, stadium, next, month, determined, selection, right, Association, Japan, Korea, joint, auspices, world, cup, organizing, arrangement, committee, place, candidates, . . . ” are extracted from the query expression. By using these character strings, the document retrieval is executed. The factors which show the possibility of being similar to the contents of the query expression are calculated and output together with a list of the documents. In this state, the user browses the documents starting from the document which has the highest possibility of being similar to the contents of the query expression, i.e., the document having the highest score, and confirms whether or not the document is the really desired document. If the highlighting technology such as Related Art 2 is incorporated, the position containing the character string of the query expression can be confirmed at once, so that whether or not the document is the desired document can be judged faster than by reading the document from its start. However, as shown in FIG. 3, if a document or a long sentence is designated to execute the retrieval and the documents obtained as the result are displayed, the number of character strings used for the retrieval is large, and there is therefore a large number of highlighted portions (in the example shown in FIG. 3, large font sizes, roman type, and emphasis). It therefore becomes rather difficult to find the important portions.
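The effect described for FIG. 3 can be reproduced with a small sketch that highlights every extracted character string, in the manner of Related Art 2. The function name `highlight_all` and the `[[` and `]]` markers are hypothetical stand-ins for a real change of display format, and only the first ten extracted character strings and the first sentence of the query expression document are used. Matching is case-insensitive and ignores word boundaries, which is an assumption of this sketch.

```python
import re


def highlight_all(text, query_strings, open_mark="[[", close_mark="]]"):
    """Mark every occurrence of every query string; return (text, count)."""
    if not query_strings:
        return text, 0
    # Longest strings first, so a longer string wins over one it contains.
    ordered = sorted(query_strings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered),
                         re.IGNORECASE)
    return pattern.subn(
        lambda mo: open_mark + mo.group(0) + close_mark, text)


# First ten character strings extracted in the FIG. 3 example.
QUERY_STRINGS = ["football", "W-Cup", "match", "stadium", "next", "month",
                 "determined", "selection", "right", "Association"]

TEXT = ("Football match stadiums for W-Cup will be determined next month, "
        "selection right attributed to Association.")

MARKED, COUNT = highlight_all(TEXT, QUERY_STRINGS)
TOTAL_WORDS = len(TEXT.split())

print(MARKED)
print(COUNT, "of", TOTAL_WORDS, "words highlighted")
```

With most of the words in the sentence marked, the highlight no longer distinguishes the important portions, which is the problem stated above.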
It is an object of the invention to realize a document information display function which allows a user to easily judge whether or not a retrieved document is a desired document.
In order to solve the above-described problems, the invention comprises the following steps.
That is, it comprises: a document retrieval step of calculating, by a predetermined calculation method, a similarity, namely a degree of similarity between contents of a query expression designated by a user and contents of a text in a text database storing document information as character code data; and a document display step of selecting important information from the information used for calculating the similarity in the document retrieval step and displaying the selected important information.
The principle of the present invention using the above-mentioned document retrieval method will be described in the following. When a user designates a sentence or a document as the query expression in retrieving a document, the document retrieval step mentioned above is executed to calculate the similarity by the predetermined calculation method. Here, the similarity is defined as the degree of similarity between the contents of the query expression designated by the user and the contents of each text in the text database. An example of the process contents of the document retrieval step will be described. First, the predetermined character strings are extracted from the designated query expression (hereinafter called “a query expression document”). As these character strings, words are used for a language such as English, in which words are written separated by delimiters, and for other languages, character strings obtained by dividing the text at changes of character type, sequences of n consecutive characters (“n” is a predetermined integer of 1 or larger), words derived with reference to a dictionary and the like, and so forth are used. If query expression document “Football match stadiums for W-Cup will be determined next month, selection right attributed to Association. The organizing arrangement committee for the 2002 football world cup under the joint auspices of Japan and Korea opened on 29th, a governor/mayor meeting is held by calling special directors from fifteen local self-governing bodies which are candidates for organizing the stadium. For the number of stadiums in Japan, Federation International de Football Association (FIFA) . . . ” is designated, and if nouns, katakana characters and gerunds are extracted as the character strings by referring to a dictionary and the like, then as shown in FIG. 3, “football, W-Cup, match, stadium, next, month, determined, selection, right, Association, Japan, Korea, joint, auspices, world, cup, organizing, arrangement, committee, place, candidates, . . . ” are extracted. The location information of these character strings in the text database is extracted. Although the location information changes with the retrieval method to be used, the serial number of the document in which the character strings appear, the position of each location, the number of locations and the like are used. In Related Art 1, the serial number of the document which contains the character strings necessary for forming the vector of the document and the number of locations thereof are used. Next, the weight of each character string is calculated from this location information by using the predetermined calculation method. Although the calculation method changes with the retrieval method to be used, this weight is calculated by using the location frequency of each character string, the location deviation of each character string in the document database or the like. The weight calculated from the location deviation is generally an IDF (Inverse Document Frequency) described in “Ranking Algorithms” by Donna Harman, Information Retrieval, pp. 363-392. The IDF is a weight proposed based on the concept that a character string contained in many documents is highly likely to be a stop word and to have a low importance degree. The similarity is calculated by the predetermined calculation method using the location information and the weight. Although the calculation method changes with the retrieval method to be used, the simple vector operation used in Related Art 1 illustrated in FIG. 2 may be used for this calculation. The calculated similarities are displayed as the search result list.
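The idea behind the IDF weight can be sketched as follows. The text does not state which IDF formula is used, so the form log(N / df), where N is the number of documents and df is the number of documents containing the character string, is an assumption of this sketch; it is one common form among several variants. The function name `idf_weights` is hypothetical, and the three small documents are borrowed from the FIG. 2 example.

```python
import math


def idf_weights(documents):
    """Return {character string: log(N / df)} for a list of documents.

    Assumption: IDF is taken as log(N / df). Other variants exist and the
    text does not say which one applies.
    """
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        # Count each character string at most once per document.
        for term in set(words):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Documents 1 to 3 of the FIG. 2 example, as lists of character strings.
DOCUMENTS = [
    ["factors", "information", "human", "retrieval"],
    ["factors", "help", "human", "systems"],
    ["factors", "operation", "systems"],
]

IDF = idf_weights(DOCUMENTS)

for term, weight in sorted(IDF.items(), key=lambda item: item[1]):
    print(f"{term:12s} {weight:.3f}")
```

In this small collection “factors” occurs in every document and receives weight 0 under the assumed formula, while a character string that occurs in a single document, such as “help”, receives the highest weight.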
In response to the display request of the document selected in the search result list, the document display step mentioned above is executed to select the important information from the information used for calculating the similarity in the document retrieval step and display the selected important information. An example of the process contents of the document display step will be described. First, the character strings and their weights extracted from the query expression in the document retrieval step are arranged in the descending order of the weight. The upper m (“m” is a predetermined integer of 1 or larger) character strings of the arranged character strings are extracted. The value of “m” may be automatically set to a proper value by the system itself, or may be set by the user beforehand. Alternatively, the user may set and adjust the value interactively each time a document is displayed. An example of the extracted character strings is shown in FIG. 4. In the example shown in FIG. 4, “m” is set to “4”. The upper four character strings arranged in the descending order of the weight are extracted. As a result, “W-Cup”, “football”, “world cup” and “FIFA” are extracted. Next, the display format of the portion containing the extracted character string in the document designated to be displayed by the user (hereinafter called “a selected document to display”) is changed, and the selected document to display is displayed. As shown in FIG. 4, the display format of the portions containing the extracted “W-Cup”, “football”, “world cup” and “FIFA” is changed (in the example shown in FIG. 4, to large-size, roman bold fonts), and the document is displayed. The user can therefore confirm the important portions in the document at once.
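The document display step can be sketched as a selection of the upper m character strings followed by highlighting of only those strings. The numeric weights below are hypothetical, since no values are given for FIG. 4; they are chosen only so that the four character strings named in the text rank highest. The function names and the `[[` and `]]` markers, which stand in for a real change of display format, are likewise assumptions of this sketch.

```python
import re


def top_m_strings(weights, m):
    """Return the m character strings with the largest weights."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:m]]


def highlight(text, terms, open_mark="[[", close_mark="]]"):
    """Mark each occurrence of the given character strings in the text."""
    if not terms:
        return text
    # Longest strings first, so a longer string wins over one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered),
                         re.IGNORECASE)
    return pattern.sub(
        lambda mo: open_mark + mo.group(0) + close_mark, text)


# Hypothetical weights; the source gives no numeric values for FIG. 4.
WEIGHTS = {
    "W-Cup": 9.0, "football": 8.0, "world cup": 7.5, "FIFA": 7.0,
    "month": 1.2, "next": 0.8, "right": 0.5,
}

IMPORTANT = top_m_strings(WEIGHTS, 4)  # m = 4, as in the FIG. 4 example

# Hypothetical sentence from a selected document to display.
TEXT = "FIFA will decide the football stadiums for the W-Cup next month."
MARKED = highlight(TEXT, IMPORTANT)

print(IMPORTANT)
print(MARKED)
```

Low-weight character strings such as “next” and “month” still take part in the retrieval but are left unmarked, so only the portions that most affect the similarity stand out.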
In this process example, the important information mentioned above means the highlighting information on the important character strings, and the important character strings mean the upper m (“m” is a predetermined integer of 1 or larger) character strings of the character strings arranged in the descending order of the weight.
As described above, in this method, the character strings which most affect the factor showing the possibility of being similar to the contents of the query expression, e.g., a predetermined number of character strings counted from the highest weight, are selected. As the information on these character strings, the document is displayed after changing the display format of the portions containing the character strings. As a result, since only the information on the important character strings among the character strings used for the retrieval is displayed, the user can confirm the important portions in the document at once, and can quickly judge whether or not the document is the desired document. Therefore, the quality of the user interface for browsing the search result document can be improved.
A program realizing the above-described function or a storage medium storing such a program may be used in order to achieve the above-described object.