1. Field of the Invention
This invention relates to a method of retrieving data and a data retrieving apparatus.
2. Description of the Prior Art
A data retrieving apparatus for retrieving data from full text data is known. A data retrieving apparatus for retrieving data with index data is disclosed in Japanese patent application provisional publication No. 7-56943. In this data retrieving apparatus, special position marks are inserted just before and just after a predetermined character string and an index is generated.
The aim of the present invention is to provide a superior method of retrieving data and a superior data retrieving apparatus.
A first aspect of the present invention provides a method of retrieving first and second candidate data in full text data including no word separation data, comprising the steps of: (a) dividing the full text data into words and thereby generating word separation data; (b) generating and storing index data including the steps of: (c) extracting all character strings from the full text data, each character string including N characters, N being a natural number; and (d) attaching the word separation data and character position data of each of the character strings to each of the character strings to generate the index data; (e) inputting query data with segmentation indicative of leading and trailing ends of the query data; (f) detecting agreement in word retrieving, said (f) including steps of: (g) collating the query data with each of the character strings in the index data to detect character agreement; (h) collating the segmentation of the query data with the word separation data of each of the character strings to detect segmentation agreement; (i) outputting the character position data of one of the character strings showing the character agreement and the segmentation agreement; and (j) detecting agreement in character string retrieving, said (j) including steps of: (k) collating the query data with each of the N characters in the index data; and (l) outputting the character position data of one of the character strings showing only the character agreement, wherein either step (f) or step (j) is effected in accordance with a selection command and the index data is commonly used in the steps (f) and (j).
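The following is a minimal sketch of how one index might serve both retrieving modes, not the claimed implementation. It assumes the word division of step (a) has already been done and is passed in as a list of words, and it handles only a query of exactly N characters. The names `build_index` and `retrieve`, and the boolean flags `lead` and `trail`, are hypothetical.

```python
from collections import defaultdict

def build_index(words, n):
    """Return {n-gram: [(position, lead, trail), ...]} for the joined text.

    lead  is True when a word begins just before the n-gram's first character.
    trail is True when a word ends just after the n-gram's last character.
    """
    text = "".join(words)
    # Word start offsets and (exclusive) word end offsets: the word separation data.
    starts, ends, offset = set(), set(), 0
    for w in words:
        starts.add(offset)
        offset += len(w)
        ends.add(offset)
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        gram = text[pos:pos + n]
        index[gram].append((pos, pos in starts, (pos + n) in ends))
    return index

def retrieve(index, query, word_mode):
    """Look up a query of exactly n characters in the shared index."""
    hits = []
    for pos, lead, trail in index.get(query, []):
        # Word retrieving needs character AND segmentation agreement;
        # character string retrieving needs character agreement only.
        if not word_mode or (lead and trail):
            hits.append(pos)
    return hits

# Hypothetical segmented text: "art" + "part" + "art" -> "artpartart"
index = build_index(["art", "part", "art"], 3)
print(retrieve(index, "art", word_mode=True))   # [0, 7]
print(retrieve(index, "art", word_mode=False))  # [0, 4, 7]
```

In this sketch the occurrence at position 4 lies inside the word "part", so it is reported only in character string retrieving.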
Preferably, step (a) includes a step of generating the word separation data to have leading and trailing end data of each of the words; in step (h), the segmentation of the query data is compared with the leading and trailing end data of each character string; and in step (i), the position data of the first candidate data is outputted when the segmentation of the query data agrees with the leading and trailing end data of the one character string. Moreover, in this case, step (a) further includes the steps of: checking whether a first character having a first order in one of the character strings has leading and trailing ends; attaching the leading end data to the one of the character strings with respect to the first character when the first character has the leading end; attaching the trailing end data to the one of the character strings with respect to the first character when the first character has the trailing end; checking whether a second character following the first character has a trailing end; and attaching the trailing end data to the one of the character strings with respect to the second character when the second character has the trailing end.
Preferably, both the steps (f) and (j) are effected in accordance with the selection command.
Preferably, the method further comprises the steps of: dividing the query data into query character strings, each query character string including N query characters, the step (g) being executed for the query character strings to obtain collating results of the query character strings, respectively; and estimating continuity of the character strings showing the character agreement with the query character strings in accordance with the position data of the character strings showing the character agreement, the step (h) being executed with respect to the word separation data just before the first character and the word separation data just after the last character of the character strings showing the character agreement and the continuity, wherein in step (i) the position data of the first candidate data is outputted when there is the continuity and the word separation data of the first and the last characters of the character strings agrees with the segmentation of the first and the last characters of the query data.
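The continuity check for a query longer than N characters can be sketched as follows. This is an illustration under assumptions: the query is divided into overlapping N-character strings (the text does not say whether the division overlaps), the word division is supplied as a list, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

def build_index(words, n):
    """Return {n-gram: {position: (lead, trail)}} for the joined text."""
    text = "".join(words)
    starts, ends, offset = set(), set(), 0
    for w in words:
        starts.add(offset)
        offset += len(w)
        ends.add(offset)
    index = defaultdict(dict)
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]][pos] = (pos in starts, (pos + n) in ends)
    return index

def search(index, query, n, word_mode):
    """Return start positions of the query; queries shorter than n are not handled."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return []
    last = len(grams) - 1
    hits = []
    for start in sorted(index.get(grams[0], {})):
        # Continuity: the k-th query n-gram must occur k characters after the first.
        if all(start + k in index.get(g, {}) for k, g in enumerate(grams)):
            lead = index[grams[0]][start][0]              # boundary just before first character
            trail = index[grams[last]][start + last][1]   # boundary just after last character
            if not word_mode or (lead and trail):
                hits.append(start)
    return hits

# Hypothetical segmented text: "database" + "data" -> "databasedata"
index = build_index(["database", "data"], 2)
print(search(index, "data", 2, word_mode=True))   # [8]
print(search(index, "data", 2, word_mode=False))  # [0, 8]
```

Only the word separation data at the two ends of the matched run is consulted, so the occurrence of "data" at position 0, which is the head of "database", is rejected in word retrieving.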
In this case, the segmentation agreement is detected in any one of first to fifth modes in response to a mode command. In the first mode, the segmentation agreement is established when the segmentation of the first and the last characters of the query data agrees with the word separation data just before the first character and the word separation data just after the last character of the character string showing the character agreement; in the second mode, the segmentation agreement is established when the segmentation of the first and the last characters of the query data agrees with the word separation data just before the first character and the word separation data just after the last character of the character string showing the character agreement, or when the segmentation of only the first character of the query data agrees with the word separation data just before the first character of the character string showing the character agreement; in the third mode, the segmentation agreement is established when the segmentation of the first and the last characters of the query data agrees with the word separation data just before the first character and the word separation data just after the last character of the character string showing the character agreement, or when the segmentation of only the last character of the query data agrees with the word separation data just after the last character of the character string showing the character agreement; in the fourth mode, the segmentation agreement is established when the segmentation of only the first character of the query data agrees with the word separation data just before the first character of the character string showing the character agreement; and in the fifth mode, the segmentation agreement is established when the segmentation of only the last character of the query data agrees with the word separation data just after the last character of the character string showing the character agreement.
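One possible reading of the five modes is given below as a small decision function. It is a sketch under assumptions: `lead_ok` and `trail_ok` are hypothetical flags meaning that the leading end and the trailing end, respectively, agree with the word separation data; "only the first character agrees" is read as leading agreement without trailing agreement; and the fifth mode is read as the mirror image of the fourth.

```python
def segmentation_agrees(mode, lead_ok, trail_ok):
    """Decide segmentation agreement for modes 1 to 5 (one possible reading)."""
    if mode == 1:   # both ends agree
        return lead_ok and trail_ok
    if mode == 2:   # both ends agree, or only the leading end agrees
        return lead_ok
    if mode == 3:   # both ends agree, or only the trailing end agrees
        return trail_ok
    if mode == 4:   # only the leading end agrees
        return lead_ok and not trail_ok
    if mode == 5:   # only the trailing end agrees
        return trail_ok and not lead_ok
    raise ValueError("mode must be 1 to 5")

# A hit whose leading end is a word boundary but whose trailing end is not:
for mode in range(1, 6):
    print(mode, segmentation_agrees(mode, lead_ok=True, trail_ok=False))
```

Under this reading, the second and third modes reduce to a single flag because they accept both the exact case and the one-sided case.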
Preferably, the method further comprises the steps of: detecting a condition of each word in the full text data; and judging whether each word is a non-target word in retrieving in accordance with the condition. In the step (d), the word separation data is not attached to the one character string including the non-target word when one of the words is judged to be a non-target word, and the segmentation agreement is not detected when the word separation data is not attached to the one character string.
Preferably, the method further comprises the steps of: detecting a condition of each word in the full text data; and judging whether each word is a non-target word in retrieving in accordance with the condition, wherein in the step (d), the leading and trailing end data of the word separation data is not attached to each character string including the non-target word when one of the words is judged to be a non-target word, and the segmentation agreement is not detected when the word separation data is not attached to the one character string.
Preferably, the method further comprises the step of: detecting a prefix and a suffix of each word in the full text data, wherein the leading end data is not generated as the word separation data when the previous word of one of the words is a prefix and the trailing end data is not generated as the word separation data when the following word of one of the words is a suffix. In this case, the method further comprises the step of: detecting a word class of each word in the full text data to detect the prefix and the suffix.
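The suppression of end data around prefixes and suffixes might look like the sketch below. It assumes each word arrives already tagged with a word class; the tag names "prefix", "suffix" and "noun", and the function name `boundaries`, are hypothetical.

```python
def boundaries(tagged_words):
    """Return (starts, ends): offsets that receive leading / trailing end data."""
    starts, ends, offset = set(), set(), 0
    for i, (word, _word_class) in enumerate(tagged_words):
        prev_is_prefix = i > 0 and tagged_words[i - 1][1] == "prefix"
        next_is_suffix = (i + 1 < len(tagged_words)
                          and tagged_words[i + 1][1] == "suffix")
        if not prev_is_prefix:      # no leading end right after a prefix
            starts.add(offset)
        offset += len(word)
        if not next_is_suffix:      # no trailing end right before a suffix
            ends.add(offset)
    return starts, ends

# Hypothetical example: "un" (prefix) + "lock" (noun) + "able" (suffix)
starts, ends = boundaries([("un", "prefix"), ("lock", "noun"), ("able", "suffix")])
print(sorted(starts), sorted(ends))  # [0, 6] [2, 10]
```

Here "lock" (offsets 2 to 6) has neither leading nor trailing end data, so it would not be returned in word retrieving, while the whole string from offset 0 to offset 10 still has both ends marked.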
Preferably, a frequency of appearance of each word in the full text data is detected, wherein one of the words is judged to be the prefix or the suffix in accordance with the frequency.
Preferably, the method further comprises the step of: numerically evaluating the results of the steps (f) and (j). The first and second candidate data is retrieved in sets of the full text data having document identification data. The method further comprises the steps of: ordering the sets of the full text data in accordance with the results of the steps (f) and (j) of the sets of the full text data; and outputting the document identification data of the ordered full text data. In this case, both of the steps (f) and (j) are executed, and the method further comprises the step of: weighting the results of the steps (f) and (j) with different first and second coefficients, respectively.
In this case, the first and second coefficients are determined such that any set of the full text data having the lowest numerically evaluated result of the step (f) is ranked higher than any set of the full text data having the highest numerically evaluated result of the step (j).
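One way to choose such coefficients is sketched below. It assumes the numerical evaluation is a simple hit count per document and that the score is a weighted sum; both are assumptions for illustration, as are the names `rank`, `w1` and `w2`.

```python
def rank(docs):
    """docs: {doc_id: (word_hits, string_only_hits)} -> doc ids, best first."""
    w2 = 1
    # Make one word hit worth more than the largest string-only count observed.
    w1 = w2 * max((s for _, s in docs.values()), default=0) + 1
    scores = {d: w1 * w + w2 * s for d, (w, s) in docs.items()}
    # Higher score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical counts: A has 9 string-only hits, B has a single word hit.
print(rank({"A": (0, 9), "B": (1, 0), "C": (2, 3)}))  # ['C', 'B', 'A']
```

With these counts `w1` is 10, so document B with one word hit scores 10 and still outranks document A, whose nine string-only hits score 9.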
In the word retrieving mode, there may be first to fifth modes. The method may further comprise the steps of: weighting the results of the step (f) with first to third different coefficients in the first to third modes, respectively; and numerically evaluating the results of the steps (f) and (j), wherein the first and second candidate data is retrieved in sets of the full text data having document identification data. The method further comprises the steps of: ordering the sets of the full text data in accordance with the results of the steps (f) and (j) of the sets of the full text data; and outputting the document identification data of the ordered full text data. In this case, the method further comprises the steps of: inputting ordering commands for ordering the first to third modes; and generating the first to third coefficients in accordance with the ordering commands such that the one of the first to third coefficients whose mode is the most highly ordered has the highest value, another of the first to third coefficients whose mode is the lowest ordered has the lowest value, and the other of the first to third coefficients whose mode is intermediately ordered has an intermediate value. In this case, the first and second candidate data is successively retrieved in each set of the full text data having document identification data.
The method further comprises the steps of: classifying the sets of full text data into first to third groups such that the first group of the full text data includes the candidate data of the most highly ordered mode, the second group of the full text data includes the candidate data of the intermediately ordered mode but does not include the candidate data of the most highly ordered mode, and the third group of the full text data includes the candidate data of the lowest ordered mode but does not include the candidate data of the most highly ordered and intermediately ordered modes; ordering a first portion of the sets of the full text data in each of the first to third groups, group by group, in accordance with the number of pieces of the first candidate data retrieved in step (f) in the respective full text data of the first portion; and ordering a second portion of the sets of the full text data in which only the second candidate data is retrieved in step (j) in accordance with the number of pieces of the second candidate data.
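The grouping and ordering just described can be sketched with a single sort key. This is an illustration only: the per-document record layout (`"modes"` mapping a mode number to its hit count, and `"string_hits"`), the tie-break by document id, and the name `order_documents` are all assumptions.

```python
def order_documents(docs, mode_order):
    """Order document ids by group, then by hit count (descending).

    docs: {doc_id: {"modes": {mode: word_hit_count}, "string_hits": int}}
    mode_order: modes listed from most highly ordered to lowest ordered.
    """
    def key(doc_id):
        info = docs[doc_id]
        present = [rank for rank, mode in enumerate(mode_order)
                   if info["modes"].get(mode, 0) > 0]
        if present:
            # Group = best-ordered mode found; within a group, more word hits first.
            return (present[0], -sum(info["modes"].values()), doc_id)
        # Documents with only character string hits come after every group.
        return (len(mode_order), -info["string_hits"], doc_id)
    return sorted(docs, key=key)

docs = {
    "d1": {"modes": {1: 1}, "string_hits": 0},
    "d2": {"modes": {2: 5}, "string_hits": 0},
    "d3": {"modes": {1: 3, 3: 1}, "string_hits": 2},
    "d4": {"modes": {}, "string_hits": 7},
    "d5": {"modes": {3: 2}, "string_hits": 0},
}
print(order_documents(docs, mode_order=[1, 2, 3]))
# ['d3', 'd1', 'd2', 'd5', 'd4']
```

Document d2 has more word hits than d1, yet d1 is listed first because it contains a hit of the most highly ordered mode.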
The query data may include a plurality of query character strings and at least one operator indicating an operation among the plurality of query character strings, wherein in the step of ordering, each set of the full text data is ordered in accordance with each of the query character strings, the method further comprising the step of: finally ordering the sets of the full text data in accordance with the ordering results of the sets of the full text data and the operator.
A second aspect of the present invention provides a data retrieving apparatus for retrieving first and second candidate data in full text data including no word separation data, comprising: dividing means for dividing the full text data into words and thereby generating word separation data; generation and storing means for generating and storing index data including: extracting means for extracting all character strings from the full text data, each character string including N characters, N being a natural number; and attaching means for attaching the word separation data and character position data of each of the character strings to each of the character strings to generate the index data; inputting means for inputting query data with segmentation indicative of leading and trailing ends of the query data; first detecting means for detecting agreement in word retrieving including: first collating means for collating the query data with each of the character strings in the index data to detect character agreement; second collating means for collating the segmentation of the query data with the word separation data of each of the character strings to detect segmentation agreement; first outputting means for outputting the character position data of one of the character strings showing the character agreement and the segmentation agreement; and second detecting means for detecting agreement in character string retrieving including: third collating means for collating the query data with each of the N characters in the index data; and second outputting means for outputting the character position data of one of the character strings showing only the character agreement, wherein either the first detecting means or the second detecting means is operated in accordance with a selection command and the index data is commonly used in the first and second detecting means.