When performing quantitatively constrained work such as designing industrial products, synthesizing chemical substances, or prescribing drugs in medicine, search means for finding a document in which rules or guidelines including acceptable numeric ranges for the work are written is important, so that the rules or guidelines can be referred to quickly and comprehensively. In such a document, numeric ranges are not necessarily written in a fixed format such as a list of minimum and maximum values, but are written in various natural-language expressions in a plurality of parts of the document.
In the case of searching for a document using one or more numeric values or ranges as criteria, a typical keyword search system or full-text search system cannot comprehensively obtain the documents conforming to the criteria. For example, in the case where “10 g” is used as the criterion, a document in which “greater than or equal to 10 g and less than or equal to 15 g” is written can be retrieved, but a document in which “greater than or equal to 5 g and less than or equal to 20 g” is written cannot be retrieved, because the document does not include the character string “10 g” even though its numeric range conforms to the criterion.
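The gap between string matching and range matching can be illustrated with a minimal Python sketch. The document names, the hand-annotated (minimum, maximum) pairs, and the helper functions `keyword_hits` and `range_hits` are hypothetical and serve only to restate the “10 g” example.

```python
# Hypothetical corpus: each document has its text and a hand-annotated
# (minimum, maximum) range in grams. The annotation is an assumption made
# for illustration; a plain full-text index does not have it.
docs = {
    "doc_a": ("greater than or equal to 10 g and less than or equal to 15 g", (10.0, 15.0)),
    "doc_b": ("greater than or equal to 5 g and less than or equal to 20 g", (5.0, 20.0)),
}

def keyword_hits(query, docs):
    """Documents whose text contains the query as a character string."""
    return [name for name, (text, _) in docs.items() if query in text]

def range_hits(value, docs):
    """Documents whose annotated numeric range contains the value."""
    return [name for name, (_, (lo, hi)) in docs.items() if lo <= value <= hi]

print(keyword_hits("10 g", docs))  # only doc_a contains the string "10 g"
print(range_hits(10.0, docs))      # both ranges contain the value 10
```

The string search misses `doc_b` although its range conforms to the criterion, which is the shortcoming described above.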
Patent Literature (PTL) 1 discloses an index generation system that adds information for numeric range search to an index for full-text search. In this system, an exponential part of a numeric value in a document is computed and an index in which the exponential part is added to a part of the elements constituting the index is generated, thus enabling numeric range search and full-text search to be performed by the same mechanism without preparing special information (another independent index) for numeric range search.
For example, in a document including a numeric range “380 m to 760 m”, “380” and “760” are expressed as (“38”, [18 (+2)]), (“80”, [19]) and (“76”, [22 (+2)]), (“60”, [23]) using each combination of a 2-gram character string (character string of two consecutive characters), an appearance position, and an exponential part (only for the first 2-gram of each numeric value), and stored in a word index.
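The index entries in this example can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of PTL 1: the function `index_numbers` and its `base` offset parameter are hypothetical, only unsigned integers are handled, and the sample text is chosen so that, with an offset of 18, the positions coincide with those quoted above.

```python
import re

def index_numbers(text, base=0):
    """Return (2-gram, position, exponent) entries for each integer in text.

    The exponent is attached only to the first 2-gram of each number and is
    None for the remaining 2-grams. `base` is a hypothetical offset standing
    in for the text that precedes the fragment in the full document.
    """
    entries = []
    for match in re.finditer(r"\d+", text):
        digits = match.group()
        # Exponent of the leading digit, e.g. 380 = 3.8 x 10^2 gives 2.
        exponent = max(len(digits.lstrip("0")) - 1, 0)
        for i in range(len(digits) - 1):
            gram = digits[i:i + 2]
            position = base + match.start() + i
            entries.append((gram, position, exponent if i == 0 else None))
    return entries

for entry in index_numbers("380-760 m", base=18):
    print(entry)
```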
In the case of performing full-text search on this document using “greater than or equal to 387.6 m” as search criteria, “387.6” in the search criteria is divided into “38” and “7.6”. Their exponents are respectively +2 and −1. These are compared with the exponential parts in the word index, and each record with a greater or equal exponential part (because “greater than or equal to” is included in the criteria) is extracted as a candidate. Because the search criteria in the above-mentioned example include “greater than or equal to”, a value with the same exponent is a match when it is greater than or equal to the value in the search criteria. The numeric values in the above-mentioned document that are “greater than or equal to” the first value (“38”, [0 (+2)]) in the search criteria are (“38”, [18 (+2)]) and (“76”, [22 (+2)]), as a result of magnitude relationship comparison for each of the numeric value character string and the exponential part. Likewise, the values that are “greater than or equal to” the value (“7.6”) other than the first value are (“80”, [19]) and (“60”, [23]). Concatenation determination means determines, for example, that (“76”, [22 (+2)]) and (“60”, [23]) are concatenated to form the value “760”, so that “760 (m)” is recognized as a value satisfying “greater than or equal to 387.6 m”.
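The concatenation step can be sketched in Python as follows. This is a simplification and not the procedure of PTL 1: instead of comparing the search criteria with the index one 2-gram at a time, it first joins 2-grams at consecutive positions into whole values and then compares each value with the threshold. The functions `concatenate` and `at_least` are hypothetical names, and the entries are the ones quoted in the example.

```python
def concatenate(entries):
    """Join 2-grams at consecutive positions into (digit string, start) pairs.

    An entry with an exponent marks the first 2-gram of a number; an entry
    without one extends the previous number if it is directly adjacent.
    """
    numbers = []
    for gram, pos, exponent in sorted(entries, key=lambda e: e[1]):
        if exponent is not None:
            numbers.append([gram, pos, pos])   # digits, start, last position
        elif numbers and numbers[-1][2] + 1 == pos:
            numbers[-1][0] += gram[1]          # append the one new digit
            numbers[-1][2] = pos
    return [(digits, start) for digits, start, _ in numbers]

def at_least(entries, threshold):
    """Values reconstructed from the index that are >= threshold."""
    return [int(d) for d, _ in concatenate(entries) if int(d) >= threshold]

entries = [("38", 18, 2), ("80", 19, None), ("76", 22, 2), ("60", 23, None)]
print(concatenate(entries))      # the two values 380 and 760 with their positions
print(at_least(entries, 387.6))  # 760 satisfies the criteria, 380 does not
```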
In a method described in PTL 2, for a set D of numeric values of data of a specific type, a set E of numeric intervals including all elements of D is generated, and numeric values in a document are indexed by assigning 1 to an interval including an element x of D and assigning 0 to an interval not including the element x. Such an index is generated for each data type, to achieve accurate and efficient search for numeric data in the document.
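The interval encoding can be sketched in Python as follows. The half-open interval convention, the sample intervals, and the function name `encode` are assumptions made for illustration. PTL 2 is described here only as assigning 1 to an interval that includes the element x and 0 to the others.

```python
def encode(x, intervals):
    """Return a 0/1 list with 1 for each interval [lo, hi) that contains x."""
    return [1 if lo <= x < hi else 0 for lo, hi in intervals]

# Hypothetical set E of intervals covering every value in the data set D.
intervals = [(0, 10), (10, 20), (20, 30)]

print(encode(12, intervals))  # 12 lies in the second interval only
print(encode(25, intervals))  # 25 lies in the third interval only
```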
For example, consider the case of searching for data of other patients similar in condition to a patient A, where one of the search criteria is that the systolic blood pressure of the patient A is 140. Both a patient B whose systolic blood pressure is 125 and a patient C whose systolic blood pressure is 155 are equally close in blood pressure to the patient A. However, since the patient B is in an interval of normal blood pressure whereas the patients A and C are in an interval of high blood pressure, the patients B and C need to be distinguished when evaluating similarity to the patient A.
The method described in PTL 2 can improve the search accuracy in such an example, by distinguishing the numeric interval corresponding to the numeric value set of normal blood pressure and the numeric interval corresponding to the numeric value set of abnormal blood pressure (high blood pressure). A potential improvement in search efficiency is also suggested on the ground that, as a result of separating the numeric intervals, the similarity to the search criteria only needs to be evaluated for documents in which numeric values included in the corresponding numeric interval are written.
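The effect on the blood pressure example can be sketched in Python as follows. The boundary value of 130 between the normal and high intervals is a hypothetical cut-off chosen only so that 125 falls in the normal interval and 140 and 155 fall in the high interval, as in the example. The function names are likewise assumptions.

```python
BOUNDARY = 130  # hypothetical cut-off between the two intervals

def interval_of(systolic):
    """Label the interval that a systolic blood pressure value falls in."""
    return "high" if systolic >= BOUNDARY else "normal"

def similar_patients(query, patients):
    """Rank by distance only the patients in the same interval as the query."""
    target = interval_of(query)
    candidates = {name: v for name, v in patients.items()
                  if interval_of(v) == target}
    return sorted(candidates, key=lambda name: abs(candidates[name] - query))

patients = {"B": 125, "C": 155}
print(similar_patients(140, patients))  # B is excluded despite equal distance
```

Similarity is evaluated only for the patients that remain after the interval filter, which corresponds to the suggested gain in search efficiency.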