The present invention relates in general to a range-conditional character string retrieving method and system which are capable of searching or retrieving a numerical value represented by a numeric character string by comparing or collating a value of interest with a given condition in an information processing system, such as a database system, a document filing system or the like, for processing information or data which contains non-numeric data. More particularly, the present invention is concerned with a range-conditional character string retrieving method and system well suited to a full-text search of document data through a range-conditional character string retrieval technique.
Heretofore, in the search or retrieval of document data, there has been adopted, among others, a search or retrieval method which relies on additional information such as keywords, classification codes or the like. It is however difficult to express the condition for the search or retrieval in exact detail and to localize sufficiently the items or data of interest with only the aid of the keyword and the classification code. Under these circumstances, document data which are not intended by the searcher may be mixed into the result of the search or retrieval as noise, so to say. Consequently, the searcher is ultimately forced to select the document data of interest by reading the text or document directly, giving rise to a problem that the search or retrieving processing can not be conducted with satisfactory efficiency. Besides, as the amount of document data increases, the labor required for indexing, such as that involved in affixing the keyword and the classification code, increases intolerably, incurring significant delay in registration of document data. It is additionally noted that the meanings or contents of the keywords and the classification codes tend to become out of date as time passes, presenting difficulty in maintaining the database in the up-to-date state.
As an approach to solve the problems mentioned above, there has been proposed a method of collating the contents of a text in a document with keywords inputted arbitrarily by the user while scanning the text (this method will hereinafter be referred to as the full-text search). In this connection, reference may be made to R. L. Haskin and L. A. Hollaar "Operational Characteristics of a Hardware-Based Pattern Matcher", ACM Trans. on Database Systems, Vol. 8, No. 1 (1983).
FIG. 2 of the accompanying drawings shows an example of the character string searching or retrieving system which is based on the full-text search procedure. Referring to the figure, a searcher 401 designates a retrieval keyword 325 as the condition for retrieval which is to be inputted to a host computer 400. In response to the retrieval keyword 325, the host computer 400 transfers retrieval control information 321 to a character string collating circuit 313 of the character string retrieving system 300, whereupon a storage control circuit 311 of the latter is activated to read out a character string 20 to be subjected to retrieval from a character string storage unit 312 by using control information 323 and send the character string 20 to the character string collating circuit 313. In the character string collating circuit 313, the input character string 20 is collated with a specific character string set previously to serve as the retrieval control information 321, wherein upon detection of a character string in the string 20 which coincides with the specific character string, a retrieved result (i.e. result of the retrieval) indicated at 45 is sent to the host computer 400. The host computer 400 then displays document information 326 corresponding to the retrieved result 45 to the searcher 401.
At this juncture, it is noted that the full-text search is designed not only for retrieval of non-numeric character strings such as alphabetic letter strings but also for the retrieval of numeric character strings, by which numerical values falling within a specific range are retrieved at one time (referred to as the range-conditional retrieval or retrieving). Assuming, by way of example, that a range condition defined by "15 ≤ K ≤ 142" is designated, then the whole document containing descriptions related to numerical values covered by the range of "15" to "142" is subjected to the retrieval.
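By way of illustration only, the effect of the range-conditional retrieval can be sketched in software as follows. This is a minimal sketch and not the hardware collating circuit discussed in this specification: the function name `range_search` and its parameters are hypothetical, and the sketch assumes that the numerical values appear in the text as plain runs of decimal digits (signs and decimal points are not interpreted).

```python
import re

def range_search(text, low, high):
    """Return every unsigned integer found in `text` whose value lies in [low, high].

    Assumption: a numerical value is a maximal run of decimal digits.
    """
    hits = []
    for match in re.finditer(r"[0-9]+", text):
        value = int(match.group())
        if low <= value <= high:
            hits.append(value)
    return hits

# Range condition 15 <= K <= 142 applied to a short sample text.
sample = "sizes 7, 40, 142 and 143 were tested"
print(range_search(sample, 15, 142))
```

In this sketch each digit run is converted to a number and then compared, whereas the systems discussed below decide the same condition character by character without any such conversion.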
As one of methods for realizing the full-text search, there is known a method in which a finite automaton or automata are used, and a character string retrieving system capable of performing the range-conditional retrieval in the full-text search by using the finite automaton technique is disclosed in U.S. Pat. No. 4,241,402.
In a character string collating circuit 313 of the character string retrieving system disclosed in the above-mentioned U.S. Patent, the range-conditional retrieval or search is realized by resorting to the finite automaton technique. More specifically, in the full-text search, the numerical values are stored in the character string storage unit 312 together with non-numeric characters in the form of character codes, being justified to the left, wherein the character string 20 to be subjected to retrieval is inputted to the character string collating circuit (matcher) 313 on a one-by-one character basis to thereby decide if a numeric character string representing a numerical value satisfying the range condition is present in the input character string 20.
A structure of a finite automaton for realizing the range condition "15 ≤ K ≤ 142" is illustrated in FIG. 3 of the accompanying drawings for the purpose of exemplification. This automaton is so structured that a state transition thereof may occur every time one character is inputted for retrieval of the numerical value falling within the aforementioned range. Let it be assumed, for example, that a character string of ", 40," is contained in a document of concern and that the finite automaton is initially in the state 0 (zero). On this assumption, input of "," brings about no transition from the state 0, since this symbol or token represents no numerical value. However, upon inputting of "4", a state transition to the state 2 takes place. When "0" is inputted in succession, the automaton goes to the state 6. Finally, upon inputting of ",", a decision is made that the numeric character string "40" represents a numerical value which satisfies the aforementioned range condition, whereupon a transition is made to the state 0 indicated as enclosed in double circles (which is a reference symbol indicating detection of a numerical value satisfying a given range condition). The retrieved result (i.e. result of the retrieval) 45 is sent to the host computer 400.
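The character-by-character operation described above can be sketched as a table-driven automaton. The following is an illustrative sketch only: it does not reproduce the automaton of FIG. 3, its state numbers are assigned in order of creation and differ from those in the figure, and the names `build_range_automaton`, `scan` and `DEAD` are hypothetical. It assumes unsigned integers written without leading zeros and delimited by non-digit characters.

```python
DIGITS = "0123456789"
DEAD = -1  # hypothetical marker: the current digit run can no longer satisfy the range

def build_range_automaton(low, high):
    """Build a state transition table with one path per value in [low, high].

    State 0 is the idle state; every other state stands for a digit prefix.
    """
    table = {0: {}}      # state -> {digit character: next state}
    accepting = set()    # states reached after a complete in-range value
    for value in range(low, high + 1):
        state = 0
        for digit in str(value):
            if digit not in table[state]:
                new_state = len(table)
                table[state][digit] = new_state
                table[new_state] = {}
            state = table[state][digit]
        accepting.add(state)
    return table, accepting

def scan(text, table, accepting):
    """Feed `text` one character at a time and collect in-range digit strings."""
    hits = []
    state, start = 0, 0
    for pos, ch in enumerate(text + "\0"):   # sentinel delimiter closes a final number
        if ch in DIGITS:
            if state == 0:
                start = pos                  # first digit of a new numeric string
            if state != DEAD:
                state = table[state].get(ch, DEAD)
        else:
            if state in accepting:           # delimiter seen after an in-range value
                hits.append(text[start:pos])
            state = 0                        # return to the idle state
    return hits

table, accepting = build_range_automaton(15, 142)
print(scan(", 40,", table, accepting))
```

As in the walkthrough above, the leading "," leaves the automaton idle, "4" and "0" each cause one transition, and the trailing "," triggers the decision that "40" satisfies the range condition.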
In the character string retrieving system using the finite automaton such as described above, it is necessary, in order to reduce the wait or latency time, to
(1) shorten the time taken for generating the finite automaton corresponding to the given range condition and
(2) speed up the collation between the range condition and the input character string read out from the character string storage unit 312.
In the character string retrieving or searching system 300 described above, it will however be noted that when numerical values satisfying a given range condition are to be searched by the finite automaton in the character string collating circuit 313, a finite automaton has to be generated which has state transition paths (forkings) corresponding to all numerical values covered by the given range condition. This in turn means that as the digit number of the numerical value (i.e. number of the characters constituting the numeric character string representing the numerical value) increases, the finite automaton structure becomes much more complicated, presenting a problem that a lot of time is required for creation of the automaton. Besides, such a complex finite automaton requires an increased capacity of the state transition table for storing the automaton. In this connection, it will be understood that the character string collating or retrieving speed is determined by the time taken in accessing the state transition table. A state transition table of a large capacity in turn correspondingly increases the access time, as a result of which a limitation is necessarily imposed on the character string collation speed, which would be about 100 ms per character or token at the highest.
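The growth described above can be made concrete with a small counting sketch. It assumes a naive construction in which every value covered by the range condition has its own transition path, so that the number of states equals the number of distinct digit prefixes plus the initial state. This is an illustrative assumption and not a statement about the automaton of the above-mentioned U.S. Patent (a minimized automaton would need fewer states); the name `count_states` is hypothetical.

```python
def count_states(low, high):
    """Count the states of a one-path-per-value automaton for the range [low, high].

    Each distinct digit prefix of an in-range value is one state; the empty
    prefix stands for the initial state.
    """
    prefixes = {""}
    for value in range(low, high + 1):
        digits = str(value)
        for end in range(1, len(digits) + 1):
            prefixes.add(digits[:end])
    return len(prefixes)

# The state count grows roughly tenfold for each additional digit of the upper bound.
for upper in (9, 99, 999, 9999):
    print(upper, count_states(1, upper))
```

Under this assumption the state transition table, which holds one row per state, grows in the same proportion, which is the source of the creation-time and access-time problems noted above.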
Furthermore, there exists a demand for such applicability of the character string retrieving system that the range-conditional retrieval is to be carried out for a code, such as commercial product identification codes and others, in which numeric character strings and non-numeric character strings (such as alphabetic letters, Chinese characters, Japanese cursive and/or square characters) are coexistent in a character string. Assuming, for example, that the range-conditional retrieval is to be performed on a character string containing alphabetic letters affixed to numerical values, there will be required as many as twenty-six state transition paths, corresponding to "A" to "Z", respectively, making the finite automaton remarkably complicated and equally extending the time required for creation thereof, to serious disadvantage. As a result, the state transition table storing such an automaton has to be increased in capacity, to another disadvantage. Furthermore, since the character string collating speed is naturally limited by the time involved in accessing the state transition table, a limitation is imposed on the effort to increase the search or retrieving speed, to a further drawback.