The present invention relates to a method and apparatus for high-speed searching of non-structural data in a system such as a data base or a document filing system in which information including non-numeric data is processed. More particularly, the present invention relates to a symbol string search method which is suitable for full-text searching of document data in a character string search, an apparatus which realizes such a method and a semiconductor integrated circuit which is contemplated as such an apparatus.
As the storage capacity of information processing systems increases year by year, the proportion of processing devoted to non-numeric data, represented by document data, has become high. Against this background, the importance of processing which can make a high-speed and thorough search for a desired document or data from a large-capacity database is increasing.
Conventionally, a method of using supplementary information such as keywords or classification codes has been generally employed for the search of document data. However, it is difficult to represent fine search conditions strictly by means of keywords or classification codes alone, and hence it is hard to narrow the search down to a desired document or data. Accordingly, in this method, documents which are not desired by a searcher may be included as search noise. Therefore, there is a problem that the searcher must ultimately read the text directly to select the desired document data, thereby lowering the efficiency of the search processing. Further, the amount of indexing work for adding keywords or classification codes increases with the increase of document data. This delays the registration of document data. Also, a keyword or classification code may change in meaning over time and go out of date. This makes it difficult to keep the data base up to date.
In order to solve the above problems, a method has been proposed in which a collation with an arbitrarily chosen set of keywords is made while scanning the text of a document. This method will be hereinafter referred to as full-text search.
One example of a character string search system based on the full-text search has been disclosed by R. L. Haskin and L. A. Hollaar, "Operational Characteristics of a Hardware-Based Pattern Matcher", ACM Trans. on Data Base Systems, Vol. 8, No. 1, 1983.
FIG. 26 shows the character string search system disclosed in the article by Haskin et al. The character string search system 300 is connected to a host computer so that a search request 320 and the result 324 of search are communicated between the system 300 and the host computer. When the search request 320 is sent from the host computer, a search controller 310 receives the search request, analyzes it and sends search control information 321 to a term comparator 313 and a query resolver 314. The search controller 310 also controls a storage controller 311 so that character string data 322 stored in a character string storage device 312 is transferred to the term comparator 313. The term comparator 313 compares the inputted character string data 322 with a preset character string and outputs detection information 323 to the query resolver 314 when the relevant character string is detected. The query resolver 314 examines whether or not the detection information 323 satisfies a complex condition, such as a proximity between character strings, specified by the search request. In the case where the condition is satisfied, the query resolver 314 outputs identification information of the corresponding document data or the contents of the document as the result 324 of search, which is in turn sent to the host computer.
One way of performing the full-text search in the term comparator 313 is a method which uses a finite state automaton. According to this method, it is possible to make a search by scanning a text once irrespective of the number of keywords. An example of the method using the finite state automaton has been disclosed by A. V. Aho and M. J. Corasick, "Efficient String Matching", Comm. ACM, Vol. 18, No. 6, 1975.
Since the method using the finite automaton can also realize a variety of approximate searches such as a search including "don't care" characters or a search including erroneous characters, this method is a technique which is effective for the full-text search. The "don't care" character means an arbitrary character (or any character).
The term "automaton" referred to in the present specification means a machine which makes a transition from a certain state to another state (or to itself) when any given transition condition is inputted.
An algorithm for realizing the high-speed full-text search based on the finite state automaton and a means for implementation of it have been disclosed by, for example, JP-A-63-311530.
The most fundamental system for a search using the finite automaton is a perfect state transition system in which states are allotted for respective transitions associated with characters in a set character string and all possible transition paths are given between the states. In this system, transition processing for one character of inputted character string data can always be performed in one machine cycle. However, as the length of a character string increases, the number of states and hence the number of state transition paths increase. Therefore, there arises a problem that the automaton generation time becomes long.
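The trade-off of the perfect state transition system can be illustrated with a minimal Python sketch. It is not taken from the cited references: the function names `build_full_table` and `search_full_table`, the single-keyword restriction and the explicit `alphabet` argument are assumptions made for illustration only. Every state holds an entry for every character, so each input character costs exactly one table lookup, while the table itself holds (pattern length + 1) x (alphabet size) entries.

```python
def build_full_table(pattern, alphabet):
    """Build table[state][ch] -> next state for one non-empty keyword.

    Assumes every character of `pattern` is contained in `alphabet`.
    State j means "the last j input characters match pattern[:j]".
    """
    m = len(pattern)
    table = [{ch: 0 for ch in alphabet} for _ in range(m + 1)]
    table[0][pattern[0]] = 1
    restart = 0  # state reached by reading pattern[1:j] from state 0
    for j in range(1, m + 1):
        # Mismatch transitions are copied from the restart state.
        for ch in alphabet:
            table[j][ch] = table[restart][ch]
        if j < m:
            table[j][pattern[j]] = j + 1          # match transition
            restart = table[restart][pattern[j]]
    return table


def search_full_table(pattern, alphabet, text):
    """Return start positions of `pattern` in `text`, one lookup per character."""
    table = build_full_table(pattern, alphabet)
    m = len(pattern)
    state, hits = 0, []
    for i, ch in enumerate(text):
        # Characters outside the alphabet cannot extend any match.
        state = table[state].get(ch, 0)
        if state == m:
            hits.append(i - m + 1)
    return hits
```

In this sketch the cost of generating the table grows with both the keyword length and the alphabet size, which is the generation-time problem the text refers to; for a Japanese character set the alphabet would be far larger than in a two-letter example.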
So, A. V. Aho et al have proposed a sequential repetitive fail system into which the concept of a "fail processing" is introduced. The fail processing is a processing performed in the case where a mismatching occurs in a processing for judgement of the matching/mismatching of inputted character string data with a character in a set character string of interest which is to be searched out. According to this system, the number of state transition paths can be greatly reduced. However, there is a problem that it does not always follow that a transition processing for one character of the character string data can be performed in one machine cycle.
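The fail processing can be sketched in Python as follows. This is a simplified rendering of the well-known goto/failure construction, not the implementation of any cited apparatus; the names `build_automaton` and `search_with_fail`, and the step counter `max_fail_steps`, are introduced here purely for illustration. Only the transitions that spell out the keywords are stored, and on a mismatch the machine follows failure links until a usable transition is found, so one input character may take more than one step.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto (trie), failure links and outputs for a keyword set."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(kw)
    # Breadth-first traversal so that shallower failure links are final first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search_with_fail(keywords, text):
    """Return (hits, max_fail_steps); hits are (start position, keyword)."""
    goto, fail, out = build_automaton(keywords)
    s, hits, max_fail_steps = 0, [], 0
    for i, ch in enumerate(text):
        steps = 0
        while s and ch not in goto[s]:
            s = fail[s]          # fail processing: one extra step each time
            steps += 1
        s = goto[s].get(ch, 0)
        max_fail_steps = max(max_fail_steps, steps)
        for kw in out[s]:
            hits.append((i - len(kw) + 1, kw))
    return hits, max_fail_steps
```

With the keywords "abcd", "bce" and "cf" and the text "abcf", the last character triggers two successive fail steps before a transition is found, which illustrates why a fixed one-cycle-per-character rate is not guaranteed in this system.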
An anticipatory fail system disclosed by the above-mentioned JP-A-63-311530 offsets the defects of the perfect state transition system and the sequential repetitive fail system. In the anticipatory fail system, a fail processing, as the provision for the case of occurrence of a fail, is always performed concurrently with or in parallel with a usual transition processing so that the state of destination for transition is changed in the case where the occurrence of the fail is detected. This system makes it possible to generate an automaton in a relatively short time and to perform a processing for one character of character string data in one machine cycle.
However, in the conventional full-text search using the finite automaton as mentioned above, a state transition in each cycle is made with continual reference to a state transition table. Usually, the amount of data in the state transition table is so large that the state transition table is stored in a memory on a chip other than the semiconductor integrated circuit which controls the execution of the finite automaton. Therefore, there is a problem that access to the memory is required at every cycle, thereby hindering the improvement of the processing speed.
As has been mentioned above, in the document search relying on the conventional full-text search using the automaton, since a large scale state transition table is required, it is necessary to store the table in a chip other than a semiconductor integrated circuit which controls the execution of the automaton. Accordingly, there is a problem that the improvement of a processing speed cannot be expected since the input/output of data between the automaton executing means and the table memory is always made.
The problem of the input/output of data is avoided by a similar document search apparatus which uses a cellular array. Such an apparatus has been disclosed by, for example, JP-A-62-217321. In this apparatus, however, problems arise in broadcasting the input data to each cell, including a circuit delay caused by an increase in the number of character strings and a shift delay caused by an increase in the character string length. Also, when an ambiguous or approximate search, such as a search for character strings including "don't care" characters or a search for character strings including erroneous characters, is to be realized, the amount of hardware generally tends to increase. Further, since the upper limit of the word length of a settable search character string is restricted by the amount of hardware, this apparatus is inferior to the automaton system in the flexibility of the search processing.
Consider, by way of example, the case where "daiyoryo" ("large capacity" in English) is designated as a character string to be searched out and an error of one character (in Japanese) is allowed for the normal character string "daiyoryo". The designated character string "daiyoryo" consists of the three characters "dai", "yo" and "ryo". In this case, the replacement of one character, the insertion of one character and the omission of one character are allowable. Now provided that any character other than "dai" is indicated by {dai}, any character other than "yo" by {yo}, any character other than "ryo" by {ryo}, and any one character (or "don't care" character) by ?, the execution of the search allowing an error of one character for the set character string "daiyoryo" requires a search for the following nine character strings:
K1: daiyoryo
K2: daiyo{ryo}
K3: dai{yo}ryo
K4: {dai}yoryo
K5: daiyo?ryo
K6: dai?yoryo
K7: daiyo
K8: dairyo
K9: yoryo
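The way the nine character strings arise can be sketched in Python. The function names `one_error_patterns` and `render`, and the representation of a negated character as a `("not", character)` tuple, are assumptions introduced for illustration; the rendering uses the {x} and ? notation defined above. The sketch enumerates the exact string, one replacement per position, one insertion between each pair of adjacent characters, and one omission per position.

```python
def one_error_patterns(chars):
    """Enumerate the patterns that tolerate one character error.

    `chars` is the keyword split into characters, e.g. ["dai", "yo", "ryo"].
    A token is a literal character, "?" (any one character), or
    ("not", c), meaning any character other than c.
    """
    n = len(chars)
    patterns = [list(chars)]                       # exact string
    for i in reversed(range(n)):                   # replacement at position i
        patterns.append(
            [("not", c) if j == i else c for j, c in enumerate(chars)]
        )
    for i in reversed(range(1, n)):                # insertion before position i
        patterns.append(list(chars[:i]) + ["?"] + list(chars[i:]))
    for i in reversed(range(n)):                   # omission of position i
        patterns.append(list(chars[:i]) + list(chars[i + 1:]))
    return patterns


def render(pattern):
    """Write a pattern in the {x} / ? notation used in the text."""
    return "".join(
        "{" + tok[1] + "}" if isinstance(tok, tuple) else tok
        for tok in pattern
    )


if __name__ == "__main__":
    for k, p in enumerate(one_error_patterns(["dai", "yo", "ryo"]), start=1):
        print(f"K{k}: {render(p)}")
```

For a keyword of n characters this enumeration yields 1 + n + (n - 1) + n = 3n patterns, which gives nine for the three-character example.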
When a finite automaton (as shown in FIG. 29, which will be explained later) for making a search for these character strings in accordance with the full-text search is generated, the input of a character including a negation condition appears as a state transition condition. Such a state transition is called an exclusive transition. Accordingly, the ability to detect the exclusive transition condition is required. Thus, a function of setting the negation condition becomes necessary in order to realize an ambiguous or approximate search such as a search which allows erroneous characters.
The possession of the negation condition setting function makes it possible to eliminate unnecessary results of search, that is, to suppress so-called search noise. Now consider by way of example the case where a text including a character string "kinzokugenshi" ("metal" in English) or "dotai" ("conductor" in English) is searched. In the collation, however, a text including "hikinzoku" ("non-metal" in English) may be detected in connection with the partial character string "kinzoku", or a text including "handotai" ("semiconductor" in English) may be detected in connection with the partial character string "dotai". Though "hikinzoku" or "handotai" contains the set partial character string "kinzoku" or "dotai", it is another character string having a meaning different from the set partial character string. Depending on the purpose of a search, such other character strings may appear as unnecessary results of search, or so-called search noise, in the full-text search. Such search noise can be removed by using a negation condition to make a more restrictive setting of a partial character string as follows:
"kinzoku" → "{hi}kinzoku"
"dotai" → "{han}dotai"
Thereby, it is possible to prevent "hikinzoku" or "handotai" from appearing as search noise. Many examples of such a setting exist. For example, in connection with "suchishori" (numeric processing) or "teigigo" (defined word), the following setting may be made:
"suchishori" → "{hi}suchishori" (excludes "hisuchishori", non-numeric processing)
"teigigo" → "{mi}teigigo" (excludes "miteigigo", undefined word)
If the negation condition is thus set, it becomes possible to suppress the search noises.
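The noise-suppressing effect of a negation condition can be illustrated with a minimal Python sketch. It uses a naive sliding-window comparison rather than an automaton, and the names `token_matches` and `find_pattern`, the `("not", character)` tuple and the romanized token lists are all assumptions introduced for illustration. Each text is given as a list with one entry per Japanese character.

```python
def token_matches(token, ch):
    """Test one pattern token against one text character.

    A token is a literal character, "?" (any one character), or
    ("not", c), the negation condition: any character other than c.
    """
    if token == "?":
        return True
    if isinstance(token, tuple):
        return ch != token[1]
    return ch == token


def find_pattern(pattern, text):
    """Return every start position at which `pattern` matches `text`."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(token_matches(t, c) for t, c in zip(pattern, text[i:i + m]))
    ]


# Hypothetical sample texts, one list entry per character.
semiconductor = ["han", "do", "tai", "so", "shi"]
conductor = ["ryo", "do", "tai"]

plain = ["do", "tai"]                       # "dotai"
restricted = [("not", "han"), "do", "tai"]  # "{han}dotai"
```

Here `find_pattern(plain, semiconductor)` reports a hit inside "handotai", whereas `find_pattern(restricted, semiconductor)` reports none and `find_pattern(restricted, conductor)` still reports one. One caveat of this sketch is that a pattern beginning with a negation condition needs a preceding character, so a "dotai" at the very start of a text would not be reported unless the text is padded with a start marker.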
However, the conventional system is not provided with a negation condition setting function. Therefore, there is a problem that neither a search which allows a one-character error nor a restrictive search for noise suppression is possible.