1. Field of Applicable Technology
The present invention relates to an apparatus based on a finite state automaton (hereinafter abbreviated to FSA) for scanning successive characters of a text to detect specified strings of characters or other specified patterns in the text.
2. Description of the Related Art
In recent years, with the widespread use of word processors and text databases, various techniques have been developed for searching through a stored text in order to access a desired portion of the text, or to find one or more specified character strings or other specified patterns within the text. One prior art method involves the use of key words, in which respective positions of documents or document portions within the stored text are represented by corresponding key words which are known to appear in the documents or document portions. The key words, and information indicating their positions within the text, are held in a register. Thus, a user can institute an operation to search for and read out a desired portion of the text, by inputting the correct key word. However, such a method has various disadvantages, such as the work which is necessary to assign and store the key words, the increased storage capacity that is required due to the need to provide the key word register, and also the possibility of errors or delays resulting if a user uses an incorrect key word or misspells a key word.
For such reasons, a type of search referred to as a full-text search or full-text scan has received attention as an efficient technique for searching through a stored text. With a full-text search method, if for example a specific character string is to be located within the text, or if a part of the text which is known to contain a specific character string (e.g. in a header portion) is to be located and read out, then the only information which is required by the user to institute the search is that character string itself. The search is executed by successively reading out each character (represented as a binary code number) from memory, starting from the head of the text or the head of a specified part of the text, to detect the sequence of characters constituting the specified string. Successive characters of the text are stored in sequentially numbered addresses of the text memory, starting from the head of the text, so that when the specified string is detected, the position of that string within the text is known. Information such as the position of the specified string, or a portion of the text which follows the specified string, can thus be provided to the user.
The term "character" as used herein applies not only to alphanumeric or other characters, but also to punctuation symbols such as commas, and inter-word spaces, which are also expressed by respective binary code values.
With a full-text search method, a user is not restricted only to searching for specific character strings, but can also specify a combination of search conditions, and can request that only a portion of the text which satisfies all of the search conditions is to be searched for. In the case of a newspaper database for example, an example of such search conditions might be: Find a newspaper article for which the header of the article contains the word "Russia" or "Soviet" while the body of the article contains the word "Gorbachev" or "Gorbie", and the field code for the article does not contain the designation "economics".
With a full-text search apparatus, it is possible to search for a pattern, rather than for a specific string. Such a pattern may be expressed as a so-called "regular expression", for example, so that a command of the following form can be supplied to the search apparatus:
Find a part of the text which matches the search condition "a[^,.]*g".
The apparatus would respond by finding a pattern in the text such that the character "a" is immediately followed by zero or more characters, each of which is neither a comma nor a period, with those characters (if any) being immediately followed by the character "g". Such a condition would for example be satisfied by any of the strings "acting", "adopting", etc.
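By way of illustration only, the search condition described above can be tried with the regular expression facility of a general-purpose language. The following Python sketch uses the standard `re` module; the function name `find_matches` is hypothetical and is not part of any apparatus described herein.

```python
import re

# The search condition from the text: "a", then zero or more characters
# that are neither a comma nor a period, then "g".
PATTERN = re.compile(r"a[^,.]*g")

def find_matches(text):
    """Return (start_position, matched_string) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(text)]

print(find_matches("acting"))
print(find_matches("adopting"))
print(find_matches("a, g"))  # the comma after "a" prevents a match
```

The start position returned with each match corresponds to the position information which a full-text search can provide to the user.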
Such full-text searching is applicable not only to large-scale databases, but also to editing operations with a word-processor, for example to search through a text to find misspelled words. Full-text searching can be applied advantageously to text in various languages other than European languages, for example to Japanese or Chinese.
However, full-text search has had the disadvantage in the prior art that, since it is necessary to successively examine each of the characters of a text in order to find a character string or pattern which satisfies predetermined conditions, a relatively long time is required to execute a search, by comparison with a method using key words. Various proposals have been made in the prior art for techniques and algorithms to reduce that disadvantage. Some of these are summarized for example in "Access Methods for Text", by Christos Faloutsos, Computing Surveys, Vol. 17, No. 1, March 1985, and in an article "Text Searching Processor", published in a document by the Institute of Electronics, Information and Communication Engineers (EIC) Japan, December 1991, ISBN4-88552-103-3.
The present invention is concerned with a full-text search apparatus which is based on a Finite State Automaton (hereinafter referred to as FSA). Use of an FSA has advantages such as enabling a number of different character strings or patterns to be searched for at the same time, during a single pass through the text which is searched. In addition, the technique is very suitable for application to the use of regular expression search commands for designating strings or patterns that are to be matched, such as the "a[^,.]*g" example given above.
FIG. 2 illustrates the basic principles of a search apparatus which utilizes the finite state automaton technique. Prior to executing a search, a state number table is initialized by setting therein a set of numbers, referred to as state numbers, which correspond to respectively different states of a search operation, with specific relationships between the values of state numbers and the addresses (i.e. table entry locations) at which they are set. During the search operation, text characters (expressed as respective binary code values, referred to in the following as character numbers) are sequentially supplied to the apparatus, starting from the head of the text, with each character number being supplied during a fixed interval. During that interval, the character number is combined with a state number which has been read out from the state number table immediately previously, to obtain the address of the next state number.
FIG. 3 illustrates a text search example, whereby the two character strings "CAT" and "DOG" are to be searched for simultaneously. The symbol " " before a character signifies "not". There are six states, having respectively different state numbers assigned thereto. S0 denotes an initial search state, which is the default state. Prior to the search, the state number table shown in FIG. 2 is initialized in accordance with the search conditions, for example such that the state number S1 is set at a table entry whose address is a combination of the state number S0 and the character number for "C", and the state number S5 is set at a table entry whose address is a combination of the state number S3 and the character number for "T", and also at an entry whose address is a combination of state number S4 and the character number for "G".
The portion of the state number table in FIG. 2 containing the respective sets of "next state" numbers for transitions from each state in this example is illustrated in Table 1 below.
TABLE 1
__________________________________________________________________________
Current   Next state number for input character
state     A   B   C   D   E   F   G   ...  O   ...  T   U   V   W   X   Y   Z
__________________________________________________________________________
S4        S0  S0  S0  S0  S0  S0  S5  ...  S0  ...  S0  S0  S0  S0  S0  S0  S0
S3        S0  S0  S0  S0  S0  S0  S0  ...  S0  ...  S5  S0  S0  S0  S0  S0  S0
S2        S0  S0  S0  S0  S0  S0  S0  ...  S4  ...  S0  S0  S0  S0  S0  S0  S0
S1        S3  S0  S0  S0  S0  S0  S0  ...  S0  ...  S0  S0  S0  S0  S0  S0  S0
S0        S0  S0  S1  S2  S0  S0  S0  ...  S0  ...  S0  S0  S0  S0  S0  S0  S0
__________________________________________________________________________
The search operation is performed as follows. Initially, the state number S0 is generated as the current state number, and in that condition, the first character number of the text is supplied, to obtain an address for the state number table. If the first character is any character other than "C" or "D", then the address that is generated will be that of the default state number S0, which is then read out from the state number table as the next state number. That state number is then supplied as the current state number, to be combined with the next character number of the text to obtain a new address. If the next character is for example "C" then the table location of state number S1 will now be specified, and that state number will be read out as the next state number.
Any next state number to which a transition (other than a default transition) can occur from the current state number will be referred to in the following as a "success state number" with respect to the current state number. An input character which results in a transition to a success state number will be referred to as a success transition character with respect to the current state number (as opposed to a default transition character).
The above operations are successively performed until the "final success" state S5 is reached, i.e. until the state number S5 is read out from the state number table, indicating that the string "CAT" or "DOG" is found. In such a search, the state of the search operation at any point in the search is dependent upon a sequence of previous states. The search operation can be continued until all of the text characters have been examined, or the operation can, for example, be controlled such that the search is halted when the "final success" state number is read out.
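The table-driven search described above may be sketched in software as follows. This is merely an illustrative Python model, not the apparatus itself: all names (`build_table`, `search_dense`, the integer encoding of S0 to S5) are hypothetical, and the transitions out of S5 are an assumption, since Table 1 shows no row for that state.

```python
# States S0..S5 are modelled as the integers 0..5 (an assumption of this sketch).
S0, S1, S2, S3, S4, S5 = range(6)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def build_table():
    """Dense state number table: one entry per (state, character), default S0."""
    table = {(s, c): S0 for s in range(6) for c in ALPHABET}
    # Success transitions for "CAT" and "DOG", following Table 1.
    table[(S0, "C")] = S1
    table[(S0, "D")] = S2
    table[(S1, "A")] = S3
    table[(S2, "O")] = S4
    table[(S3, "T")] = S5
    table[(S4, "G")] = S5
    # Assumption: after a detection, S5 restarts in the same way as S0.
    table[(S5, "C")] = S1
    table[(S5, "D")] = S2
    return table

def search_dense(text, table):
    """Return the position of the last character of each detected string."""
    state = S0
    hits = []
    for pos, ch in enumerate(text):
        # Characters outside the modelled alphabet take the default transition.
        state = table.get((state, ch), S0)
        if state == S5:  # "final success" state reached
            hits.append(pos)
    return hits

print(search_dense("XCATYDOG", build_table()))
```

As in Table 1, every character other than a success transition character returns the search to S0, so that an input such as "CCAT" is not detected by this simplified table; a complete automaton would include additional transitions for such overlapping cases.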
As can be seen from Table 1 above, the state number table contains a large number of default (S0) entries, and only a few success state number entries. For example in the state S1, if the input text character is any character other than "A", then it is necessary that the state number S0 be read out from the state number table, i.e. it is necessary to provide respective S0 table entries for the complete range of characters other than character "A". Thus there will be a very large number of default state (S0) entries in the state number table. For simplicity, only the set of 26 upper-case alphabetic characters is shown in Table 1 above, and it can be understood that with a complete alphanumeric character set for example, the actual number of S0 table entries in the complete state number table would be considerably greater.
It will thus be clear that the greater the range of characters in the character set, the greater will be the amount of storage capacity required to implement the state number table, i.e. as a state number table memory.
FIG. 1 is a general block diagram of an example of such a prior art type of full-text search apparatus utilizing the finite state automaton technique. The text that is to be searched is stored beforehand as a set of character numbers in an input text memory 301, which can be a RAM or disk storage device, with the character numbers held in respective numbered locations. Numeral 302 denotes a state number table memory, which implements the state number table of FIG. 2 described above, and can be configured from a RAM. Part of the bits of each address of the state number table memory 302 are constituted by the character number that is currently being read out from the input text memory 301, while the remaining address bits are constituted by the current state number, which is being read out from a transition state number register 303. An operation number table memory 304 receives the current state number from the transition state number register 303 as an address, and serves to generate a corresponding operation number, which is supplied to a search control system (e.g. based on a CPU, not shown in the drawing) which functions in response to the search results to perform various control operations. The search control system might for example halt a text search when a specific operation number is outputted from the operation number table memory 304, and/or store the current address value of the input text memory 301, indicative of the search position that has been reached in the text.
The operation of the apparatus of FIG. 1 is in accordance with the basic operation described above referring to FIG. 2. Prior to beginning a search, the text characters are successively written into the input text memory 301, and the contents of the state number table memory 302 are initialized in accordance with the required search conditions. In addition, the contents of the operation number table memory 304 are initialized such that operation numbers which will be recognized by the search control system as expressing respective known statuses of the text search will be read out in response to the state numbers supplied from the transition state number register 303. During the search operation, the text character numbers are successively read out from the input text memory 301 in sequential time intervals. During each of these intervals, the address formed by the bits of the current character number and the current state number is supplied to the state number table memory 302, to thereby read out the next state number, which is held in the transition state number register 303 until the start of the succeeding interval and is then read out to become the current state number. These operations are successively executed, until the end of the text is reached or the search control system detects a specific operation number and halts the search.
With such a search apparatus, the range of possible character numbers might for example be from 0 to 65,535, while the range of state numbers could be from 0 to 8191. If it is assumed that the state number table memory 302 has 29 address bits, then the low-order 13 bits of these could be constituted by the bits of the current state number, and the high-order 16 address bits could be constituted by the bits of the current character number. Such an FSA search apparatus is described for example in the aforementioned Japanese reference document "Text Searching Processor".
However, such a prior art FSA type of full-text search apparatus has the disadvantage of requiring a large amount of capacity for the state number table memory. This is especially true in the case of a text search apparatus which operates on characters such as Japanese characters, which have a very wide range. If the character range is as high as 0 to 65,535, and if there are 8192 possible different values for the state numbers which are held in the state number table memory 302, then it becomes necessary to store more than 500 million words in the state number table memory 302. This makes it difficult for the above type of FSA search apparatus to be applied to a character set having a wide range, in which each character is expressed by a large number of binary code bits.
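The address formation and the resulting table size can be checked with a short calculation. The following Python sketch assumes the bit widths given above (16 character bits in the high-order positions, 13 state bits in the low-order positions); the names `table_address` and `TABLE_WORDS` are hypothetical.

```python
CHAR_BITS = 16    # character numbers 0..65,535
STATE_BITS = 13   # state numbers 0..8191

def table_address(char_number, state_number):
    """Form the 29-bit table address: character number in the high-order
    16 bits, state number in the low-order 13 bits."""
    assert 0 <= char_number < (1 << CHAR_BITS)
    assert 0 <= state_number < (1 << STATE_BITS)
    return (char_number << STATE_BITS) | state_number

# One table word is needed for every possible address.
TABLE_WORDS = 1 << (CHAR_BITS + STATE_BITS)
print(TABLE_WORDS)  # 65,536 x 8192 = 536,870,912 words
```

The computed figure of 536,870,912 words is consistent with the statement that more than 500 million words must be stored.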
One method which has been proposed in the prior art for overcoming the above disadvantage is to form the state number table memory with respective table sections which are assigned to the various states, for example as illustrated in FIG. 4. Each word in the table memory consists of a combination of a character number and a state number. Each table section consists of at least two of such words, with one word being for the default state (defined hereinabove), designated in FIG. 4 as S0. In each of these table sections, the address of one of the words is designated as the base address of that table section. During a text search operation, if for example the S0 state number has been determined to be the next state number, with the base address of the S0 table section (shown as S0') being supplied to the table memory, then the character number for "C" and the state number S1 would be read out from the table memory. The character number "C" is then compared with the current text character. If they are identical, then the base address for the S1 state (S1') is generated, and the character number "A" is then read out and compared with the succeeding text character. If however the current text character number is not identical to "C", then that character number is compared with that of "D". If coincidence is not found, i.e. the current text character is any character other than "C" or "D", then the base address of the default state S0 is again generated, and the above process repeated for the succeeding text character.
It can be understood that with such a system, it becomes unnecessary to store the default (S0) state number at a large number of locations in the state number table memory. Thus the required memory capacity is greatly reduced, by comparison with the direct addressing technique of FIG. 1. However, such a method has the disadvantage that it is necessary to execute successive comparisons between each input text character and the contents of a table section. Thus the speed of searching is significantly reduced, which is a severe disadvantage when a large volume of text has to be searched.
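The trade-off between memory capacity and comparison time may be illustrated by the following Python sketch of the table-section method. It is only a simplified model under stated assumptions: the names `SECTIONS`, `next_state` and `search_sections` are hypothetical, the default word of each section is represented implicitly by the fall-through return, and the section for S5 is assumed to behave like that for S0.

```python
S0, S1, S2, S3, S4, S5 = range(6)

# One table section per state; only success transitions are stored.
SECTIONS = {
    S0: [("C", S1), ("D", S2)],
    S1: [("A", S3)],
    S2: [("O", S4)],
    S3: [("T", S5)],
    S4: [("G", S5)],
    S5: [("C", S1), ("D", S2)],  # assumption: S5 restarts like S0
}

def next_state(state, ch):
    """Compare ch with each stored character of the section in turn.
    Returns (next_state, number_of_comparisons); the default is S0."""
    comparisons = 0
    for stored_char, target in SECTIONS[state]:
        comparisons += 1
        if ch == stored_char:
            return target, comparisons
    return S0, comparisons

def search_sections(text):
    """Return (end positions of detected strings, total comparisons made)."""
    state, hits, total = S0, [], 0
    for pos, ch in enumerate(text):
        state, n = next_state(state, ch)
        total += n
        if state == S5:
            hits.append(pos)
    return hits, total

print(search_sections("XCATYDOG"))
```

In this model only eight transition words are stored, but a character arriving in a state having several success transition characters requires more than one comparison, which corresponds to the reduction in search speed noted above.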
To overcome that problem of the time required for sequential comparisons, a finite state automaton full-text search method has been described in U.S. Pat. No. 4,285,049, whereby each word in the state number table memory, corresponding to one specific state, is a "state word", which is a combination of a base number and a set of indexing bits. The number of indexing bits is identical to the number of possible values of each input character number, so that for an 8-bit character number a set of 256 index bits would be required. Hence the embodiment uses sequential 4-bit nibbles for text character input, to enable the number of indexing bits to be limited to 16. All of the indexing bits are set in the `0` state, except for one or more bits which respectively correspond to the success transition character or characters, and which are each in the `1` state. During the search operation, when such a state word has been read out from the state number table memory, a judgment is made as to whether the current input character corresponds to any of the `1` state index bits. If so, then a numeric value, derived from the position of that `1` state index bit, is added to the base number of that state word, to obtain the address of the next state word.
Such a method, however, has the disadvantage that, in practice, the number of bits constituting a character number must be no greater than 4. Hence it is necessary to process an 8-bit character number as two 4-bit nibbles which are processed sequentially (with a judgment then being made as to whether both of the nibbles match those of a specific transition character). Thus, the search speed is reduced due to such sequential processing, and that problem would be made more severe in the case of characters such as Japanese characters which are expressed by 16-bit character numbers.
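The indexing-bit lookup and the nibble-wise processing may be pictured with the following Python sketch. It is not taken from U.S. Pat. No. 4,285,049: the names `next_address` and `split_nibbles` are hypothetical, and the way in which the numeric value is derived from the bit position (here, the count of `1` bits below that position) is an assumption made purely for illustration.

```python
def next_address(base, index_bits, nibble, default_address=0):
    """Look up one 4-bit nibble in a state word.

    index_bits is a 16-bit mask in which bit n is 1 if nibble value n is a
    success transition. Assumption of this sketch: the value added to the
    base number is the number of 1 bits below position n."""
    if not (index_bits >> nibble) & 1:
        return default_address  # no success transition for this nibble
    offset = bin(index_bits & ((1 << nibble) - 1)).count("1")
    return base + offset

def split_nibbles(char_number):
    """Split an 8-bit character number into (high, low) 4-bit nibbles."""
    return (char_number >> 4) & 0xF, char_number & 0xF

# An 8-bit character needs two sequential lookups, one per nibble.
high, low = split_nibbles(ord("C"))   # 0x43
print(high, low)
print(next_address(100, 0b0000000000011000, 4))
```

Each 8-bit character thus costs two table accesses, and a 16-bit character would cost four, which is the speed penalty referred to above.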