This invention relates to data processing systems and, more particularly, to such systems where characters are selected by comparing them against specified patterns of characters.
Systems that compare a string of input characters from a database against a pattern of search elements specified by a user are becoming common. The database is stored in some memory device, which may be either fully random access (such as the primary memory of a digital computer system) or serial access (such as a shift register, a magnetic disk memory system, or a bubble memory). Examples of these systems include information retrieval systems, database management systems, and pattern recognition systems. Common operations of these systems are to search for all strings of input characters stored in the database which match a given element of the pattern, or for all locations where two or more elements of the pattern occur within a specified number of characters or words, or within a specified context such as a paragraph or sentence. In addition to exact character specifications, a given element in the pattern can include special tokens which indicate that any arbitrary character arriving at that time from the database will be accepted at that location within the pattern (called a "don't-care" token), that a specified number of characters will be accepted (a fixed-length don't-care), or that an arbitrary number of characters, perhaps none, will be accepted (a variable-length don't-care). Other special tokens indicate that any character from the database of a specified type (alphabetic, numeric, punctuation, etc.) will be accepted in that location.
In most cases, these systems are implemented as programs on general purpose digital computer systems, with the database being compared against the specified pattern using the standard arithmetic and logical operations of the digital computer. Often, however, the time necessary to execute the required instructions to examine each character in the database is so long that efficient matching is not possible, and more complicated storage indexing methods must be employed. While providing adequate performance, these indexing methods limit searching to information matching the items indexed, such as keywords.
In addition to programs running on conventional digital computer systems, other approaches to matching characters in a database against specified patterns using specialized digital machines or processors have been proposed. These can generally be grouped into two broad classes: associative memory systems, and character recognizers attached to conventional memory systems (such as disk or random access memories).
Associative memory systems employ comparison logic at each location within the memory, each capable of comparing the contents of its particular memory word against a search pattern provided as an input to the associative memory system. Systems employing a number of similar comparators, each programmed to recognize a specified word, can be regarded as a special form of associative memory, which stores the pattern elements in the associative memory and passes the information of the database against them for comparison, rather than storing the database in the associative memory and passing the pattern elements against the database. If the data in an associative memory word and the pattern provided to the associative memory compare according to the relationship specified by the system user (generally equality or one larger than the other), the comparison circuitry for that memory location sets a flag bit to indicate the success of the operation. Further matching can be over the entire associative memory or only those words with their flag bits set or cleared. Additional facilities are generally present to allow the matching of specified portions of the words, rather than the whole words, allowing the word to be subdivided into fields of information at fixed locations within the word.
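The flag-bit and field-matching behavior described above can be illustrated with a short software sketch. It is only a model under stated assumptions: the name `associative_search`, the use of an integer bit mask to select a field, and the restriction to equality comparison are illustrative choices and are not taken from any particular associative memory design. Real hardware performs the comparisons at every word in parallel, whereas this sketch loops over the words.

```python
def associative_search(words, pattern, mask, flags=None):
    """Return one flag bit per stored word (hypothetical helper).

    A word is flagged when the bits selected by `mask` equal the
    corresponding bits of `pattern`. If `flags` from an earlier search is
    supplied, only words whose flag is already set can remain flagged.
    """
    if flags is None:
        flags = [True] * len(words)
    return [flag and (word & mask) == (pattern & mask)
            for word, flag in zip(words, flags)]


# Four 16-bit words, each treated as two 8-bit fields at fixed locations.
WORDS = [0x1234, 0x12FF, 0xAB34, 0x1234]

# First search: high field equal to 0x12, over the entire memory.
first = associative_search(WORDS, 0x1200, 0xFF00)

# Second search: low field equal to 0x34, only over words flagged so far.
second = associative_search(WORDS, 0x0034, 0x00FF, flags=first)
```

Because the comparison is between corresponding bit positions of the pattern and each stored word, a model of this kind has no natural way to express a don't-care of variable length.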
Associative memory systems suffer from three deficiencies which make their use in the retrieval of information based on complex patterns from large databases impractical. First, the cost of each bit of memory is high because each must include complex comparison logic, so that only small associative memories are economically viable. Second, the associative memory cannot efficiently handle variable length don't-care operations within a search pattern, especially when the don't-care is embedded within a word, since the comparisons performed by the associative memory are between corresponding bit positions of the pattern and stored words. Finally, the associative memory can only handle fixed length words, so either a maximum word size must be specified or extra processing is necessary to break long words into sizes matching the selected word length of the associative memory.
The attached character recognizer approach, as discussed by Hollaar in the article "Text Retrieval Computers" in the March 1979 issue of Computer, is based on augmenting a conventional memory system, either random access or serial access, with a special match processor which sequentially examines a string of input characters from the database and determines if they match the pattern specified by the user. The simplest implementation is to use a dedicated processor, similar to a conventional general purpose digital computer, to search the database for the patterns. This, of course, suffers from many of the problems of using the host computer system for matching. Generally, this matching consists of comparing characters against one element of the specified pattern until either a match of the element is found or a mismatch of an element's character occurs, at which time another element is searched for, starting at the last position where it was potentially matched. Thompson, in U.S. Pat. No. 3,568,156, has proposed an algorithm which eliminates this need to backtrack to the location of last possible match by effectively comparing all possible pattern elements against the characters from the database at the same time. His technique consists of dynamically creating a list of possible next pattern tokens of interest as comparisons are being done, and then checking them and creating a new list of possible next tokens. This requires a memory device to store the list of next tokens, and the algorithm fails if more next tokens are generated than the memory can accommodate. In contrast, it will be seen that the present invention operates with statically allocated state tables, allowing the potential inability to continue the matching process due to lack of resources to be determined before any matching begins, so that corrective measures can be taken. 
In addition, the present invention performs all necessary comparisons simultaneously, rather than sequencing through a previously constructed list of tokens, providing a uniform processing time for each character from the database, which can easily be matched to the character delivery rate of the memory system storing the database.
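The list-of-next-tokens technique described above, and the way it can fail for lack of memory, can be illustrated with the following sketch. It is a simplified model written for this discussion, not the algorithm as set out in the cited patent: the names `search_with_token_list` and `TokenListOverflow`, the representation of a next token as a (pattern index, position) pair, and the `capacity` parameter are all assumptions.

```python
class TokenListOverflow(Exception):
    """Raised when more next tokens are generated than the list can hold."""


def search_with_token_list(patterns, text, capacity):
    """Match all patterns at once with a dynamically built next-token list.

    Returns (position, pattern) pairs, where position is the index of the
    last character of each match.
    """
    hits = []
    active = []  # next tokens of interest, as (pattern index, position)
    for pos, ch in enumerate(text):
        # Every pattern may also begin at the current character.
        candidates = active + [(i, 0) for i in range(len(patterns))]
        new_list = []
        for i, j in candidates:  # sequenced one entry at a time
            if patterns[i][j] == ch:
                if j + 1 == len(patterns[i]):
                    hits.append((pos, patterns[i]))
                else:
                    new_list.append((i, j + 1))
        if len(new_list) > capacity:
            raise TokenListOverflow(f"list overflow at character {pos}")
        active = new_list
    return hits
```

With the patterns `["DOG", "DOT"]` and the text `"DODOG DOT"`, a capacity of 2 is sufficient, while a capacity of 1 fails as soon as the first `D` is read. The failure depends on the data and so is discovered only during matching, and the work per character grows with the length of the list.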
Another means of implementing an attached character recognizer to augment a conventional memory system utilizes a number of cellular comparators, each capable of matching a single token of the pattern, such as a specified single character or a specific pattern element (such as a don't-care token). This method was separately discussed by Copeland in "String Storage and Searching for Data Base Applications: Implementation of the INDY Backend Kernel," Proceedings of the Third Non-Numeric Workshop, Syracuse, 1978, and Mukhopadhyay in "Hardware Algorithms for Non-numeric Computation," Proceedings of the Fifth Symposium on Computer Architecture, Palo Alto, 1978. A number of these cellular comparators can be connected to match a multicharacter pattern. For example, if the string DOG is to be matched, three cells would be necessary, one to match each token in the pattern. The cells are connected together serially, with each cell indicating to its successor that a successful match has occurred up to that point, and that it should examine the input database character against the token stored in the cell and signal whether the match continues to be successful or not. For more complex patterns, logical elements such as AND or OR gates can be included in the connections between cells. This approach suffers from two problems. The first is that the current character from the database must be broadcast to all cells in the system, imposing excessive line driver requirements. This can be solved by the so-called systolic process suggested by Kung (see "The Design of Special-Purpose VLSI Chips" by Foster and Kung, Computer, January 1980), where the characters from the database are shifted through the array of cells serially in one direction, with the pattern being matched possibly shifted serially in the other direction, eliminating the broadcast requirement but limiting the searching to simple matching of a single pattern.
The second difficulty is that, if complex patterns are to be matched, an elaborate connection matrix is required to connect the cells and other logic elements. If this is to be done dynamically in response to a specified set of search pattern elements requiring a large number of cells, the cost and complexity of the connection matrix can easily surpass that of the cells.
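The serial chain of cells described above for the string DOG can be modeled in a few lines. This is a sketch under stated assumptions rather than a description of the cited designs: each cell is modeled as a stored token plus a one-bit latch, the first cell is treated as always enabled, and the names `run_cells` and `ANY` (a single-character don't-care token) are hypothetical.

```python
ANY = None  # hypothetical marker for a single-character don't-care token


def run_cells(tokens, text):
    """Simulate a serial chain of comparator cells, one cell per token.

    Returns the positions at which the last cell signals a match.
    """
    latches = [False] * len(tokens)
    hits = []
    for pos, ch in enumerate(text):
        # Each cell sees its predecessor's previous output; the first cell
        # is always enabled so that a match may begin at any character.
        enables = [True] + latches[:-1]
        # The current character is broadcast to every cell at once.
        latches = [enabled and (token is ANY or token == ch)
                   for enabled, token in zip(enables, tokens)]
        if latches[-1]:
            hits.append(pos)
    return hits
```

For example, `run_cells("DOG", "DOG DDOG")` reports matches ending at positions 2 and 7. Every cell consumes the same broadcast character on every step, which is the source of the line driver problem noted above, and a pattern more complex than a single string would need additional gating between the cells that this linear chain does not model.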
A third implementation for an attached character recognizer for a database stored in a conventional memory system is based on a finite state automaton (FSA), such as described generally in Sequential Machine and Automata Theory by Booth, published by Wiley in 1978, which is a simple table-driven digital processor. In its most basic implementation, the FSA has an arbitrary number of states, or conditions, based on the history of past input characters, arranged in the form of tables with the table segment for each state having an entry for each character in the FSA's input alphabet which includes the next state for the FSA if that input character is received from the database. The FSA transitions from one state to the next state specified by the state table until a state marked as a match is found, at which time a signal is generated to notify the user of the match condition. The state table can be configured so that all elements of a user-specified pattern are being searched for at the same time. However, in this basic form an extremely large memory is required to hold the necessary state table (if the machine has an input alphabet containing N characters and if M states are necessary, then the resulting memory size will be N × M words of log₂ M bits each) even though most entries stored in the state table memory are not of interest and specify a return to the state which begins the match for the start of a pattern element. Storing only the state transitions of interest for each state in a condensed state table memory and searching them to see if one of them or the default transition should occur reduces this memory requirement, at the expense of longer processing time for each character from the database memory. For a database stored on a magnetic disk, which delivers characters at a fixed rate, this may require extensive buffering of the data between the memory and the match machine.
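The memory cost of the basic table-driven FSA can be made concrete with a small sketch. The table construction used here is the standard one for a single-string matching automaton (as in Knuth-Morris-Pratt matching) and is an assumption chosen for brevity; the names `build_dense_table` and `run_fsa`, the 27-character alphabet, and the pattern DOG are likewise illustrative.

```python
import math


def build_dense_table(pattern, alphabet):
    """Build a full state table: one next-state entry per state per character.

    Every character of `pattern` must appear in `alphabet`.
    """
    m = len(pattern)
    table = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    table[0][pattern[0]] = 1
    restart = 0  # state the FSA would be in had the match started one later
    for state in range(1, m + 1):
        table[state] = dict(table[restart])  # mismatch entries
        if state < m:
            table[state][pattern[state]] = state + 1  # transition of interest
            restart = table[restart][pattern[state]]
    return table


def run_fsa(table, text):
    """One table lookup per input character; returns match end positions."""
    match_state = len(table) - 1
    state, hits = 0, []
    for pos, ch in enumerate(text):
        state = table[state][ch]
        if state == match_state:
            hits.append(pos)
    return hits


ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # N = 27 input characters
TABLE = build_dense_table("DOG", ALPHABET)  # M = 4 states

words = len(ALPHABET) * len(TABLE)                   # N x M words
bits_per_word = math.ceil(math.log2(len(TABLE)))     # log2 M bits each
returns_to_start = sum(1 for row in TABLE for nxt in row.values() if nxt == 0)
```

For this three-character pattern the table holds 108 words of 2 bits each, and 102 of those entries simply return the machine to its starting state, which illustrates why most of a full state table is not of interest.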
Bird, et al., in Reports R77-002 and R77-008 from Operating Systems, Inc. (21031 Ventura Blvd., Woodland Hills CA 91364) have suggested an approach which recognizes that most states have only one transition of interest, with all other transitions to a default state. These they called "sequential states"; such states require only the specification of the database character of interest when the FSA is in that state, since the transition will either be to the state stored in the next location in the state table memory or to the default state. A second class of states, called "index states", corresponds to states with transitions to more than one state other than the default state, based on the input character from the database. These require a more complex representation in the state table, indicating where in a table of next state addresses the next state's address is stored. Unfortunately, if any pattern element contains an initial don't-care token, all states must become index states, making the technique simply another implementation of a standard FSA with a fast lookup method for the next state's address.
An alternative to the simple finite state automaton and its various implementations, such as suggested by Bird, is a non-deterministic finite state automaton (NFSA), such as described in Introduction to Automata Theory, Languages, and Computation by Hopcroft and Ullman, Addison-Wesley 1979. While a standard FSA can be only in a single state at a given time, an NFSA can be in a number of different states during each time period. Since a standard FSA is simply a special case of the NFSA, with the number of simultaneous states being limited to one, the NFSA is capable of performing any operation which can be performed by the standard FSA, including pattern recognition on characters from a database. The easiest way to produce an NFSA is to replicate a number of standard FSA's, with an idle FSA started every time the possible beginning of a pattern element is recognized. (In the literature of finite state automata, this is sometimes referred to as having the NFSA "make a duplicate copy of itself.") While the required FSA's for this implementation of an NFSA are slightly smaller, since some state transitions occurring when a wrong path is taken when matching a pattern are not necessary, the extra hardware required to dynamically schedule the FSA's would outweigh this difference. Furthermore, either each FSA must have its own copy of the state table in a memory located in it or a memory containing a common copy of the state table must be shared among the various FSA's (requiring a faster memory), each increasing the cost of the NFSA over that of a standard FSA performing the same operations. Because of this, the non-deterministic finite state automaton has remained only a theoretical concept, although the present invention is based on the basic NFSA principle of being in more than one state at a given time.
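The basic NFSA principle of being in more than one state at a given time can be illustrated in software by tracking a set of active states rather than a single state. The sketch below is only a model of that principle and does not represent the structure of the present invention: states are taken to be positions within a list of tokens, and the names `nfsa_search`, `closure`, `ANY` (single-character don't-care) and `STAR` (variable-length don't-care) are hypothetical.

```python
ANY = object()   # hypothetical token: accepts exactly one arbitrary character
STAR = object()  # hypothetical token: accepts any number of characters, or none


def closure(tokens, states):
    """Extend a state set past STAR tokens, which may match zero characters."""
    result = set()
    for s in states:
        result.add(s)
        while s < len(tokens) and tokens[s] is STAR:
            s += 1
            result.add(s)
    return result


def nfsa_search(tokens, text):
    """Return the positions at which a match of `tokens` ends."""
    final = len(tokens)
    active, hits = set(), []
    for pos, ch in enumerate(text):
        # State 0 is always active so that a match may start anywhere.
        active = closure(tokens, active | {0})
        following = set()
        for s in active:
            if s == final:
                continue
            token = tokens[s]
            if token is STAR:
                following.add(s)      # absorb one more character
            elif token is ANY or token == ch:
                following.add(s + 1)  # advance to the next token
        active = closure(tokens, following)
        if final in active:
            hits.append(pos)
    return hits
```

For example, `nfsa_search(["A", STAR, "B"], "AXXB AB")` reports matches ending at positions 3 and 6. The number of possible states is fixed by the pattern before matching begins, so the storage needed for the active set is known in advance, in contrast to replicating and scheduling complete FSA's.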
It is an object of the present invention to provide a system and method which permits the matching of characters from a database stored in a conventional memory system against a pattern whose elements are of arbitrary length. It is a further object of the present invention to provide a method and system which permits the pattern elements to contain tokens which not only specify the exact character encoding to be matched, but also specify arbitrarily defined classes or types of characters acceptable at that location in the pattern. It is a further object of the present invention to provide a method and system which permits the pattern elements to contain tokens specifying that a specified or arbitrary number of characters are acceptable in that location in the pattern.
Additional objects of the present invention are to provide a method and system which avoids the complex sequencing logic required by other FSA-based systems; which avoids the requirements of large state table memories required by other FSA-based systems; which limits the communications and scheduling requirements of the NFSA implemented by using a number of conventional FSA's; which produces a matcher which requires a fixed amount of processing time for each character from the database memory, allowing it to operate synchronized with the database memory system; and which produces a matcher with a regular, memory-based structure and limited connection requirements, permitting implementation techniques such as fabrication as an integrated circuit to be easily employed.