This invention relates to the use of computer technology for the searching of a mass of digitally stored data for the purpose of recognizing specified patterns within the stored data, allowing the retrieval of data associated with the specified patterns. The most common such application is in the searching of stored free-form textual data to locate documents or other common text groupings which contain specified words, phrases, or numbers. However, the invention described herein is applicable to other digital pattern recognition applications such as might involve digitized photographs, maps, etc.
Retrieval of digital data from mass storage systems is performed by addressing that portion of the mass memory which is known to contain the desired data. There are two ways of knowing where that portion is: either by continuously keeping an index of the data versus the storage address, or by serially searching the entire data base for the desired data pattern. When the amount of such data is large, and particularly when the data base is subject to frequent changes and/or searches, ordinary computers have proven to be too slow when employing either of these methods.
As a result, special mechanisms have been designed to handle the cumbersome task of serially searching for desired patterns. The references provide a list of patents (granted and pending), plus published documents relating to such mechanisms. The patents listed are considered to be irrelevant to the mechanisms described in this disclosure, as they employ completely different techniques.
On the other hand, references 6 and 7 both describe the general application of finite state automata (FSAs) to the task of text searching. Reference 6, in particular, describes a machine architecture involving a microprocessor.
1. U.S. Pat. No. 3,435,423: Data Processing Systems; Fuller, Worthy, et al.
2. U.S. Pat. No. 4,044,336: File Searching System With Variable Record Boundaries; Babb.
3. U.S. patent application Ser. No. 772,935: Associative Crosspoint Processor System; Bird and Tu; issued May 1, 1979 as Pat. No. 4,152,762.
4. U.S. patent application (companion to this application): Selection of Finite Success States by Indexing; Bird and Tu; Ser. No. 950,326, filed Oct. 11, 1978.
5. On The Construction and Minimization of Finite State Automata; Millan and de Carvalho; a monograph published by the Pontificia Universidade Catolica do Rio de Janeiro, July, 1971.
6. The Design of a Microprogrammed Finite State Machine for Full-Text Retrieval; Bullen and Millen; AFIPS Conference Proceedings, Fall Joint Computer Conference, 1972, pp. 479-488.
7. A Survey of Regular Expressions and Their Applications; Brzozowski; IRE Transactions on Electronic Computers, June 1962, Volume EC-11.

Prior Method:
The general method of reference 6 employs a table of stored words associated with each state. This table identifies all possible transition states for every possible input character. At each state, the associated table is searched for the character or character group corresponding to the input character; when it is found, an associated stored code indicates the transition state.
In order to conserve table storage, the possible input byte codes are grouped, depending upon the nature of the input queries, such that a given transition may occur for a number of possible inputs. One or more such groups may define default states; i.e. the default state identities are explicitly defined in the table.
The successful completion of a state sequence then requires a stored table and a table search for every state in the sequence. A stored bit at the end of the sequence indicates a successful completion of the sequence.
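The table-per-state scheme described above can be illustrated with a minimal Python sketch. It is not the microprogram of reference 6; the query word ("cat"), the case-folding character groups, the table layout, and the names `TABLES`, `SUCCESS`, `step`, and `search` are all assumptions chosen for illustration. Each state owns a small table that is probed entry by entry, a final catch-all entry plays the role of an explicitly stored default, and a set of success states stands in for the stored success bit.

```python
from typing import FrozenSet, List, Optional, Tuple

# One table entry: (character group, next state).
# A group of None is a catch-all, i.e. an explicitly stored default.
Entry = Tuple[Optional[FrozenSet[str]], int]

# Hypothetical character groups: upper and lower case share one entry,
# so one transition serves several possible input characters.
C, A, T = frozenset("cC"), frozenset("aA"), frozenset("tT")

# Hypothetical tables for the single query word "cat".
TABLES: List[List[Entry]] = [
    [(C, 1), (None, 0)],           # state 0: nothing matched yet
    [(A, 2), (C, 1), (None, 0)],   # state 1: "c" seen
    [(T, 3), (C, 1), (None, 0)],   # state 2: "ca" seen
    [(C, 1), (None, 0)],           # state 3: "cat" seen
]
SUCCESS = {3}  # stands in for the stored success bit


def step(state: int, ch: str) -> Tuple[int, int]:
    """Probe the table of `state` for `ch`; return (next state, probes used)."""
    for probes, (group, nxt) in enumerate(TABLES[state], start=1):
        if group is None or ch in group:
            return nxt, probes
    raise ValueError("table has no default entry")


def search(text: str) -> Tuple[List[int], int]:
    """Return the end positions of every match and the total probe count."""
    state, hits, total_probes = 0, [], 0
    for pos, ch in enumerate(text):
        state, probes = step(state, ch)
        total_probes += probes
        if state in SUCCESS:
            hits.append(pos)
    return hits, total_probes


if __name__ == "__main__":
    print(search("a Cat sat; concat"))
```

Because every input character costs at least one table probe, the total probe count grows with both the length of the stream and the size of each state's table.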
A microprocessor is used for implementing these searches. However, because a search is required at each state, the processing time is too great to keep up with the character stream which can be expected from a mass storage unit such as a disk. The paper indicates that a minimum of 4.5 microseconds plus 0.9 microseconds per table probe (examination of one table entry) is required per state. Since a modern disk can typically supply characters at a rate of 1 million per second, or one character per microsecond, the method is considered at least 5 times too slow.
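The timing argument can be checked with a short calculation. The sketch below uses only the figures quoted above (4.5 microseconds per state, 0.9 microseconds per probe, 1 million characters per second); the function names and the assumption that one input character is consumed per state transition are illustrative.

```python
BASE_US = 4.5                    # fixed cost per state, in microseconds
PROBE_US = 0.9                   # cost of examining one table entry
DISK_CHARS_PER_SEC = 1_000_000   # typical disk character rate


def time_per_state_us(probes: int) -> float:
    """Processing time for one state, given the number of table probes."""
    return BASE_US + PROBE_US * probes


def slowdown(probes: int) -> float:
    """How many times slower than the disk the search runs.

    Assumes one input character is consumed per state transition.
    """
    budget_us = 1_000_000 / DISK_CHARS_PER_SEC  # microseconds per character
    return time_per_state_us(probes) / budget_us


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, "probe(s):", round(slowdown(n), 1), "times too slow")
```

Even with a single probe per state the ratio is about 5.4, consistent with the "at least 5 times" figure, and it grows by 0.9 with every additional probe.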
The invention described herein utilizes standard electronic logic techniques to implement a unique architecture which eliminates or radically reduces the difficulties which the approach of reference 6 presents. The speed capability is sufficient to keep up with a disk character rate, and memory requirements are sufficiently minimized to allow many dozens of queries to be handled simultaneously. In addition, other capabilities, such as numeric ranging, are now feasible.
REFERENCES