This invention relates generally to information processing systems, and more specifically, to special-purpose processors for searching data bases to locate particular patterns of data. This type of processing arises in a number of different contexts, but can be best understood in terms of a search of a data base to locate all the occurrences of a particular word or phrase. In the past, computer software running on conventional hardware has been used to perform such searching, but has been found to suffer from a number of practical limitations.
Sequentially searching a large data base from beginning to end on conventional hardware is likely to take so much time as to be totally impractical. Various software techniques have therefore been used to organize the data in such a way that the system has relatively good performance for what is considered a typical search. These techniques usually involve some type of indexing scheme, in which large tables contain the location or locations of every item in the data base. These index tables may be comparable in size to the actual data base, and they are often cumbersome to build and organize. Moreover, a system that requires index tables is inconvenient to use for searching data bases whose content may vary with time, since the tables must be rebuilt or updated whenever the content changes.
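The trade-off between indexed and sequential searching can be illustrated with a minimal software sketch. The function names build_index and sequential_search are hypothetical and are introduced only for illustration; the sketch assumes words separated by single spaces and does not represent any particular prior system.

```python
from collections import defaultdict


def build_index(text):
    """Build an index table mapping each word to the offsets where it starts.

    Assumes words are separated by single spaces (an illustrative simplification).
    """
    index = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:
            index[word.lower()].append(offset)
        offset += len(word) + 1  # +1 accounts for the separating space
    return dict(index)


def sequential_search(text, word):
    """Scan the text from beginning to end; no index table is required."""
    lowered = text.lower()
    hits = []
    start = lowered.find(word)
    while start != -1:
        hits.append(start)
        start = lowered.find(word, start + 1)
    return hits


text = "the cat sat on the mat"
index = build_index(text)           # must be rebuilt whenever text changes
indexed_hits = index.get("the", [])  # fast lookup once the table exists
scanned_hits = sequential_search(text, "the")
```

The index table holds one entry for every word occurrence, which is why such tables can approach the size of the data base itself, and it becomes stale as soon as the text is modified.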
Even with the use of index structures, software searching is very much dependent on the number and complexity of search conditions imposed for a given search task, and the general-purpose computer employed has an operating system overhead that further slows the searching process. As a result, actual data processing rates that can be obtained are usually only a fraction of the maximum data rates of mass storage devices on which data bases are usually stored.
Because of the limitations of software-controlled searching techniques, hardware devices to aid in the searching process have been devised. These fall into two categories: content-addressable memories and special-purpose processors. Content-addressable memories are memory devices capable of comparing their contents with a pattern presented on a common bus. Such memories are prohibitively expensive for large data bases, and, in any event, have limited utility, since they are typically capable of performing only exact match operations.
Special-purpose processors for data searching employ low-cost memory from which data is accessed by dedicated pattern-matching circuitry. The search conditions are typically stored in the processor prior to the search, and data is fed into the processor during the search. A particularly desirable form of special-purpose processor incorporates all of its logic onto a single integrated-circuit chip, with an expansion capability based on the use of several interconnected chips.
One such processor, by Mead and associates at the California Institute of Technology, uses a 128-bit comparator to compare text input with a resident pattern. (See Mead, C. A., Pashley, R. D., Britton, L. D., Daimon, Y. T., and Sando, S. F., "128-bit Multi-Comparator," IEEE Journal of Solid-State Circuits, SC-11(5):692-695, October, 1976). A mask register allows the equivalent of variable-length "don't care" characters in the pattern. In other words, the pattern may be designated as containing a variable-length segment, the content of which does not affect the matching process.
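The general idea of a masked comparison can be sketched in software as follows. This is only a character-level illustration under stated assumptions: the names masked_match and find_masked are hypothetical, the mask marks fixed positions to be ignored, and the sketch does not model the 128-bit hardware comparator of the cited device or the manner in which it achieves the equivalent of variable-length "don't care" segments.

```python
def masked_match(window, pattern, mask):
    """Return True if window equals pattern wherever mask is True.

    Positions where mask is False are "don't care" positions and are ignored.
    """
    return all(w == p for w, p, m in zip(window, pattern, mask) if m)


def find_masked(text, pattern, mask):
    """Return every offset in text at which the masked pattern matches."""
    if len(mask) != len(pattern):
        raise ValueError("mask and pattern must have the same length")
    n = len(pattern)
    return [i for i in range(len(text) - n + 1)
            if masked_match(text[i:i + n], pattern, mask)]


# The middle character of the pattern is masked out, so any character matches it.
masked_hits = find_masked("cat cot cut", "c?t", [True, False, True])
```

With the middle position masked, the three-character pattern matches "cat", "cot" and "cut" alike.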
Foster and Kung have proposed a systolic pattern-matching chip consisting of two kinds of cells. (See Foster, M. J., and Kung, H. T., "The Design of Special-Purpose VLSI Chips," IEEE Computer, 13(1), January, 1980). The processor does not store the pattern being searched for, so the pattern must be recirculated along a data path parallel to that of the data being searched. The systolic nature of this processor, which implies a pipeline of interconnected cells with each cell sharing signals only with its immediate neighbors, makes it particularly adaptable to high-density layout in integrated circuits.
A second systolic design was proposed by Mukhopadhyay of the University of Central Florida with a structure including a pipeline of a single type of cell. (See Mukhopadhyay, A., "VLSI Hardware Algorithms," In Rabbat, G. (editor), Hardware and Software Concepts in VLSI, ch. 4, pp. 72-94, Van Nostrand Reinhold, 1983). In this system, a pattern is loaded in from one end of the pipeline and text data to be searched is loaded in from the opposite end. The system allows both fixed-length and variable-length "don't care" characters.
Even though these and other proposed systems perform pattern matching at high speeds with various "don't care" capabilities, they do not represent complete data search systems. For example, these systems do not perform Boolean functions or complex proximity functions, and they do not handle approximate matches. Accordingly, a system built around such devices would have an unpredictable response time, depending on whether or not the special hardware could be used in any particular search query. This is, in many ways, the same problem that faces traditional software solutions.
The cross-referenced applications, which are not prior art with respect to the present invention, represented a significant step forward in the solution of the problems associated with the prior art. However, the system disclosed and claimed in the cross-referenced applications is limited in some important respects. In particular, the earlier system could handle a limited number of mis-spellings in the text being searched, but was unable to deal with missing or extra characters in the text. Without the ability to handle missing or extra characters, an occurrence of a search pattern that appears in the text with only a minor mis-spelling could be missed in the text search.
It will be appreciated that there is still room for improvement over the system disclosed and claimed in the cross-referenced applications. Ideally, a search processor should have the capability of recognizing search patterns even if the text contains a limited number of extra characters or missing characters, as well as mis-spellings with the correct number of characters. The present invention is directed to this end, and to providing a high-speed text-searching system capable of performing a large number of different search functions.
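The distinction between mis-spellings having the correct number of characters and mis-spellings involving missing or extra characters can be illustrated with a short software sketch. It is offered only as an illustration of the matching problem and is not a description of the hardware of the invention. The function names hamming_find and approx_find are hypothetical; approx_find uses a standard dynamic-programming formulation of approximate matching, in which each substituted, missing or extra character counts as one error.

```python
def hamming_find(text, pattern, max_errors):
    """Match with substitutions only: the text window must have the pattern's length.

    Returns the end offsets (exclusive) of matching windows.
    """
    m = len(pattern)
    return [i + m for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= max_errors]


def approx_find(text, pattern, max_errors):
    """Match allowing substituted, missing and extra characters in the text.

    Returns the end offsets (exclusive) at which some substring of text
    matches pattern with at most max_errors errors.
    """
    m = len(pattern)
    prev = list(range(m + 1))  # before any text is read, i pattern chars cost i errors
    hits = []
    for j, ch in enumerate(text, start=1):
        cur = [0]  # a match may begin at any position in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur.append(min(prev[i - 1] + cost,  # match or substituted character
                           prev[i] + 1,         # extra character in the text
                           cur[i - 1] + 1))     # character missing from the text
        if cur[m] <= max_errors:
            hits.append(j)
        prev = cur
    return hits


missing_text = "a serch here"  # the "a" of "search" is missing
extra_text = "sexarch"         # an extra "x" has been inserted
```

In this sketch, a substitution-only search for "search" with one permitted error finds nothing in either sample text, because a single missing or extra character shifts every subsequent character out of alignment, whereas the search that allows missing and extra characters reports a match in both.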