1. Field of the Invention (Technical Field)
The present invention relates to the field of digital electronic devices for recognizing specified patterns in a data stream. Specifically claimed is a general purpose architecture and device for set theoretic processing.
2. Description of Related Art
In the pattern recognition field, whether in searching textual data or other forms of source data, four problems persist, notably acuity, throughput, scalability and cost. Accordingly, these are the measures of effectiveness for any new technology in this field.
Acuity is measured as the sum of False Positives and False Negatives in the results of a scan. False Positives occur when the scan returns irrelevant patterns. False Negatives occur when the scan fails to identify patterns that are, in fact, relevant. The ideal pattern recognition device will not make such errors. The next best device will enable a user to control the amount of error as well as the ratio of False Positives to False Negatives and to trade off acuity vs. throughput and/or cost for each user situation.
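The acuity measure described above can be illustrated with a minimal sketch. The function name and the document identifiers are hypothetical and serve only to show the arithmetic; nothing here is taken from the invention itself.

```python
def acuity_error(relevant, retrieved):
    """Return (false_positives, false_negatives, total_error) for one scan.

    `relevant` holds the items that truly satisfy the user's interest;
    `retrieved` holds the items the scan actually reported.
    """
    relevant = set(relevant)
    retrieved = set(retrieved)
    false_positives = retrieved - relevant   # reported but irrelevant
    false_negatives = relevant - retrieved   # relevant but missed
    total = len(false_positives) + len(false_negatives)
    return len(false_positives), len(false_negatives), total


# Hypothetical scan: three relevant items, four reported items.
fp, fn, total = acuity_error(
    relevant={"d1", "d2", "d3"},
    retrieved={"d2", "d3", "d4", "d5"},
)
print(fp, fn, total)
```

A lower total indicates better acuity, and the two counts can be reported separately when a user wishes to control the ratio of False Positives to False Negatives.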
As to scalability, pattern matching involves identifying symbols and syntax identical to the user's expression of interest. Pattern recognition goes further by enabling the specification of a class of equivalent terms and by aggregating all qualifying instances. This distinction became important as search technology progressed from text search to imagery search to detection of malware for cybersecurity and for detection of evolving patterns in biocomplexity informatics. Accordingly, minimizing False Positives and False Negatives may entail reference patterns that specify upwards of 100 terms each averaging 10 features. The vaunted Google search engine experienced an average query of 2.3 terms, circa 2004. Modern recognition devices should have the capacity to compare upwards of 1000 features to the source data stream and be scalable to 1 million features. Other dimensions of scalability include implementation of a personal-scale device to a federated system of millions of devices throughout the World Wide Web.
Throughput is the speed at which the source data is examined and results reported. The typical measure of throughput is characters per second. Constant or at least predictable throughput is best, not dependent on the complexity of the reference pattern nor the number of recognitions per unit time.
Cost is the total cost of ownership of a search episode regardless of how the costs are allocated. For example, a user pays nothing for a Google search episode but somebody is paying for the MIPS (millions of instructions per second), bytes and baud that are used to accomplish the search. Cost includes the preprocessing of source data as well as the actual search episode.
Starting approximately in the mid-1900's, various strategies have sought to improve performance in one or more of the measures by leveraging technologies and innovating architectures. The state of the art is still far from realizing, simultaneously, specifiable acuity, throughput at the speed of silicon, scale from single user to federated users and cost levels that small and medium enterprises and even individuals can afford.
The following paragraphs summarize the field to date.
The general purpose computer, designed to perform arithmetic, has been used since the 1950's for comparison of digital data in the form of characters, character strings and combinations of character strings. A straightforward program can utilize the memory and CPU (central processing unit) to input a reference character then compare it, sequentially, to multiple characters of various kinds as presented by a source data stream. The presence in the source data stream of a matching character can be flagged for subsequent reference. More than one such character may be found and flagged and more than one instance of any specific character may be found and flagged. Each such operation consumes several clock cycles of a modern microprocessor.
If the reference is a string consisting of multiple characters in a specific sequence, e.g., representing a word of text, a music melody or a genomic pattern, then a more complicated program is required. Character-level comparison proceeds as before, then the interim results are stored for subsequent processing to determine whether the characters that qualified are in the sequence required for a word-level match. This is known as the combinatorial explosion problem because the number of machine cycles increases as the square of the number of conditional matches. Such operations typically consume thousands of clock cycles and there is no upper limit.
If the reference consists of a phrase, e.g., a string of words in a specific order, then the program, using recursion, is only somewhat more complicated but the combinatorial explosion can become even more dramatic and can consume billions of clock cycles. Although grid configurations of megaflop processors can supply the clock cycles, the expense of the device quickly becomes prohibitive and throughput can be in the range of only a few characters per second.
The dismal acuity exhibited by pattern matching machines to date stems from attempts to avoid the combinatorial explosion problem. Most text applications use key words as surrogates for the source data. Searches then compare the reference pattern to only the key word file, not to the actual text. This approach invokes the well known problem of retrieving citations that are irrelevant (false positives), thus incurring waste and cost. A not so well known but worse outcome is that truly relevant patterns in the data being scanned are not recognized (false negatives) because the content was not represented with sufficient fidelity by the key words used.
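The two-stage software approach described above can be sketched as follows. This is only an illustration of the general idea, assuming an in-memory stream; the function name is hypothetical and the code is not drawn from any cited prior art.

```python
def find_word(word, stream):
    """Locate a word in a stream using stored character-level results."""
    if not word:
        return []
    # Stage 1: character-level comparison. For each character of the
    # reference word, store the stream positions where it matched.
    interim = [
        {pos for pos, symbol in enumerate(stream) if symbol == ch}
        for ch in word
    ]
    # Stage 2: post-process the interim results. A start position
    # qualifies only if each later character matched at the next offset.
    return [
        start
        for start in sorted(interim[0])
        if all(start + offset in interim[offset]
               for offset in range(1, len(word)))
    ]


print(find_word("ana", "banana"))
```

The interim sets grow with the number of conditional matches, which is the source of the extra machine cycles noted in the text.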
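The false negative caused by key word surrogates can be shown with a small sketch. The documents, key words and function names below are invented for illustration only.

```python
def keyword_search(term, keyword_index):
    """Search only the key word surrogate of each document."""
    return {doc for doc, keys in keyword_index.items() if term in keys}


def full_text_search(term, documents):
    """Search the actual text of each document."""
    return {doc for doc, text in documents.items()
            if term in text.lower().split()}


# Hypothetical source data and the key words an indexer assigned to it.
documents = {
    "doc1": "aspirin relieved the headache within an hour",
    "doc2": "the committee met on tuesday",
}
keyword_index = {
    "doc1": {"headache", "relief"},
    "doc2": {"committee", "meeting"},
}

print(keyword_search("aspirin", keyword_index))   # surrogate misses doc1
print(full_text_search("aspirin", documents))     # actual text contains it
```

In this sketch the key word file never recorded "aspirin", so the relevant document is missed even though the term is present in the text.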
Attempts to extract meanings from text are frustrated by the complexity of the problem but more so by the indexing to terms as noted in the previous paragraph. Implementations of pattern recognition have been demonstrated with pre-processing and post-processing software, such as statistical clustering and Latent Semantic Indexing. In essence these seek to overcome the limitations inherent in indexed or censored text streams by further processing of the term matches found by the hardware. This approach to recovery of meanings in the censored text can never overcome the limitations imposed by the censoring in the first place.
    United States Patent Publication 2005/0154802, Parallel pattern detection engine. Multiple processing units (PUs) customized to do various modes of pattern recognition. Each pattern has an Opcode e. PUs may be cascaded to enable longer patterns to be matched or to allow more patterns to be processed in parallel for a particular input data stream. Cost and throughput are highly suspect. Also, the application appears to implement nesting but not equivalence classes.
SIMM, SIMD and similar hardware CPU embellishments have been invented for data flow applications. These have demonstrated speed improvements in the single digit range but also increased cost.
    United States Patent Publication 2005/0257025, State engine for data processor. Uses parallel processors, such as SIMD array processors. Claims that a read/modify/write operation can be performed in only two cycles and a complete command in only three to five cycles. These performances appear to be for fundamental pattern matching but not for partial matches and consideration of variants such as plurals.
Supercomputer configurations and more recently grid configurations of microprocessors have been programmed for set theoretic processing both for Very Large Data Base situations and genomic research. Cost inhibits most potential users from this option.
In the limit, the basic von Neumann stored program computer simply cannot exhibit the speed/cost ratios available with other implementations.
The special purpose processor category contains many examples of prior art. Early examples include the General Electric series:
    U.S. Pat. No. 3,358,270, December 1967.
    U.S. Pat. No. 4,094,001, Digital logic circuits for comparing ordered character strings of variable length, Jun. 6, 1978.
    U.S. Pat. No. 4,451,901, High speed search system, May 29, 1984.
Also in this category is the TRW series:
    U.S. Pat. No. 5,051,947, High-speed single-pass textual search processor for locating exact and inexact matches of a search pattern in a textual stream, Jun. 6, 1978.
    U.S. Pat. No. 4,760,523, Fast search processor, Jul. 26, 1988.
Being complex logic devices, special purpose processors were not only expensive to produce (product cost exceeding $10,000) but also subject to failure rates that frustrated operational users. Further, the designs were neither scalable nor extensible. The method of performing character/character set comparisons limited them to pattern matching rather than pattern recognition.
    U.S. Pat. No. 4,747,072, Pattern addressable memory, May 24, 1988. Performance shortfall from rule based active construction of variable content in key words. Does not accommodate equivalence classes.
    United States Patent Publication 2003/0055799, Self-organizing data driven learning hardware with local interconnections. Does not handle equivalence classes. Throughput shortfall.
    U.S. Pat. No. 4,531,201. Patterns limited in length to size of shift registers.
    U.S. Pat. No. 4,625,295. Can handle 16-bit characters at loss of throughput. Length of shift registers limits length of words. Cannot detect Kleene closures. Decoder requires a delay to decode a character. Does not handle equivalence classes.
Being ever more complicated, logic devices using parallelism for throughput in various string pattern matching scenarios each exhibited cost limitations and were limited in scalability as well. Also, precise internal timing of the logic circuits made it nearly impossible to re-implement them as semiconductor technology advanced.
Associative Memories and Content Addressable Memories have been used to reduce the number of clock cycles required to fetch data into registers. While improving performance, these approaches do not reduce costs and none to date has yielded significant improvements regarding the acuity measure of effectiveness.
    United States Patent Publication 2004/0123071, Cellular engine for a data processing system. Matching device. Does not support Boolean, semantic or set theory logic.
    United States Patent Publication 2004/0080973, Associative memory, method for searching the same, network device, and network system. Associative memory carries out a search operation in plural fields. Does not appear to support complex reference patterns. Three memories, three cycles indicates shortfalls in throughput and acuity.
    United States Patent Publication 2003/0229636, Pattern matching and pattern recognition system, associative memory apparatus, and pattern matching and pattern recognition processing method. Word-level, multiple cycles.
    United States Patent Publication 2003/0014240, Associative memory device with optimized occupation, particularly for the recognition of words. Spreadsheet approach, relationships among classes not supported. Logic limited.
    United States Patent Publication 2004/0250013, Associative memory system, network device, and network system. Includes Reset to find second instance of a pattern. Does not support equivalence classes.
Neural Net based recognizers have been used for string pattern matching in order to implement rapidly adaptive reference patterns. These have proven effective in specific applications such as email spam signature identification and adaptive tracking but do not exhibit the speed and performance for general application.
    United States Patent Publication 2005/0049984, Neural networks and neural memory. Does not support equivalence classes.
    United States Patent Publication 2002/0059152, Neural processing module with input architectures that make maximal use of a weighted synapse array. Symbol syntax but not semantics.
    United States Patent Publication 2002/0032670, Neural network processing system using semiconductor memories. Aggregations are linear.
Application-specific devices have been devised for pattern matching of 2D and 3D images, but do not have the Boolean, semantic and set theory logic.
    United States Patent Publication 2002/0125500, Semiconductor associative memory. Emphasizes processing versus use of memory and systolic operations. Shortfall in throughput and cost.
    United States Patent Publication 2002/0168100, Spatial image processor. Assumes location implicit relationships in data stream (e.g., pixels).
    United States Patent Publication 2003/0194124, Massive training artificial neural network (MTANN) for detecting abnormalities in medical images. Uses sequential stored program.
    United States Patent Publication 2004/0156546, Method and apparatus for image processing. Defines three categories of processing: object-independent processing using a plurality of processors, each of which is associated with a different one of the pixels of the image; object-dependent processing using a symmetric multi-processor; and object composition, recognition and association, using a unified and symmetric processing of N dimensions in space and one dimension in time. The plurality of processors may form a massively parallel processor of a systolic array type configured as a single-instruction multiple-data system. The plurality of processors is formed on a semiconductor substrate different from the semiconductor substrate on which images are captured.
Systolic Arrays use pulse propagation through preformed switching networks to parallelize logical relationships without resorting to the von Neumann paradigm. These are much faster and less expensive than sequential processors. Inventions and embodiments to date have been application specific and have not allowed sufficiently quick reconfigurations of the switching network.
The field started with machines that detected Match or No Match, one term at a time. Next came full Boolean operators across collections of terms. Then the addition of Don't Care logic to the Match and No Match choices allowed detection of partial matches. Delimiters enabled detection of syntactic clues such as end of word, end of sentence, end of paragraph, end of section, end of file, end of record, etc. Next came the capability to detect strings of terms (e.g. phrases).
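The progression described above (Match, No Match, Don't Care, delimiters and Boolean operators across terms) can be sketched in software as follows. The `?` marker, the delimiter set and the function names are assumptions chosen for illustration; they are not taken from any of the machines discussed.

```python
import re

DONT_CARE = "?"  # hypothetical marker for a Don't Care position


def term_matches(pattern, word):
    """Match / No Match per position, with Don't Care positions skipped."""
    return len(pattern) == len(word) and all(
        p == DONT_CARE or p == w for p, w in zip(pattern, word)
    )


def scan(patterns, text, require_all=True):
    """Apply Boolean AND (require_all) or OR across a collection of terms."""
    # Delimiters (whitespace and common punctuation) mark end of word.
    words = [w for w in re.split(r"[\s.,;:!?]+", text.lower()) if w]
    hits = [any(term_matches(p, w) for w in words) for p in patterns]
    return all(hits) if require_all else any(hits)


text = "Women voted. Men watched."
print(term_matches("wom?n", "woman"), term_matches("wom?n", "women"))
print(scan(["wom?n", "vot??"], text))
print(scan(["wom?n", "sen??e"], text, require_all=False))
```

Each pattern position yields Match, No Match or Don't Care, and the per-term results are then combined by a Boolean operator, mirroring the historical sequence in the paragraph.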
Throughout was the presumption that the search would be composed and expressed by humans in the form of queries. Search engines or adjunct software did not greatly aid how humans formulated queries. Meanwhile, linguists created very sophisticated preprocessing and post processing software but these did not make significant improvements in acuity, throughput, scalability or cost, let alone improving all at the same time.
Many of the advancements were paced by Moore's price/performance law in the semiconductor field. The field of artificial intelligence paralleled the search engine field but unfortunately without significant cross-disciplinary sharing. The advent of knowledge management in the 1990's made many more people aware of lexicons and taxonomies, the [then] means of expressing the relationships among entities in addition to describing the entities. The advent of semantic web development in 2001, intended to enable computers to interchange data based on the meaning of the data instead of just on its location in a format, has fostered expressions of knowledge models as formal ontologies.
Set theory has proven useful for making assertions about relationships. The advent of digital image pattern search has advanced its use considerably. Currently, set theoretic expressions are as prevalent as Boolean logic and algorithmic operators. This has prompted distinctly new machine architectures featuring systolic arrays and data flow-facilitation examples. Interest is now expanding from facilitation of computer system data interchanges to facilitation of human knowledge interchanges and to facilitation of interchanges among diverse, distributed systems of humans. Applying digital devices to the disambiguation of human communications is the next wave in this field. The astounding complexity of this challenge motivates development of a general purpose machine that can execute a variety of such expressions. The present invention provides such a method and apparatus.
In order to locate all of the data objects relevant to a given referent, and only those objects, it is necessary to overcome the heterogeneity of the subject data. Heterogeneity exists on two levels. The first level for digital text is transcription variation such as differences in spelling, spelling errors, typographical errors, punctuation differences, spacing variation, the presence of “special” bytes used to control the display or transmission medium but which themselves carry no meaning, and recently, obfuscation characters intended to spoof spam detectors. A popular example of the latter is:
    “Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.”
Analogous variations in other types of digital data include, for example, DNA spelling errors, background noise, and varying pronunciation.
The second level of heterogeneity is the semantic level. Humans are gifted inventors of different ways of expressing the same idea. This means that for every component of a referent many variations may be possible: varying sentence and paragraph structuring; many figures of speech (synonyms, allegory, allusion, ambiguity, analogy, eponym, hyperbole, icon, index, irony, map, metaphor, metonym, polysemous meaning, pun, sarcasm, sardony, sign, simile, synecdoche, symbol, token, trope); and class, subclass, idioms, and super class words and expressions.
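One conventional software way to absorb this kind of transcription variation is to map each word to a canonical key that ignores the order of its interior letters. The sketch below is offered only as an illustration of the problem; the key scheme, the vocabulary and the function names are assumptions and do not describe the present invention.

```python
def scramble_key(word):
    """Canonical key: first letter, sorted interior letters, last letter."""
    word = word.lower()
    if len(word) <= 3:
        return word  # too short to have scrambled interior letters
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


def unscramble(text, vocabulary):
    """Replace each token by the vocabulary word that shares its key."""
    lookup = {scramble_key(w): w for w in vocabulary}
    return " ".join(lookup.get(scramble_key(t), t) for t in text.split())


# Hypothetical vocabulary covering a fragment of the quoted passage.
vocabulary = ["read", "word", "whole", "thing"]
print(unscramble("raed the wrod as a wlohe", vocabulary))
```

Such a key handles only pure transpositions of interior letters. Tokens that add, drop or substitute letters, as some in the quoted passage do, need a different treatment, which is part of why transcription variation is difficult to overcome in software.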
Accordingly, rather than looking for the occurrences of a word or phrase, or even a few words or phrases, in a body of text, Ashby's Law of Requisite Variety (for appropriate regulation the variety in the regulator must be equal to or greater than the variety in the system being regulated) demands a way of finding all of the expressions equivalent to a set of referents. A responsive machine must be able to process set theoretic operators as well as Boolean operators and semantic operators such as precedence and aggregation. In this document the set of strings equivalent to a referent is called an equivalence class. A description of the members of an equivalence class is called a Reference Pattern.
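The notion of an equivalence class can be illustrated with a short software sketch in which each class is a set of equivalent strings, all qualifying instances are aggregated, and classes are combined with a Boolean AND. The class contents and function names are hypothetical, and the sketch does not represent the claimed apparatus.

```python
import re


def find_class_members(equivalence_class, text):
    """Aggregate every occurrence of any class member as (offset, member)."""
    lowered = text.lower()
    hits = []
    for member in equivalence_class:
        pattern = r"\b" + re.escape(member.lower()) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), member))
    return sorted(hits)


def recognizes(classes, text):
    """Boolean AND across classes; any member satisfies its own class."""
    return all(find_class_members(c, text) for c in classes)


# Hypothetical equivalence classes for two referents.
vehicle = {"car", "automobile", "motor vehicle"}
purchase = {"buy", "purchase", "acquire"}

print(find_class_members(vehicle, "A car is an automobile."))
print(recognizes([vehicle, purchase], "She plans to purchase an automobile."))
print(recognizes([vehicle, purchase], "She plans to sell a car."))
```

Within a class the members are alternatives (set union), while the combination of classes expresses the Boolean relationship among referents.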
Acuity is achieved by providing a means to scan a data stream for content fulfilling Reference Patterns of sufficient selectivity and sensitivity to perceive just the digital objects of interest. Simultaneously, throughput, scalability and cost must be achieved as well.