This invention is a particularly efficient method, and hardware apparatus, for testing membership of an item in a large set. A set is a collection of items. A set membership test is an operation upon a given item and a set. The operation is a predicate function which returns the value "true" if the item is a member of the set and "false" otherwise. The present invention has been reduced to practice with a set consisting of the most frequently occurring English words. However, it should be understood that the present invention is not limited to testing membership of a word in a given set of words, but may be applied to testing whether any given item is a member of any given set.
Content addressable memory is a special memory hardware device which allows access to data items not only by location, in the manner of traditional memories, but also by the contents or value of a table item. In this manner content addressable memory performs a parallel search through all available memory to find all matches. In contrast, this invention performs a set membership test by utilizing a pseudo content addressable function based on hashing.
The present invention comprises both software and hardware elements and may be conveniently identified by the name "Micromark". The Micromark system, or Micromark, is essentially a device for testing membership in a set. It is particularly efficient because it employs a very efficient piece of hardware, called a hash board, together with an efficient and inexpensive form of memory which interacts with this hash board. Micromark, therefore, is a general purpose list matcher and comparer, embodied in hardware, which allows any membership question to be answered very efficiently. Micromark performs this function with a process which is also considered unique, in that its conceptual approach is not considered known in the prior art.
At this point, applicants note that there is a basic algorithm involved here, and this algorithm forms one principle upon which the Micromark system is based. The "algorithm" is per se known in the prior art, since a mathematical analysis of binary hash coding, with allowable errors, was discussed in a 1970 article by Mr. Burton H. Bloom. Nevertheless, the present approach is not strictly a mathematical exercise, insofar as there is truly no algorithm that is meant to be preempted. This procedure is not equivalent to an algorithm for set membership testing, in that the procedure yields answers with a small, finite, known probability of error.
The procedure might more properly be called a "HEURISTIC" (i.e., a rule of thumb) for determining set membership.
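As an illustration only, the general idea of hash coding with allowable errors can be sketched in software. The sketch below is not the Micromark hash board; the class name HashBitTable, the table size, the number of hash functions, and the use of SHA-256 to derive bit positions are all assumptions chosen for the example. It shows the property noted above: an item that was stored is always reported as a member, while an item that was never stored may occasionally be reported as a member by chance.

```python
import hashlib


class HashBitTable:
    """Hypothetical bit table for approximate set membership (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits          # assumed table size, in bits
        self.num_hashes = num_hashes      # assumed number of hash functions
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one digest function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Storing an item sets every bit that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "true" only if every hashed bit is set; unrelated items can
        # collide on all bits, which is the allowable error.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


table = HashBitTable()
for word in ("the", "of", "and", "to"):
    table.add(word)

print("the" in table)   # a stored word is always reported as a member
```

Note that the table stores only bits, not the words themselves, so a word cannot be recalled from it; the table can only answer whether a word appears to be a member.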
The preferred embodiment of the present invention is illustrated as a spelling-checking technique, where it is not necessary to be able to recall words from a memory device. Rather, it is only necessary to know whether or not a particular word is a member of a set, i.e., whether a particular word is spelled in the same manner as any word within the set. To further illuminate the background of the preferred embodiment, applicants will now discuss prior art types of spelling-checking devices, and particularly those wherein a vocabulary of words is matched to a word under scrutiny in a straight content-type and addressable lookup manner.
In modern editorial/composition systems, text is stored on disks or other mass storage devices. Programmers view text as a sequence of characters. For spelling-checking, it is necessary to organize the characters into words. Let us assume that we have a procedure which will tell us if a given word is correctly spelled. We could call this routine TESTSP (TEST SPelling). Given a character sequence, or string, TESTSP would test the string and would yield the result "YES" if the word is okay, or "NO" if the word is erroneous.
A program which proofs spelling and uses a TESTSP routine would read the sequence of characters that comprises a story or article. It would then break this sequence into a series of words and, for each word, would invoke TESTSP. If TESTSP returned "NO" for a word, the proofing program would mark the word for later revision. The result of this process would be a new story file with error words specially denoted. With an interactive system, editors could correct marked error words on a VDT. Interactive spelling-proofing could be an integral part of an on-line text editing procedure.