In an analytical name search engine, a query name is matched by comparing it with every name in the one or more data lists being searched and calculating a similarity score for each pair of names. Data list names whose similarity score exceeds a user-defined threshold are then returned to the user. Because the data lists being searched can be very large, comparing every name in this way may be very slow.
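The exhaustive comparison described above may be sketched as follows. This is a minimal illustration only: the names `similarity` and `exhaustive_search` are hypothetical, and Python's `difflib.SequenceMatcher` is used purely as a stand-in scorer, since the actual similarity calculation is not specified here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption: the real scorer differs)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def exhaustive_search(query, data_list, threshold):
    """Score the query against every name and keep those above the threshold."""
    matches = []
    for name in data_list:  # one full comparison per data list name
        score = similarity(query, name)
        if score > threshold:
            matches.append((name, score))
    return matches


names = ["JOHNSON", "JONSEN", "JACK", "YOGI"]
results = exhaustive_search("JOHNSON", names, threshold=0.7)
print(results)
```

The cost of this loop grows in direct proportion to the number of names in the data list, which is the slowdown a filter is meant to avoid.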
One way to improve speed is to use a bitmap filter. A character n-gram may be described as a group of n adjacent characters (e.g., in YOGI, GI is a character n-gram where n=2). A bitmap may be designed so that each bit in the bitmap represents a character n-gram; the character n-grams in a name turn on individual bits in the bitmap to form a bitmap signature for that name. Names whose bitmap signatures do not overlap to the degree of the user-defined threshold are not subjected to more in-depth comparison, thereby speeding up the search.
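A bitmap signature and the corresponding filter test may be sketched as below. Several details are assumptions made for illustration: names are padded with `_` to stand for a boundary space, n-grams are assigned to 64 bits by taking their alphabet index modulo 64 (the hypothetical `bit_for`), and overlap is measured as shared bits divided by bits set in either signature. None of these choices is prescribed by the description above.

```python
ALPHABET = "_ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # "_" stands for a space (assumption)
NUM_BITS = 64


def ngrams(name, n=2):
    """Return the set of character n-grams of a name, padded at both ends."""
    padded = "_" + name.upper().replace(" ", "_") + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def bit_for(gram):
    """Hypothetical assignment: alphabet index of the n-gram, modulo 64.

    Raises ValueError for characters outside ALPHABET.
    """
    index = 0
    for ch in gram:
        index = index * len(ALPHABET) + ALPHABET.index(ch)
    return index % NUM_BITS


def signature(name, n=2):
    """Turn on one bit per n-gram to form the bitmap signature."""
    sig = 0
    for gram in ngrams(name, n):
        sig |= 1 << bit_for(gram)
    return sig


def passes_filter(query, candidate, threshold):
    """Return True if the two signatures overlap enough for full comparison."""
    q, c = signature(query), signature(candidate)
    shared = bin(q & c).count("1")
    total = bin(q | c).count("1")
    return total > 0 and shared / total >= threshold


print(passes_filter("JOHNSON", "JOHNSON", 1.0))
print(passes_filter("JACK", "YOGI", 0.5))
```

Because the filter needs only a bitwise AND and a bit count per candidate, it is far cheaper than a full similarity calculation.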
However, the number of potential character n-grams is large (e.g., using only alphabetic characters and a space results in 729 bigram combinations for n=2 and 18,954 trigram combinations for n=3). A one-to-one mapping of character n-grams to bits is therefore beyond the scope of current processing capabilities, which typically involve a bitmap of 64 bits. This limitation on the size of the bitmap filter forces overloading of the bits, such that multiple character n-grams are assigned to the same bit.
For example, using n=2, if bigrams were assigned to 64 bits in a round-robin fashion, a bitmap may have a single bit representing the following bigrams: AA, CK, EU, HD, JN, LX, OG, QQ, VJ, XT. Any two names that contain any of these bigrams would have the same bit turned on in their bitmap signatures. The unrelated names JACK and YOGI would, therefore, appear to be at least partially related, since the CK in JACK and the OG in YOGI would turn on the same bit. Sending unrelated name records through for more in-depth comparison because of a similarity that arises only from this kind of overloading can reduce system performance. In addition, this situation may reduce search precision, since sending more unrelated names through for further comparison will likely result in incorrect search returns.
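The round-robin overloading in this example may be reproduced with the short sketch below. The ordering of the alphabet (a space, written `_`, followed by A through Z) and the helper name `round_robin_bit` are assumptions; under that ordering, the bit that receives CK also receives every bigram listed above, together with the bigram formed by T followed by a space.

```python
ALPHABET = "_ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed order: space ("_"), then A-Z
NUM_BITS = 64


def round_robin_bit(bigram):
    """Assign bigrams to bits in alphabet order, wrapping around after 64."""
    first, second = (ALPHABET.index(ch) for ch in bigram)
    return (first * len(ALPHABET) + second) % NUM_BITS


# Group all 27 * 27 = 729 bigrams by the bit they land on.
groups = {}
for a in ALPHABET:
    for b in ALPHABET:
        groups.setdefault(round_robin_bit(a + b), []).append(a + b)

shared_bit = round_robin_bit("CK")
print(shared_bit, groups[shared_bit])
print(round_robin_bit("OG") == shared_bit)  # True: CK and OG collide

# Every bit is overloaded with 11 or 12 bigrams.
print(min(len(g) for g in groups.values()), max(len(g) for g in groups.values()))
```

With 729 bigrams spread over 64 bits, no assignment can avoid placing at least eleven bigrams on some bit, so collisions of this kind are unavoidable rather than a quirk of round-robin ordering.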
In addition to allowing more unrelated names to be sent through for further analysis, an arbitrary distribution of bigrams into bitmap positions may prevent legitimately related names from undergoing further comparison due to bitmap overloading. A bitmap distribution may be described as a table in which each column represents one or more character n-grams, and each row represents a name to be compared. The bitmap distribution is used to create bitmap signatures. Each row in the table for a name has a bitmap signature made up of bits that are set or not set according to the n-grams of that name and the distribution. Suppose, for example, that the bigrams NS, ON, and JO are all assigned to the same bitmap position, and that the bigrams _J and N_ (where _ denotes the space at the start or end of a name) are assigned to another shared bitmap position. Then, the two names JOHNSON and JONSEN have only two common bits set in their bitmap signatures. These two possibly related names would not be subject to further comparison, since there are not enough bitmap positions in common.
In this example, only two bitmap positions are in common, even though the two names contain ten distinct bigrams between them, five of which they share. The two clashes of bigrams (i.e., sets of multiple bigrams that are assigned to the same bitmap position), one of two bigrams and one of three bigrams, prevent there being enough bitmap positions in common to send the names through for the full similarity calculation. This situation will result in lower recall scores, since some legitimate matches will be excluded from further comparison.
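The JOHNSON and JONSEN example may be checked with the following sketch. The two clash groups are taken from the example; the remaining assignment (every other bigram receiving its own bit) and the helper names are assumptions made only so that the counts can be computed.

```python
def bigrams(name):
    """Return the set of bigrams of a name, with "_" marking each boundary."""
    padded = "_" + name + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


# Clash groups from the example: each group shares one bitmap position.
CLASHES = [{"NS", "ON", "JO"}, {"_J", "N_"}]


def assign_bits(all_bigrams):
    """Hypothetical distribution: one bit per clash group, one per other bigram."""
    bit_of = {}
    for bit, group in enumerate(CLASHES):
        for gram in group:
            bit_of[gram] = bit
    next_bit = len(CLASHES)
    for gram in sorted(all_bigrams):
        if gram not in bit_of:
            bit_of[gram] = next_bit
            next_bit += 1
    return bit_of


a, b = bigrams("JOHNSON"), bigrams("JONSEN")
bit_of = assign_bits(a | b)
bits_a = {bit_of[gram] for gram in a}
bits_b = {bit_of[gram] for gram in b}

print(len(a | b))            # 10 distinct bigrams across both names
print(len(a & b))            # 5 bigrams actually shared
print(len(bits_a & bits_b))  # 2 bitmap positions shared after overloading
```

The five shared bigrams collapse into two shared bitmap positions, so a filter that counts common positions sees far less overlap than the names actually have.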