Gathering and retaining information on broad topics such as equipment, business transactions, medical records, and people has increased over the years as computers have made it easier to gather, store, and manipulate that information (i.e., data). Databases are now maintained to track everything from business trends to terrorists.
To organize and improve access to the data stored in databases, the data is often indexed. Typically, an indexing technique generates a key for each element of the data to be indexed (i.e., each data string in the database) and then uses an available indexing structure, such as a binary tree or a B-tree, to assign the keys to index nodes. In equality indexing, the data strings themselves act as the keys. In conventional fuzzy indexing systems, the key is generated using an algorithm such as SOUNDEX or METAPHONE.
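The two-step process just described (generate a key, then assign the key to an index node) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: a plain dictionary stands in for the binary tree or B-tree, `toy_fuzzy_key` is a hypothetical key function that is neither SOUNDEX nor METAPHONE, and the sample strings are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def equality_key(data_string: str) -> str:
    """Equality indexing: the data string itself acts as the key."""
    return data_string


def toy_fuzzy_key(data_string: str) -> str:
    """Hypothetical fuzzy key (NOT Soundex or Metaphone): keep the first
    letter and drop every later vowel, treating Y as a vowel."""
    s = data_string.upper()
    return s[:1] + "".join(ch for ch in s[1:] if ch not in "AEIOUY")


def build_index(
    data_strings: Iterable[str], make_key: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Assign each data string to the index node selected by its key.
    A dict stands in for a tree-based index structure."""
    index: Dict[str, List[str]] = defaultdict(list)
    for s in data_strings:
        index[make_key(s)].append(s)
    return dict(index)


# Illustrative records only; they do not come from any real database.
records = ["SMITH", "SMYTH", "JONES"]
equality_index = build_index(records, equality_key)
fuzzy_index = build_index(records, toy_fuzzy_key)

print(equality_index)  # three nodes, one string per node
print(fuzzy_index)     # SMITH and SMYTH share the node keyed "SMTH"
```

The only difference between the two indexes is the key function: with equality keys every distinct string lands in its own node, while a fuzzy key can place similar strings in the same node.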
For example, using SOUNDEX, the data string “JULIANO” is keyed as JLN and the data string “JUKIANO” is keyed as JKN. The two different keys, JLN and JKN, are then indexed in two different nodes. Thus, while conventional fuzzy indexing systems may match more broadly than equality indexing, in some instances the keys generated by a fuzzy indexing system are assigned to separate nodes just as in equality indexing. Accordingly, using the SOUNDEX indexing technique, a query for JULIANO does not match JUKIANO, even though the difference between these data strings may simply be the result of a typographical error. Errors in databases can be caused by both manual and automatic data entry. When subsequent searches fail to find relevant data records, information may be missed or duplicated in a database system. This may result in inaccurate or missing information and prevent a complete picture of a customer's, patient's or terrorist's activity within the database system.
As mentioned above, conventional fuzzy indexing systems, such as SOUNDEX, METAPHONE and DOUBLE METAPHONE, are used in the data warehousing industry to index data. Even the logic of conventional fuzzy indexing systems, however, may not associate similar data strings with the same node of an index, and it is not powerful enough to match strings such as JOHN and DON, or DAVID and DACID.
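The mismatches described above can be checked with a short Python sketch. It assumes the widely used American Soundex rules (first letter plus three digits), which may differ from the variant a given system uses; under that assumption the codes are written with digits, such as J450, instead of the consonant-style keys JLN and JKN quoted earlier. The conclusion is the same: each pair of similar strings receives two different keys. The function and variable names are illustrative only.

```python
from typing import Dict, List, Tuple

# Assumed American Soundex digit groups; vowels, H, W and Y have no digit.
SOUNDEX_CODES: Dict[str, str] = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a 4-character American Soundex code (ASCII names assumed)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits: List[str] = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


# String pairs taken from the text above.
pairs: List[Tuple[str, str]] = [
    ("JULIANO", "JUKIANO"),
    ("JOHN", "DON"),
    ("DAVID", "DACID"),
]
for a, b in pairs:
    key_a, key_b = soundex(a), soundex(b)
    same_node = key_a == key_b
    print(f"{a} -> {key_a}, {b} -> {key_b}, same node: {same_node}")
# JULIANO -> J450, JUKIANO -> J250, same node: False
# JOHN -> J500, DON -> D500, same node: False
# DAVID -> D130, DACID -> D230, same node: False
```

Because Soundex keeps the first letter verbatim and encodes each remaining consonant by sound group, a single mistyped consonant (L for K, V for C) or a different initial letter (J for D) is enough to change the key and send the string to a different index node.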
Accordingly, what is needed in the art are improved systems and methods for indexing and querying databases that allow matching of data strings even when the data strings are not exactly equal.