Data cleaning is the process of fixing inconsistencies in data before it is used for analysis. A common form of inconsistency arises when the same semantic entity has multiple representations in a data collection. For example, the same address could be encoded as different strings in different records of the collection. Multiple representations arise for a variety of reasons, such as misspellings and differing formatting conventions.
A string-similarity lookup is a useful primitive for fixing such representational data inconsistencies. A string-similarity lookup identifies all strings in a collection that are similar to a query string. For example, a City column of an unclean table that contains possibly misspelled city names can be cleaned by performing a string-similarity lookup against a reference table of city names with correct spellings. Alternatively, each city name in the unclean table could be replaced with the most similar city name in the reference table.
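The city-name example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity function (the ratio from the standard library's `difflib.SequenceMatcher`), the 0.8 threshold, the case-insensitive comparison, and the names `similarity_lookup` and `clean_value` are all choices made for this sketch and are not taken from the text. The lookup is a plain linear scan over the reference table, with no index.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; case-insensitive (an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_lookup(query: str, collection: list[str], threshold: float = 0.8) -> list[str]:
    """Return every string in `collection` whose similarity to `query`
    is at least `threshold`, most similar first (linear scan, no index)."""
    scored = [(similarity(query, s), s) for s in collection]
    return [s for score, s in sorted(scored, reverse=True) if score >= threshold]


def clean_value(value: str, reference: list[str], threshold: float = 0.8) -> str:
    """Replace `value` with its most similar reference string,
    or leave it unchanged when nothing is similar enough."""
    matches = similarity_lookup(value, reference, threshold)
    return matches[0] if matches else value


# Hypothetical reference table and unclean City column.
reference_cities = ["Seattle", "Portland", "San Francisco"]
unclean = ["Seatle", "Portlnd", "San Fransisco", "Boise"]

cleaned = [clean_value(city, reference_cities) for city in unclean]
print(cleaned)
```

With this threshold the three misspelled names map to their reference spellings, while "Boise", which has no close match in the reference table, is left as is. A real system would likely need a similarity function and threshold tuned to its data.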
String-similarity lookups are also useful in several other applications. Identifying documents that are textually similar to a given document is useful in many contexts, such as identifying mirrors on the Internet and copy detection. There is increasing interest in supporting large text data within database management systems, and string-similarity lookups are likely to occur naturally in such settings.
A primitive common to data cleaning and these other applications is the set-similarity lookup: identifying all sets in a collection that are similar to a given query set. However, conventional mechanisms for indexing such sets for efficient set-similarity lookups remain problematic.
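To make the set-similarity lookup concrete, the following sketch shows it as a brute-force scan. The text does not say which similarity function is intended; Jaccard similarity (intersection size over union size) is assumed here purely for illustration, as are the word-level tokenization, the 0.4 threshold, and the function names. The sketch defines what a lookup returns, not how an index would compute it efficiently.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not from the text)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def set_similarity_lookup(query: frozenset, collection: list, threshold: float) -> list:
    """Return every set in `collection` whose similarity to `query`
    is at least `threshold`. Scans the whole collection; no index."""
    return [s for s in collection if jaccard(query, s) >= threshold]


def tokens(text: str) -> frozenset:
    """Turn a string into a set of lowercase word tokens (illustrative choice)."""
    return frozenset(text.lower().split())


# Hypothetical address records represented as token sets.
collection = [
    tokens("100 Main Street Springfield"),
    tokens("100 Main St Springfield"),
    tokens("42 Elm Avenue Shelbyville"),
]
query = tokens("100 Main Street Springfield IL")

result = set_similarity_lookup(query, collection, threshold=0.4)
print(result)
```

The scan compares the query against every set, so its cost grows linearly with the size of the collection; that cost is what an index for set-similarity lookups would aim to avoid.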