Set-based comparisons may be useful primitives for supporting a wide variety of similarity functions in textual record matching. Various techniques have been proposed to improve the performance of set similarity lookups, but these techniques focus almost exclusively on symmetric notions of set similarity. Asymmetric notions of set similarity, however, may provide a useful tool for indexing string sets, an important component of textual record matching.
Examples of asymmetric measures of set similarity may include Jaccard containment. Jaccard containment alone may not be an efficient measure of similarity for longer textual strings; however, string transformations may allow Jaccard containment to measure similarity effectively for longer strings. String transformations may also provide a programmable level of tolerance for errors in an input query set. Additionally, a well-organized data structure, such as an inverted index, may provide greater efficiency for lookups based on an input query set.
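As an illustration only, the following minimal Python sketch shows one way these ideas may fit together. It assumes the common unweighted definition of Jaccard containment, |Q ∩ R| / |Q|, for a query token set Q and a record token set R; the text above does not fix a particular formula or weighting. The whitespace tokenizer, the function names, and the token-level rule table (`rules`) are hypothetical and are not taken from the text.

```python
from collections import defaultdict


def tokenize(text):
    """Split a string into a set of lowercase whitespace-delimited tokens."""
    return set(text.lower().split())


def jaccard_containment(query, record):
    """Assumed definition: fraction of query tokens found in the record."""
    if not query:
        return 0.0
    return len(query & record) / len(query)


def containment_with_transformations(query, record, rules):
    """Count a query token as matched if it, or any token that a
    (hypothetical) transformation rule maps it to, occurs in the record."""
    if not query:
        return 0.0
    matched = 0
    for tok in query:
        alternatives = {tok} | rules.get(tok, set())
        if alternatives & record:
            matched += 1
    return matched / len(query)


def build_inverted_index(records):
    """Map each token to the ids (list positions) of records containing it."""
    index = defaultdict(set)
    for rid, tokens in enumerate(records):
        for tok in tokens:
            index[tok].add(rid)
    return index


def lookup(query, index, threshold):
    """Return (record id, containment) pairs meeting the threshold,
    touching only the posting lists of the query's own tokens."""
    if not query:
        return []
    overlap = defaultdict(int)
    for tok in query:
        for rid in index.get(tok, ()):
            overlap[rid] += 1
    hits = [(rid, n / len(query)) for rid, n in overlap.items()
            if n / len(query) >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    records = [tokenize("100 Main Street Springfield Illinois"),
               tokenize("200 Main Street Shelbyville")]
    query = tokenize("main st springfield")
    index = build_inverted_index(records)
    print(lookup(query, index, threshold=0.6))
    print(containment_with_transformations(query, records[0],
                                           {"st": {"street"}}))
```

In this sketch the measure is asymmetric because the denominator is the size of the query set alone, so the containment of a short query in a long record may be high even when the reverse containment is low. The rule table stands in for string transformations: allowing "st" to match "street" raises the containment of the example query in the first record, and the threshold passed to `lookup` plays the role of an adjustable error tolerance.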