1. Field of the Invention
The present invention relates to efficiently identifying approximate matches to a query string among a collection of attribute values in a relation, and more specifically to apparatuses and methods for identifying a top-k optimal final ranking of tuples that approximately match a query tuple.
2. Introduction
Data cleaning, defined as the task of identifying and correcting errors and inconsistencies in data, is an important process that has been at the center of research interest in recent years. A collection of attribute values in a relation is known as a tuple. Efficiently identifying approximate matches to a query string among a collection of attribute values in a relation is a key operation for effective data cleaning. Approximate-match techniques construct an ordering (ranking) of the relational tuples for the given query string based on different notions of approximate match, including edit distance (and variants thereof) and cosine similarity.
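By way of illustration only, the ranking of attribute values by edit distance described above may be sketched as follows. The function names `edit_distance` and `rank_by_edit_distance`, and the use of list positions as tuple identifiers, are assumptions made for this sketch and do not appear in the original text. The sketch uses the plain Levenshtein distance and scans every value; it does not reflect any efficient indexing scheme.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_by_edit_distance(query: str, values: list[str]) -> list[tuple[int, int]]:
    """Return (tuple_id, distance) pairs ordered from closest to farthest.

    Assumption: the tuple id is simply the position of the value in `values`.
    Ties are broken by tuple id so the ordering is deterministic.
    """
    scored = [(tid, edit_distance(query, v)) for tid, v in enumerate(values)]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))
```

For example, `rank_by_edit_distance("Jon Smith", ["John Smith", "Jane Smyth", "Jon Smith"])` places the exact match first, followed by the value that differs by a single inserted character.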
One of the end goals of effective data cleaning is to identify the relational tuple or tuples that are “most related” to a given query tuple. Since the query tuple may contain multiple attributes, issuing an approximate match operation for each attribute produces as many rankings of the relation tuples as there are attributes. Combining these rankings into a final ranking and returning a few highly ranked tuples to the application is a challenging task.
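To make the combination problem concrete, one simple baseline is sketched below: each tuple is scored by the sum of its positions across the per-attribute rankings, and the k tuples with the smallest totals are returned. This is an illustrative assumption only and is not presented as the method of the invention. The name `combine_rankings` is hypothetical, and the sketch assumes that every ranking lists the same tuple identifiers in full.

```python
import heapq


def combine_rankings(rankings: list[list[int]], k: int) -> list[int]:
    """Return the k tuple ids with the smallest total rank position.

    Each ranking lists tuple ids from best to worst. Assumption: every
    ranking is a full ranking over the same set of tuple ids.
    """
    total: dict[int, int] = {}
    for ranking in rankings:
        for position, tid in enumerate(ranking):
            total[tid] = total.get(tid, 0) + position
    # Smallest summed position first; ties are broken by tuple id.
    return heapq.nsmallest(k, total, key=lambda tid: (total[tid], tid))
```

For three attribute rankings `[[2, 0, 1], [0, 2, 1], [0, 1, 2]]` and k = 2, tuple 0 has total position 1 and tuple 2 has total position 3, so the sketch returns `[0, 2]`. A baseline of this kind reads every ranking in full, which illustrates why returning only a few top tuples efficiently is difficult.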