The present invention relates generally to information retrieval, and more particularly to set similarity selection queries.
Due to the widespread popularity of the global Internet, information retrieval from databases has become a familiar practice for many users. Users search the global Internet for a wide range of information, from telephone numbers to automobile ratings to esoteric scientific data. In a search, a user issues a query (request for information) to a database containing stored information of interest. An information retrieval system then retrieves information relevant to the query. As a simple scenario, consider a user who wishes to find the telephone number of a specific person. The user issues a query, which contains the name of a specific person as input, to an information retrieval system. The information retrieval system then searches an electronic phonebook containing records matching people's names with their corresponding phone numbers. If the search is successful, the phone number of the specific person is retrieved and returned as output to the user.
In general, information retrieval is a complex process, due to both the nature of the query and the nature of the stored information. In many instances, a query may not fully define the information of interest. For example, a query often contains only a few keywords. The information may be spread across multiple records in multiple databases (consider the vast number of websites on the global Internet, for example). A principal function of an information retrieval system is to search through the databases and return only those records which are highly relevant to the query. It is desirable for an information retrieval system to be efficient (for example, to reduce required computer resources such as processor usage and memory) and to be fast (for example, to support near-real-time interactive sessions with a user). It is also desirable for an information retrieval system to have high accuracy (that is, to not miss relevant records and to not retrieve irrelevant records).
One issue which arises in information retrieval systems is the treatment of data inconsistencies. The causes of data inconsistencies may range from trivial (for example, typographical errors) to complex (for example, incompatible database formats). Data inconsistencies impact both the quality of the data stored in the databases and the effectiveness of information retrieval. Correcting errors in the databases is referred to as data cleaning. For example, a database may contain similar entries which are actually duplicates of the same entry (one of which has been mis-spelled or entered in a different format). Removing duplicates is an example of a data cleaning process. The data cleaning process, however, needs to minimize the probability of removing an entry which is similar to, but actually distinct from, another entry. Data cleaning may also be applied to the query.
Accommodating data inconsistencies in queries is likewise important for efficient retrieval of records which have a high probability of being relevant to a user query. Requiring an exact match between a term in a query and a term in the database may cause relevant information to be rejected. For example, a record pertaining to “autmobile” (mis-spelled entry) may have a high probability of being relevant to a query for “automobile”. Too loose a match, however, may result in an excessive number of irrelevant records being retrieved. For example, a match on “automatic” may yield records principally irrelevant to a query for “automobile”, with the exception of records pertaining to “automatic transmission”.
One key process used in information retrieval is set similarity selection, which determines when two sets of terms are similar enough to be of interest for data cleaning, information retrieval, or other user-defined applications. Various set similarity methods have been developed. In many instances, however, they are inefficient and slow. What is needed is a method and apparatus for set similarity selection which is efficient, fast, and accurate.
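By way of illustration only, the trade-off between an exact match and too loose a match may be sketched in a few lines of Python. The sketch below is not the method of the present invention. It assumes, purely for the purpose of example, that each term is represented as a set of overlapping three-character substrings and that similarity is measured by the Jaccard coefficient (size of the intersection divided by size of the union). The function names, the representation, the similarity measure, and the threshold values are all hypothetical choices that do not appear in the foregoing description.

```python
def qgrams(term, q=3):
    """Return the set of overlapping q-character substrings of a term.

    Representing a term as a set of q-grams is an assumption made
    for this sketch only.
    """
    term = term.lower()
    if len(term) < q:
        return {term}
    return {term[i:i + q] for i in range(len(term) - q + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_selection(query, records, threshold):
    """Return the records whose similarity to the query meets the threshold.

    This is a brute-force scan over every record, shown for clarity
    rather than efficiency.
    """
    query_set = qgrams(query)
    return [r for r in records if jaccard(query_set, qgrams(r)) >= threshold]


# Hypothetical stored terms taken from the examples in the text above.
records = ["automobile", "autmobile", "automatic"]

exact = similarity_selection("automobile", records, threshold=1.0)
moderate = similarity_selection("automobile", records, threshold=0.4)
loose = similarity_selection("automobile", records, threshold=0.2)
```

Under these assumptions, a threshold of 1.0 behaves as an exact match and rejects the mis-spelled entry, a moderate threshold retrieves the mis-spelled entry while rejecting “automatic”, and a low threshold retrieves all three terms. Because the sketch compares the query against every stored record, it also illustrates why a naive approach may be inefficient and slow when the database is large.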