Record linkage is the problem of identifying when two (or more) references to an object are referring to the same entity (i.e., the references are “co-referent”). One example of record linkage is identifying whether two paper citations (which may be in different styles and formats) refer to the same actual paper. Addressing the record linkage problem is important in a number of domains where multiple users, organizations, or authors may describe the same item using varied textual descriptions.
Historically, one of the most studied problems in record linkage is determining whether two database records for a person are referring to the same real-life individual. In applications from direct marketing to survey response (e.g., the U.S. Census), record linkage is often seen as an important step in data cleaning in order to avoid waste and maintain consistent data.
More recently, record linkage has become an issue in several web applications. For example, the task of determining whether two paper citations refer to the same true publication is an important problem in online systems for scholarly paper searches, such as CiteSeer (http://citeseer.ist.psu.edu) and Google Scholar (http://scholar.google.com).
A new record linkage problem—called product normalization—arises in online comparison shopping. Here, two different websites may sell the same product, but provide different descriptions of that product to a comparison shopping database. (Note: Records containing product descriptions are also called “offers” herein.) Variations in the comparison shopping database records can occur for a variety of reasons, including spelling errors, typographical errors, abbreviations, or different but equivalent descriptions that are used to describe the same product. For example, in online comparison shopping, shopping bots like Froogle (http://froogle.google.com) and MySimon (http://www.mysimon.com) merge heterogeneous data from multiple merchant websites into one product database. This combined product database is then used to provide one common access point for the customer to compare product specifications, pricing, shipping, and other information. In such cases, two websites may have two different product offers that refer to the same underlying product, e.g., “Canon ZR 65 MC Camcorder” and “Canon ZR65 Digital MiniDV Camcorder.”
Thus, a comparison shopping engine is faced with the record linkage problem of determining which such offers are referring to the same true underlying product. Solving this product normalization problem allows the shopping engine to display multiple offers for the same product to a user who is trying to determine from which vendor to purchase the product. Accurate product normalization is also important for data mining tasks, such as analysis of pricing trends.
In online comparison shopping, the number of vendors and the sheer number of products (with potentially very different characteristics) make it very difficult to manually craft a single function that can adequately determine if two arbitrary offers are for the same product. Moreover, for different categories of products, different similarity functions may be needed that capture the notion of equivalence for each category. Hence, a method and system that provide for efficient production and training of similarity functions between offers and/or between product categories is needed.
Furthermore, in many record linkage tasks, such as product normalization, the records to be linked actually contain multiple fields (e.g., product name, description, manufacturer, price, etc.). Such records may either come in a pre-structured form (e.g., XML or relational database records), or such fields may have been extracted from an underlying textual description. Hence, a method and system that provide for efficient production and training of similarity functions between offers with multiple fields is also needed.
Another consideration in record linkage problems like product normalization is the fact that new data is continuously becoming available. As a result, a learning approach to the linkage problem in such settings should be able to readily use new training data without having to retrain on previously seen data.
Thus, it would be highly desirable to develop methods and systems that efficiently produce and train composite similarity functions for record linkage problems, including product normalization problems.