An online multi-merchant electronic marketplace is a virtual location where multiple merchants compete in selling a variety of products and services. Products and services for sale via the electronic marketplace are usually described in documents (product descriptions) submitted by the various merchants. FIG. 1 shows an illustrative network environment 100 of a hosted electronic marketplace 102 where multiple merchants 104-106 can offer products and services for sale to consumers connecting to the electronic marketplace via user computers 108-110 over a network 112. The consumer will typically browse the various products and/or services using a browser, which displays the corresponding product descriptions. These product descriptions, such as product descriptions 114-116, are provided by the merchants 104-106 to the electronic marketplace over the network 112. Typically, the electronic marketplace 102 will store the product descriptions 114-116 in a data store (not shown) that is referred to hereafter as a document corpus (i.e., a body of product descriptions/documents).
With regard to the product descriptions submitted by the merchants to the electronic marketplace, these documents are typically structured in a manner such that the electronic marketplace can identify and extract relevant information in order to categorize and display the information to consumers. Included in the structured data is typically a collection of attribute/value pairs according to the type of product or service for sale. For example, books are described by their ISBN, title, contributors, publication date, binding, publisher, volume, edition, and several other attributes, each of these attributes forming attribute/value pairs.
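By way of illustration only, a structured product description of this kind can be sketched as a simple mapping of attributes to values. The sketch below is an assumption made for explanatory purposes: the variable name book_description, the helper attribute_value_pairs, and every value shown are hypothetical placeholders and are not taken from any actual merchant submission or marketplace format.

```python
# Hypothetical book description. The field names follow the attributes
# listed in the text; every value is a made-up placeholder.
book_description = {
    "isbn": "0000000000",
    "title": "An Example Title",
    "contributors": ["A. Author"],
    "publication date": "2000-01-01",
    "binding": "paperback",
    "publisher": "Example Press",
    "volume": "1",
    "edition": "2",
}


def attribute_value_pairs(description):
    """Return the description as (attribute, value) pairs sorted by attribute name."""
    return sorted(description.items(), key=lambda pair: pair[0])


# Print each attribute/value pair on its own line.
for attribute, value in attribute_value_pairs(book_description):
    print(f"{attribute}: {value}")
```

Representing the description as a mapping keeps each value addressable by its attribute name, which is what allows a marketplace to categorize and display individual fields.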
While merchants provide attribute/value information regarding the products or services that they want to sell, two different merchants will seldom agree on a common set of attribute/value pairs regarding the same product. Moreover, even when they seemingly provide the same information, the content, data, and/or semantics of the various attributes can vary widely. For example, while many merchants provide a “part number,” one merchant may choose to provide the part number of a product in a “part number” attribute field, a second merchant might provide the part number for the very same product in a “catalog number” attribute field, and a third merchant might provide the part number in a “model number” attribute field. Further still, a common source of inconsistency among product descriptions of the same product from different merchants relates to the “manufacturer name” and “brand name” attribute fields. Simply stated, merchants differ substantially in what they place in these attribute fields. In short, attribute fields may be used similarly or synonymously by some merchants and used to mean two widely different things by other merchants.
Another common source of inconsistency is the “title” attribute, which is meant to serve as a short description of the product. Indeed, merchants often associate different semantics with this attribute. Some merchants construct the “title” attribute field using the brand name, the part number, and a noun phrase describing the product, such as “Sanitaire SC684 Upright Vacuum Cleaner.” Other merchants will omit the brand and part number information in the “title” attribute field, but use it instead to provide information about salient features of the product.
Clearly, it is desirable for an electronic marketplace 102 to match product descriptions of a first merchant to product descriptions of a second merchant when they describe the same product (or service). Indeed, when a consumer (via a user computer 108) browses in the electronic marketplace 102 in search of “Item X,” all instances of Item X should be available to the user from a single display location. This requires that the electronic marketplace 102 identify “duplicate” product descriptions from multiple merchants. By “duplicate” it is meant that a first product description describes the same or substantially the same product as described in a second product description. Unfortunately, given such inconsistencies between merchants in regard to the information describing a product or service in a product description, any service that attempts to establish similarity between two product descriptions on the basis of a strict comparison of their attribute fields will perform very poorly in identifying those documents that are duplicates. The ability to identify the documents that are (at least potentially) duplicates is referred to as “recall.” On the positive side, strict attribute field comparisons will yield very accurate results, i.e., the potential duplicates that are identified will likely be true duplicates. The degree to which the identified potential duplicates are true duplicates is referred to as “precision.”
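The trade-off described above can be illustrated with a minimal sketch, offered under stated assumptions rather than as any particular marketplace's implementation. The helpers strict_field_match and precision_recall, the three labeled descriptions, and the choice of “part number” as the compared field are all hypothetical. Precision and recall are computed using their conventional definitions (true positives divided by predicted pairs, and true positives divided by actual duplicate pairs, respectively).

```python
from itertools import combinations


def strict_field_match(desc_a, desc_b, fields):
    """True only if every listed field is present in both descriptions with equal values."""
    return all(
        field in desc_a and field in desc_b and desc_a[field] == desc_b[field]
        for field in fields
    )


def precision_recall(predicted, actual):
    """Conventional precision and recall over sets of candidate duplicate pairs.

    Assumption: an empty denominator is reported as 1.0.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical descriptions of the same product from three merchants.
# Merchant B places the part number in a "catalog number" field.
descriptions = {
    "A": {"part number": "SC684"},
    "B": {"catalog number": "SC684"},
    "C": {"part number": "SC684"},
}

# All three describe the same product, so every pair is a true duplicate.
actual = {("A", "B"), ("A", "C"), ("B", "C")}

# Pairs flagged as duplicates by a strict comparison of the "part number" field.
predicted = {
    (x, y)
    for x, y in combinations(sorted(descriptions), 2)
    if strict_field_match(descriptions[x], descriptions[y], ["part number"])
}

precision, recall = precision_recall(predicted, actual)
print(predicted, precision, recall)
```

In this sketch only the pair (A, C) is flagged, so every flagged pair is a true duplicate, while two of the three true duplicate pairs are missed because merchant B used a different attribute field.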
In contrast to simple attribute field comparisons, completely ignoring structure (in particular, the attribute/value pairs) and comparing all terms in one product description to those in another “solves” the issue of poor recall. Examples of systems employing a non-fielded comparison between two documents are described in commonly owned and co-pending U.S. patent application Ser. No. 11/754,237, filed May 25, 2007, entitled Duplicate Entry Detection System and Method, and U.S. patent application Ser. No. 11/754,241, filed May 25, 2007, entitled Generating Similarity Scores for Non-Identical Character Strings, which are incorporated by reference. However, completely disregarding the structure information in product descriptions diminishes the precision of a comparison engine.
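A non-fielded comparison of this general kind can be sketched as follows. This is not the method of the applications referenced above, whose details are not reproduced here. It is a stand-in that flattens each description into a set of terms and scores the overlap with Jaccard similarity. The helpers tokens and jaccard, the sample descriptions, and the part number “SC999” are hypothetical.

```python
import re


def tokens(description):
    """Flatten all attribute values into a set of lowercase terms, ignoring field names."""
    text = " ".join(str(value) for value in description.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(desc_a, desc_b):
    """Jaccard similarity of the two term sets (1.0 when both are empty)."""
    terms_a, terms_b = tokens(desc_a), tokens(desc_b)
    if not terms_a and not terms_b:
        return 1.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Same product, with the part number placed in different attribute fields.
merchant_a = {"title": "Sanitaire SC684 Upright Vacuum Cleaner", "part number": "SC684"}
merchant_b = {"title": "Sanitaire Upright Vacuum Cleaner", "catalog number": "SC684"}

# A different product (hypothetical part number) that shares most of its terms.
other_product = {"title": "Sanitaire SC999 Upright Vacuum Cleaner", "part number": "SC999"}

print(jaccard(merchant_a, merchant_b))
print(jaccard(merchant_a, other_product))
```

Because field names are ignored, the two descriptions of the same product share all of their terms even though the part number sits in different fields, which addresses the recall problem. The different product nevertheless shares four of six distinct terms with merchant A's description, so a permissive similarity threshold would flag it as a duplicate, illustrating the loss of precision.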