One aspect associated with the widespread usage of networks generally, and the Internet particularly, has been the emergence of electronic marketplaces. An electronic marketplace is typically a network site that offers a consistent, seemingly united, electronic storefront to networked consumers. Typically, the electronic marketplace is hosted on the Internet as one or more Web pages, and viewed by a consumer via a networked computer. FIG. 1 is a pictorial diagram of an illustrative networked environment 100 that includes an electronic marketplace. In particular, the networked environment 100 includes a host server 102 that hosts the electronic marketplace 104. As indicated above, a typical electronic marketplace is comprised of one or more Web pages that are viewable on a consumer's computer via a Web browser. However, for illustration purposes, the electronic marketplace 104 is shown in FIG. 1 as residing “outside” of a consumer computer. Consumer computers, such as consumer computers 106-108, connect to the host server 102 to access the electronic marketplace 104 via a network 110, such as, but not limited to, the Internet. The electronic marketplace 104 allows consumers, via their consumer computers 106-108, to view and purchase items offered for sale or lease on the electronic marketplace.
In many instances, an electronic marketplace 104 includes items from many different vendors or suppliers. For example, as shown in FIG. 1, the electronic marketplace 104 offers items from vendors 112 and 114. Still further, such electronic marketplaces often allow individuals to offer both new and used items to consumers via the electronic marketplace. To do so, the vendors/suppliers 112-114, as well as consumers, such as a consumer operating consumer computer 108, provide descriptions of the products to be offered on the electronic marketplace 104 to the host server 102. The illustrated descriptions include descriptions 120-124.
Naturally, if an item is offered through the electronic marketplace 104, all instances of that item from all vendors should be displayed to the consumer as various options of the same item rather than as individual items that are viewed separately. Unfortunately, since individual vendors and consumer/sellers provide the host server 102 with their own descriptions of the products that they wish to sell, it becomes an onerous, manual task to determine which product descriptions reference the same items and which reference different items. For example, FIGS. 2A-2C present illustrative product description documents submitted by two separate vendors. As suggested by the illustration, document 202 of FIG. 2A is a structured or fielded document with information organized in fields, such as manufacturer 204, model number 206, screen size 208, case color 210, and a brief description 212. Document 220 of FIG. 2B is not structured or fielded, but is rather a free-form paragraph description (typical of product descriptions provided by consumers) that includes important information. Upon inspection of documents 202 and 220, a person familiar with the subject matter of laptops (or even one not quite so familiar) is likely to recognize that the two documents describe the same product. In other words, a person would recognize that the manufacturer (“HP”) identified in the manufacturer field 204 and the name “Hewlett Packard” in text area 222 refer to the same manufacturer. Similarly, a person would likely recognize that the case color “BLK/SLVR” in the case color field 210 is an abbreviation for “Black/Silver” as recited in full in text area 224. From comparisons of other terms/fields, while not necessarily finding a letter-perfect match, a person would recognize the two documents as being substantially similar, i.e., describing the same or substantially the same product or subject matter.
Moreover, if these descriptions were properly identified as duplicates (i.e., that the subject matter described by both documents is the same), the host server 102 could group them together as descriptions of a single product item.
Document 230 of FIG. 2C is a structured document and includes fields that are very similar to those of document 202. However, in contrast to document 202 (and to document 220), document 230 contains certain differences that a person would likely recognize as indicating a different product. For example, the case color field 232 recites “BLK/SLVR/GLD,” adding a further color to the case. Additionally, the product description 234 includes additional language, “limited edition,” in text area 236 that would indicate that this laptop, in contrast to the one described in document 202, is somewhat different (i.e., a limited edition version) and not a duplicate.
Unfortunately, while a person can be trained to discern whether two product descriptions are duplicates, it is difficult for a computer to programmatically analyze two documents to determine whether or not they are duplicates (i.e., whether or not they describe the same product). Clearly, this problem is exacerbated when the number of products offered by an electronic marketplace 104 (originating from a myriad of vendors) is measured in hundreds of thousands or more.
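By way of illustration only, the difficulty described above can be seen in a minimal sketch that applies two naive comparisons to the quoted field values. The dictionaries below are hypothetical stand-ins for documents 202, 220, and 230: the field names, the assumption that the free-form text of document 220 has already been reduced to field values, and the assumption that document 230 recites “HP” as its manufacturer are not taken from the figures. The functions `exact_match` and `token_overlap` are likewise illustrative and are not a method disclosed herein.

```python
import re

# Hypothetical field values loosely modeled on documents 202, 220, and 230.
# Field names and the manufacturer of doc_230 are assumptions for illustration.
doc_202 = {"manufacturer": "HP", "case_color": "BLK/SLVR"}
doc_220 = {"manufacturer": "Hewlett Packard", "case_color": "Black/Silver"}
doc_230 = {"manufacturer": "HP", "case_color": "BLK/SLVR/GLD"}


def exact_match(a, b):
    """Return True only if every field is character-for-character equal."""
    return a == b


def tokens(doc):
    """Collect lowercase alphanumeric tokens from every field value."""
    return {t for v in doc.values() for t in re.findall(r"[a-z0-9]+", v.lower())}


def token_overlap(a, b):
    """Jaccard overlap (shared tokens / all tokens) of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# The true duplicates (202 and 220) share no literal tokens at all.
print(exact_match(doc_202, doc_220))    # False
print(token_overlap(doc_202, doc_220))  # 0.0

# The non-duplicate (230) scores far higher against 202.
print(token_overlap(doc_202, doc_230))  # 0.75
```

Under these assumptions, both naive comparisons rank the limited edition description of document 230 as closer to document 202 than the genuine duplicate of document 220, which is the opposite of the conclusion a person would reach.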