1. Technical Field
This disclosure relates to methods and systems supporting online searching and transactions. More particularly, the present disclosure relates to identification of near duplicate user-generated content in a networked system.
2. Related Art
Electronic shopping systems currently exist which permit merchants to sell inventory to consumers over a computer network. Merchants now use computers to publish information about their products on one or more electronic pages (e.g., text and graphics displayable on a computer screen) and to elicit product orders from consumers. Likewise, consumers use computers to access information describing products and to communicate orders to a merchant.
With the increasing popularity and accessibility of the Internet, and particularly the World Wide Web, the number of merchants using and desiring to use the World Wide Web to advertise and sell products is growing rapidly. The World Wide Web is a global information system in which information is exchanged over the Internet using a set of standard protocols. An existing Web-based electronic store typically comprises a collection of Web pages which describe inventory (e.g., listings) and which include on-line forms allowing consumers to place orders or bids. Consumers use Web browsers to access the Web pages of electronic stores to examine information about available products and/or services (e.g., listings) and to submit product/service orders.
Merchants attempt to accurately describe their products or services in listings so the listings will be found by a high percentage of potential buyers who may be searching for similar products using network search engines. However, sellers often do not describe their offerings in a manner that maximizes their exposure to a large number of buyers. Further, on-line searching can be complicated by the large number of sellers, large number of product/service offerings, and the rapidly changing e-commerce marketplace. Sometimes, sellers may erroneously or intentionally post listings that are duplicates or near duplicates of existing listings to gain greater exposure without paying for the additional listings. These problems can also be encountered in other forms of user-generated content such as forums, blog comments, product reviews, and the like.
U.S. Pat. No. 6,484,149 describes a system and method for designing and operating an electronic store to (1) permit a merchant to organize and advertise descriptions of product inventory over the Internet, (2) permit Web page information to be extracted on-demand from a product inventory database, and (3) permit Web pages to be automatically customized to fit shopping behaviors of individual consumers. A graphical store design user interface of a Web browser displays a hierarchical representation of products and product groups of an electronic store. A user manipulates icons of the Web browser store design user interface to cause a Web server to modify relationships between products and product groups stored in a product information database. A store designer creates HTML template files, embeds database and customize references within the template files, and assigns template files to groups or products of the electronic store.
U.S. Pat. No. 6,038,668 describes a networked catalog search, retrieval, and information correlation and matching system. The system allows suppliers to publish information in electronic catalogs and to structure the information in an object-oriented representation distributed across a network of computers, for example, the Internet. The system also enables customers to search and retrieve information on products and suppliers which match dynamically specified customer requirements. Through retrieving compliant HTML pages, a search engine forwards retrieved pages to an object-oriented database which sorts received information by the information's internal organization structure. By searching the information as stored in the knowledge base, a user may quickly retrieve the stored information as highly tailored to the user's search strategy.
Thus, a computer-implemented system and method for identification of near duplicate user-generated content in a networked system are needed.