1. Field of Art
The present invention generally relates to the field of digital information processing, and more specifically, to identifying and extracting information of interest from web pages.
2. Background of the Invention
Users rely on search engines and other information retrieval systems, such as the one provided by GOOGLE, to provide comprehensive and accurate information that is of relevance to them. One important type of information relates to the availability of products and services (hereinafter collectively referred to simply as “products”). For example, users often explicitly submit queries for particular products of interest and wish to see information about the products, such as pictures, reviews, prices, availability, and the like. In other contexts, an information retrieval system may present the user with product information related to other information that the user is viewing. For example, if a user enters a query about digital photography via a search engine, the search engine might include advertisements for various digital cameras as part of the provided search results.
However, in order for an information retrieval system to provide comprehensive and accurate product information to users, the system must have up-to-date, accurate information on a wide range of products. Since key product information, such as price and availability, is known only to the merchant selling the product or service, the information retrieval system must have a way of obtaining the product information from the various merchants.
Some merchants provide certain information retrieval systems with updates (also known as “feeds”) regarding their various products, including the products' titles, prices, and quantities in stock. Unfortunately, relying on merchants for such updates leaves a number of problems unsolved. First, the updates are often too infrequent to quickly account for changes in product information. Thus, after a change in product price, for example, the information retrieval system will continue to report the old, out-of-date price until the next update for the product is received from the product merchant. Similarly, out-of-stock items may incorrectly be reported as in-stock (or vice versa). This leads to user dissatisfaction when users discover that, contrary to what the information retrieval system reported, they cannot purchase the product at the listed price, or cannot purchase it at all. Second, the information provided in the updates can itself be inaccurate. For example, the updates may be generated manually by employees of the merchants, rather than automatically by a program, leading to inadvertent (or possibly intentional) inaccuracies. Third, merchants often provide updates on only a small subset of their products, and thus the information retrieval system gains no information at all about the remainder of the products.
Merchants typically store the up-to-date product information in their own product databases and use it to automatically generate product web pages as part of their own web sites. Users can use these product web pages to view detailed information about the products, read reviews, purchase the products, and the like. Among the detailed product information provided by the product web pages is the information of interest to the information retrieval system, such as product title, price, and availability. However, it is very difficult to automatically identify the information of interest from amongst all the other information presented by the product page. Even for a particular distinctive type of information, such as numerical price information, a given page typically presents a number of prices, such as prices for related products or non-discounted list prices, making it difficult to identify the actual price of the product of interest. To circumvent this difficulty, humans designing an information retrieval system may manually study the product web pages of a particular merchant to identify unique characteristics of information of interest. However, such manual analysis is expensive and time-consuming and has very limited utility. For example, manual analysis can at best address only a small number of merchants relative to the vast number of distinct merchants offering products, given the time required to analyze the product web pages of each merchant. Additionally, merchants may frequently alter the way that they present information on their product web pages, rendering the prior manual analysis obsolete and requiring completely new analysis.
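By way of illustration only, the difficulty of picking out the actual price can be seen in a minimal Python sketch. The page markup, the constant SAMPLE_PAGE, the pattern PRICE_PATTERN, and the function find_price_candidates are all hypothetical names and data invented for this example; they are not taken from any merchant's site or from any particular system. The sketch applies a naive dollar-amount pattern to a small product page and shows that it returns several candidates, only one of which is the price of the product of interest.

```python
import re

# Hypothetical product page markup, invented for this illustration only.
SAMPLE_PAGE = """
<h1>Acme X100 Digital Camera</h1>
<span class="list">List price: $349.99</span>
<span class="sale">Our price: $299.99</span>
<div class="related">
  <a href="/x200">Acme X200 Camera</a> $449.00
  <a href="/case">Camera Case</a> $19.95
</div>
<p>Free shipping on orders over $50</p>
"""

# Naive pattern (an assumption of this sketch): a dollar sign, digits with
# optional thousands separators, and optional two-digit cents.
PRICE_PATTERN = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def find_price_candidates(html):
    """Return every dollar amount found in the page, in document order."""
    return [float(m.replace(",", "")) for m in PRICE_PATTERN.findall(html)]


candidates = find_price_candidates(SAMPLE_PAGE)
print(candidates)  # [349.99, 299.99, 449.0, 19.95, 50.0]
```

In this hypothetical page, the pattern alone yields five candidates: a list price, the sale price, two related-product prices, and a shipping threshold. Nothing in the numeric text itself distinguishes the actual price (299.99) from the others.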