The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
Search engines, such as Google, Bing, and others, search and index vast quantities of information on the Internet. “Crawlers” (a.k.a. “spiders”) utilize URIs obtained from a “queue” to obtain content, usually from webpages. The crawlers or other software store and index some of the content. Users can then search the indexed content, view results, and follow hyperlinks back to the original source or to the stored content (the stored content often being referred to as a “cache”). Computing resources to crawl and index, however, are not limitless. The URI queues are commonly prioritized to direct crawler resources to webpage servers which can accommodate the traffic, which do not block crawlers (such as according to “robots.txt” files commonly available from webpage servers), which experience greater traffic from users, and which experience more change in content.
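By way of illustration only, the queue prioritization described above can be sketched as a priority queue keyed on the four factors named (server capacity, crawler permission, user traffic, and rate of content change). The function names, the input tuple layout, the example URIs, and the weights below are all hypothetical assumptions chosen for this sketch; they do not describe how any particular search engine scores URIs.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueueEntry:
    priority: float                  # negated score, so the best URI pops first
    uri: str = field(compare=False)  # not used when ordering entries


def score(can_handle_traffic, allows_crawlers, user_traffic, change_rate):
    """Hypothetical score; higher means crawl sooner. Weights are illustrative."""
    if not allows_crawlers:
        return None  # e.g. disallowed by robots.txt: never enqueued
    capacity_bonus = 1.0 if can_handle_traffic else 0.0
    return capacity_bonus + 0.5 * user_traffic + 0.5 * change_rate


def build_queue(candidates):
    """Build a heap from (uri, capacity, allowed, traffic, change) tuples."""
    heap = []
    for uri, capacity, allowed, traffic, change in candidates:
        s = score(capacity, allowed, traffic, change)
        if s is not None:
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(heap, QueueEntry(-s, uri))
    return heap


def crawl_order(heap):
    """Return URIs in the order a crawler would dequeue them."""
    heap = list(heap)  # a shallow copy of a heap is still a valid heap
    order = []
    while heap:
        order.append(heapq.heappop(heap).uri)
    return order


# Made-up candidates: traffic and change rate are normalized to 0..1.
CANDIDATES = [
    ("https://a.example/news", True, True, 0.9, 0.8),
    ("https://b.example/shoes", True, True, 0.2, 0.1),
    ("https://c.example/private", True, False, 0.9, 0.9),
    ("https://d.example/slow", False, True, 0.5, 0.5),
]

if __name__ == "__main__":
    for uri in crawl_order(build_queue(CANDIDATES)):
        print(uri)
```

Under these assumed weights, a page whose server blocks crawlers is never queued, and a frequently changing, heavily visited page is dequeued ahead of a rarely changing product page, regardless of whether a price on that product page has changed.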
Conventional search engines, however, are not focused on price and product information. If a price changes on a webpage, but the rest of the webpage remains the same, traditional crawlers (or the queue manager) will not prioritize the webpage's position in the queue, generally because the price is a tiny fraction of the overall content and the change is not labeled as being significant. Conversely, if the webpage changes, but the price and/or product information remains the same, the change in webpage content may cause a traditional crawler to prioritize the webpage's position in the queue due to the overall change in content, notwithstanding that the price and product information remained the same.
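The mismatch described above can be illustrated with a minimal sketch that compares a whole-page change measure against a price-specific check. The sketch assumes a hypothetical page whose price appears as a dollar amount matched by a simple regular expression, and it uses the Python standard library's difflib.SequenceMatcher as a stand-in for whatever change measure a conventional crawler might apply; the sample page text and the function names are invented for illustration.

```python
import difflib
import re

# Assumption: the price appears as a dollar amount such as "$89.99".
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_price(page_text):
    """Return the first dollar amount found in the text, or None."""
    match = PRICE_RE.search(page_text)
    return float(match.group(1)) if match else None


def overall_change(old, new):
    """Fraction of content that differs: 0.0 (identical) up to 1.0."""
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()


def compare(old, new):
    """Report both the whole-page signal and the price-specific signal."""
    return {
        "overall_change": overall_change(old, new),
        "price_changed": extract_price(old) != extract_price(new),
    }


# Invented sample pages.
OLD_PAGE = (
    "Men's running shoe. Lightweight mesh upper, cushioned sole, "
    "free returns. Price: $89.99"
)
# Only the price differs.
PRICE_ONLY = OLD_PAGE.replace("$89.99", "$74.99")
# Much of the page differs, but the price is the same.
REDESIGN = (
    "NEW LOOK! Summer sale banners, updated navigation and reviews. "
    "Men's running shoe. Price: $89.99"
)

if __name__ == "__main__":
    print(compare(OLD_PAGE, PRICE_ONLY))
    print(compare(OLD_PAGE, REDESIGN))
```

In this sketch the price-only edit produces a very small whole-page change value even though the price changed, while the redesigned page produces a large change value even though the price did not; a queue driven solely by the whole-page value would rank the two pages in the opposite order from one focused on price.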
Conventional search engines, if presented with a query, will find corresponding products. For example, it is possible to search for “men's shoes” and to then be presented with a webpage comprising search results drawn from hundreds of thousands of webpages for men's shoes. The search results may further be narrowed by category of men's shoes, brand, and store. Search engines have been incorporated into online stores, wherein a user may search for products by keyword and/or by category, and results can be ordered by price.
Price history, however, is viewed only narrowly and, when it is viewed, never in the context of a rich attribute set which explores, in detail, which attributes are associated with changes in price. Price histories are not made available in real time, and do not allow intricate comparisons based on stores, merchants, brands, regions, time/date, and other dimensions.
When product and price data is obtained from a large number of webpages, when the webpages contain a large number of records, and when data from the large number of records is processed to discover product and price relationships which can only be teased out via data sets encompassing large swaths of economic activity, batch-based data ingestion and indexing processes which occur across days, together with cascading analytic dependencies, will introduce delays. Such delays prevent the resulting corpus from being searched in close-to-real time. Customers who desire to have new webpages searched and to benefit from discovering product and price relationships will be frustrated by batch-process and cascading-dependency delays; such customers will have reduced confidence that product and price relationships are up-to-date.