The World Wide Web of interconnected web pages can be viewed as a gigantic group of connected databases. Each page contains some bit of interesting, or not-so-interesting, information that web surfers can stumble upon and read. Internet search engines use spiders to crawl across the web, jumping from page to page by following the hyperlinks that connect the pages and that form the web. Such search engines may then index the content on the pages so that user requests for similar content, or concepts, can be served by returning the matching pages in a list of search results.
The information on web pages, however, is generally unstructured. In other words, it typically is not formatted in a single database table in which every column has a well-defined field and every row represents a data item. Rather, web information is formatted for reading, like a book, and not necessarily for analysis, like a spreadsheet. As a result, it can be difficult to separate the data in a web page from the formatting and the text that is used to explain the data. Moreover, when the data is spread across multiple pages, users may be required to view numerous web pages in order to understand a full data set that has many pieces of data. In addition, search engines may have a more difficult time classifying the data when it is in an unstructured form and spread across multiple pages.
For example, pricing data for goods on web pages may be split up so that one good appears on each web page of a web site. Such organization may make it difficult to compare prices between multiple goods (e.g., different brands of pens on an office supply web site).
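The difficulty described above can be illustrated with a minimal sketch. It assumes three hypothetical product pages, each holding one pen, with the name in an `h1` element and the price in a `span` of class `price`. The page markup, the product names, the prices, and the helper names `ProductParser` and `build_table` are all invented for illustration and are not drawn from any real site. The sketch pulls the two data fields out of the surrounding explanatory text and gathers them into a single table that can be sorted by price.

```python
from html.parser import HTMLParser

# Hypothetical pages (assumption): one product per page, with the data
# embedded in markup and prose that is written for reading.
PAGES = [
    '<html><body><h1>Acme Gel Pen</h1>'
    '<p>A smooth writer. Only <span class="price">$2.49</span> each!</p></body></html>',
    '<html><body><h1>Bolt Ballpoint</h1>'
    '<p>Our best seller at <span class="price">$1.99</span>.</p></body></html>',
    '<html><body><h1>Crest Fountain Pen</h1>'
    '<p>A classic for <span class="price">$14.00</span>.</p></body></html>',
]


class ProductParser(HTMLParser):
    """Collects the product name and price from one page."""

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being read, if any
        self.record = {}     # one structured row per page

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._field = "name"
        elif tag == "span" and ("class", "price") in attrs:
            self._field = "price"

    def handle_data(self, data):
        if self._field == "name":
            self.record["name"] = data.strip()
        elif self._field == "price":
            # Drop the currency symbol so prices compare numerically.
            self.record["price"] = float(data.strip().lstrip("$"))

    def handle_endtag(self, tag):
        if tag in ("h1", "span"):
            self._field = None


def build_table(pages):
    """Turns one-product-per-page markup into rows sorted by price."""
    rows = []
    for page in pages:
        parser = ProductParser()
        parser.feed(page)
        parser.close()
        rows.append(parser.record)
    return sorted(rows, key=lambda row: row["price"])


if __name__ == "__main__":
    for row in build_table(PAGES):
        print(f'{row["name"]:<20} {row["price"]:>6.2f}')
```

Once the rows sit in one table, comparing brands is a single sort. The sketch works only because every hypothetical page uses the same markup; real pages vary in layout, which is the source of the difficulty described above.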