The World Wide Web contains billions of pages of freely available information, such as airplane arrival times, baseball statistics, and product descriptions. However, much of that information is embedded in running prose intended for human readers, so extracting it usually falls to a person. A human is best equipped, for example, to locate the information on a Web page, to give it a data type (whether “1938” is a calendar year, the price of a product, or an airline flight number), and to relate it to other data (“this picture located here depicts that product located there”). This manual process is time-intensive and error-prone.
There are currently two ways to extract data automatically from a Web page, a process called “Web scraping”. First, every Web page contains hidden mark-ups for formatting, such as boldface and italics. Theoretically, these mark-ups can help a computer algorithm locate information on a page. Unfortunately, every Web site has a different look and feel, so each Web site needs its own custom algorithm. Writing a custom algorithm is time-intensive, but feasible on a small scale, such as a price comparison Web site that gathers product information from a dozen sources. There is no efficient way to scale this approach up to thousands or millions of Web sites, because that would require writing thousands or millions of custom algorithms.
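The dependence on each site's formatting can be illustrated with a minimal sketch. It assumes two hypothetical pages, `SITE_A` and `SITE_B`, that present the same flight information but apply boldface to different fields; the class and function names are illustrative only and do not come from any particular scraping tool.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Collects the text found inside <b> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <b> elements are currently open
        self.found = []   # text fragments seen while inside <b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.found.append(data.strip())


def extract_bold(html):
    """Return every boldface text fragment in the given HTML string."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical pages: same information, different formatting conventions.
SITE_A = "<p>Flight <b>1938</b> arrives at <i>14:05</i>.</p>"
SITE_B = "<p>Flight <i>1938</i> arrives at <b>14:05</b>.</p>"

print(extract_bold(SITE_A))  # the flight number on site A
print(extract_bold(SITE_B))  # the arrival time on site B
```

The rule "the flight number is the boldface text" holds for the first page and silently returns the wrong field for the second, which is why a rule written for one site cannot simply be reused on another.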
The second method requires the owner and developer of each Web site to add hidden mark-ups that specifically designate information and its data type. The preferred technology for this is XML. Unfortunately, very few Web sites are built this way, and there are no standardized terms for XML usage. It is a little like saying that if only everyone spoke Esperanto, there would be no translation problems. This is true in theory, but hopelessly impractical.
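As a rough sketch of this second method, the snippet below reads typed values from two hypothetical XML fragments. The element names, the `type` attribute, and both vocabularies are assumptions made up for illustration; they do not correspond to any published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up from one publisher.
SITE_A_XML = """<flight>
  <number type="flight-number">1938</number>
  <arrival type="time">14:05</arrival>
</flight>"""

# Hypothetical mark-up from another publisher describing the same facts
# with different terms.
SITE_B_XML = """<trip>
  <flightNo kind="id">1938</flightNo>
  <eta kind="clock">14:05</eta>
</trip>"""


def typed_values(xml_text, type_attr):
    """Map each child element's tag to a (declared type, text) pair."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.get(type_attr), child.text) for child in root}


print(typed_values(SITE_A_XML, "type"))
print(typed_values(SITE_B_XML, "kind"))
```

Each fragment is easy to read on its own, since the data type is stated explicitly. The two results nevertheless share no keys and no type names, so a consumer still needs a per-site mapping unless publishers agree on common terms.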
Once data has been collected, there are no good mechanisms for disseminating it. Every Web site that publishes information stands alone. Each publisher writes its own content, sells its own ads, and manages its own online community. Web publishers that include others' content, such as Amazon.com with others' book reviews and The Boston Globe with others' newswire stories, require their partners' active participation to integrate their databases. This function is also quite difficult to scale up to millions of potential partners and the trillions of possible bilateral partnerships between those potential partners. The matter becomes even more complicated when advertisements, which can come from thousands of sources, need to be associated with data and presented to the end-users who read the publisher's Web site. Finally, there is currently no easy way for the online communities of various Web sites to profit from each other's knowledge by forming a “meta-community” which could, for example, automatically share movie reviews and ratings across thousands of movie fan Web communities.
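The scale of the bilateral-partnership problem follows from simple counting: among n publishers there are n(n-1)/2 distinct pairs. The short sketch below evaluates that formula for a few publisher counts chosen purely for illustration.

```python
from math import comb


def bilateral_partnerships(publishers):
    """Number of distinct pairs among the given number of publishers."""
    return comb(publishers, 2)


# Illustrative publisher counts only; not figures from any survey.
for n in (12, 1_000_000, 3_000_000):
    print(f"{n:>9,} publishers -> {bilateral_partnerships(n):,} possible pairs")
```

One million publishers already yield roughly half a trillion possible pairs, and a few million push the count into the trillions, so pairwise integration agreements grow far faster than the number of sites.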