Web sites present information on a wide range of topics and in a variety of formats. Manually locating and extracting useful data from web sites often requires considerable effort from a user. There is therefore a need for value-added services that integrate information from multiple sources. Examples of such services include customizable web information-gathering robots/crawlers, comparison-shopping agents, meta-search engines, and news bots.
To facilitate the development of these information integration systems, good tools are needed for information gathering and extraction. In situations where data has been collected from different web sites, a conventional approach for extracting data from various web pages uses programs called “wrappers” or “extractors” to extract or excerpt data items, or “features,” from the contents of the web pages.
For example, an extractor might attempt to categorize different data items that occur within a particular web page. If the web page comprises an advertisement for an employment opportunity, for example, then the extractor might attempt to locate, within the web page, separate data items that fit into “job title” and “job location” categories. The extractor might attempt to categorize data items on multiple separate web pages in this manner. When the extractor locates a data item that the extractor deems to fit a particular category, the extractor may insert that data item into a search index, and establish an association between that data item and the category that the data item is deemed to fit. When a user later queries a search engine, the search engine may consult the search index to find search results in which the user may be interested. The accuracy and completeness of the contents of the search index strongly influence the relevance and value of those results.
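The categorize-and-index flow described above can be sketched as follows. This is a minimal illustration only, not a description of any particular extractor: the regular-expression patterns, the function names `extract_features` and `build_index`, and the sample pages are all hypothetical, and a practical extractor would use much richer criteria than a single pattern per category.

```python
import re
from collections import defaultdict

# Hypothetical per-category patterns (an assumption for illustration only).
CATEGORY_PATTERNS = {
    "job title": re.compile(r"Title:\s*(.+)"),
    "job location": re.compile(r"Location:\s*(.+)"),
}


def extract_features(page_text):
    """Return {category: data item} for each category located on the page."""
    features = {}
    for category, pattern in CATEGORY_PATTERNS.items():
        match = pattern.search(page_text)
        if match:  # the extractor may fail to locate an item for a category
            features[category] = match.group(1).strip()
    return features


def build_index(pages):
    """Associate each (category, data item) pair with the pages it came from."""
    index = defaultdict(set)
    for url, text in pages.items():
        for category, item in extract_features(text).items():
            index[(category, item.lower())].add(url)
    return index


# Hypothetical sample pages.
pages = {
    "http://example.com/ad1": "Title: Software Engineer\nLocation: Austin, TX",
    "http://example.com/ad2": "Title: Data Analyst\nLocation: Austin, TX",
}
index = build_index(pages)
```

In this sketch, a search engine answering a query for jobs located in "Austin, TX" would look up the key `("job location", "austin, tx")` and obtain both sample pages.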
For a particular web page and a particular category, the extractor might or might not be able to locate, on that page, a data item that fits that category. If the criteria used to identify a data item that fits a particular category are not well adapted to the construction of the page, then the extractor might mistakenly determine that a data item other than the “correct” data item fits the category. For example, the extractor might mistakenly determine that the “job location” data in a web page (rather than the actual “job title” data in that web page) fits into the “job title” category.
Based on how many of the criteria are satisfied by the data item selected for a category, the extractor might assign, to the selected data item, an indication of how likely it is that the data item actually was the “correct” data item on the page, that is, how likely it is that the data item actually did fit the category. This indication is commonly called a “confidence measure.” Data items that are very likely to be the “correct” data items may be associated with relatively high confidence measures, while data items that are less likely to be the “correct” data items may be associated with lower confidence measures. If the confidence measure for a particular data item is lower than a certain threshold, then the extractor might refrain from inserting the data item into the search index at all.
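One simple way to realize such a confidence measure is as the fraction of criteria that a candidate data item satisfies. The sketch below assumes that definition; the four “job title” criteria, the threshold value of 0.75, and the names `confidence` and `maybe_index` are hypothetical choices made only for illustration.

```python
def confidence(item, criteria):
    """Fraction of the criteria that the candidate data item satisfies."""
    if not criteria:
        return 0.0
    return sum(1 for criterion in criteria if criterion(item)) / len(criteria)


# Hypothetical criteria for the "job title" category.
JOB_TITLE_CRITERIA = [
    lambda s: len(s.split()) <= 5,                # a short phrase
    lambda s: s.istitle(),                        # written in title case
    lambda s: not any(ch.isdigit() for ch in s),  # contains no digits
    lambda s: "," not in s,                       # contains no comma
]

THRESHOLD = 0.75  # assumed cutoff; items scoring below it are not indexed


def maybe_index(index, category, item, criteria, threshold=THRESHOLD):
    """Insert the item into the index only if its confidence meets the threshold."""
    score = confidence(item, criteria)
    if score >= threshold:
        index.setdefault(category, []).append((item, score))
    return score


search_index = {}
title_score = maybe_index(search_index, "job title", "Software Engineer", JOB_TITLE_CRITERIA)
location_score = maybe_index(search_index, "job title", "Austin, TX 78701", JOB_TITLE_CRITERIA)
```

Here the candidate "Software Engineer" satisfies all four criteria and is indexed, whereas "Austin, TX 78701" satisfies only one of them and is left out of the index.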
After an extractor has automatically populated the search index, the search index may contain some incorrect entries, and may omit some correct entries. One approach for revising the search index involves employing a human being to look through the extracted data items manually, determine which data items have relatively low confidence measures, read the pages from which the low-confidence data items were excerpted, and determine whether any data items in those pages actually do fit the categories at issue. Although human beings are consistent and accurate in some cases, they usually operate slowly, and they can cost a considerable amount of money to train and maintain. Some human beings are less consistent and accurate than others, especially after they have been working uninterrupted for long periods of time, so mistakes occur.
Other approaches for revising the search index rely on the web pages being formatted in a known way and, as a result, are inapplicable if the web pages are not formatted in that known way or if the structure of the web pages deviates over time from that known way. For example, some approaches might require the web pages to be HTML documents that conform to a specified scheme. These approaches fail when applied to documents that are not in HTML or that depart from the scheme even to a minor extent, sometimes because the documents have changed after the extraction process occurred.
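The fragility of a format-dependent approach can be illustrated with a short sketch built on Python's standard `html.parser` module. The scheme assumed here, in which the job title always appears inside `<span class="job-title">`, is hypothetical, as are the class name `SchemeExtractor`, the function `extract_title`, and the two sample documents.

```python
from html.parser import HTMLParser


class SchemeExtractor(HTMLParser):
    """Collects the text of the first <span class="job-title"> element.

    The markup scheme is a hypothetical one assumed for this example.
    """

    def __init__(self):
        super().__init__()
        self._inside = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "job-title") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside and self.title is None:
            self.title = data.strip()


def extract_title(html):
    """Return the job title if the page follows the scheme, else None."""
    parser = SchemeExtractor()
    parser.feed(html)
    parser.close()
    return parser.title


conforming = '<div><span class="job-title">Software Engineer</span></div>'
changed = '<div><h2 class="title">Software Engineer</h2></div>'

title_from_conforming = extract_title(conforming)
title_from_changed = extract_title(changed)
```

With the conforming document the title is recovered, but after a minor change to the markup the same extractor returns `None`, even though the job title is still plainly present on the page.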
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.