Web pages provide a highly flexible and effective medium for presenting information. The information on any particular web page is generally not, however, optimized for substantive analysis by machine or computer.
One type of substantive analysis that can be automated is the extraction of information from web pages. The extracted information may, for example, include a description or attribute of a news article, product, service, job listing, company, person, or any other type of item that might appear on a web page. Prior technology has often relied upon regular expression matching, which can be unreliable and may require substantial processing. Other prior technology has tried to use the structural information available in web pages to improve extraction accuracy and lower the associated processing requirements.
U.S. Patent Pub. 2002/0143659, which is owned by the assignee of the present application and is incorporated herein by reference, describes methods by which a structural graph representation of a sampled web page, such as a Document Object Model (DOM) representation, may be used to create an extraction rule for extracting data from web pages with a similar structure. These methods take advantage of the similarity in web page structure that is common among groups of web pages of the same web site. One limitation of this approach, however, is that a rule generated from a single sampled web page will sometimes be incapable of reliably extracting desired data values from other web pages.
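
By way of illustration only, the following minimal Python sketch shows how an extraction rule might be derived from a single sampled page and why such a rule can fail on a page whose structure differs slightly. It is not the method of the referenced publication. The rule format (a path of tag names and child positions), the helper names `parse`, `find_path`, and `apply_rule`, and the sample pages are all assumptions made for this example, and the simple tree builder assumes well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified DOM-like tree (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects; assumes every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_path(node, value, path=()):
    """Derive a rule: the (tag, child index) path to the node holding value."""
    if node.text == value:
        return path
    for i, child in enumerate(node.children):
        found = find_path(child, value, path + ((child.tag, i),))
        if found is not None:
            return found
    return None


def apply_rule(root, path):
    """Follow the path on another page; return None if the structure differs."""
    node = root
    for tag, i in path:
        if i >= len(node.children) or node.children[i].tag != tag:
            return None
        node = node.children[i]
    return node.text


# Hypothetical pages from the same site. The third has an extra <p> element.
sample = ("<html><body><h1>Acme Widget</h1>"
          "<div><span>Price</span><span>$19.99</span></div></body></html>")
similar = ("<html><body><h1>Bolt Gadget</h1>"
           "<div><span>Price</span><span>$4.50</span></div></body></html>")
variant = ("<html><body><h1>Cog Gizmo</h1><p>On sale</p>"
           "<div><span>Price</span><span>$7.25</span></div></body></html>")

# The rule is generated from the single sampled page only.
rule = find_path(parse(sample), "$19.99")

print(apply_rule(parse(similar), rule))  # $4.50
print(apply_rule(parse(variant), rule))  # None
```

In this sketch the rule extracts the price from the structurally identical page but returns nothing for the page with the extra element, because the price's enclosing `div` is no longer at the child position recorded in the rule.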