Many companies selectively extract data from web documents. For example, a typical shopping comparison web service obtains desirable information, such as product names, model numbers, and prices, from the web pages of various online retailers. The shopping comparison service will then reorganize and list this information so that its visitors may easily compare the pricing of similar products by multiple vendors.
Such functionality requires a reliable means of finding desired information in web documents with different structures. The structure of a web document can be represented as an XML document tree. FIG. 1 displays a simplified example of such a tree 100. Tree 100 pertains to a web page for a site that may be classified as being about a movie or movies. Information can be extracted from tree 100 by referring to nodes in the tree structure. For instance, to extract the director name from tree 100, we can use the following XPath expression:
W1 ≡ /html/body/div[2]/table/td[2]/text()  (1)
which specifies how to traverse trees having a structure similar to tree 100. In particular, the XPath expression W1 above starts from the root, follows the tags html, body, the second div under body, the table under the second div, the second td under table and then the text() under the second td. The path expression W1 is often called a wrapper.
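By way of illustration only, applying a wrapper such as W1 can be sketched in Python as follows. The page markup, the director name "Jane Doe", and the helper name apply_wrapper are hypothetical and are not taken from FIG. 1. The sketch uses the standard-library xml.etree.ElementTree module, which supports only a subset of XPath: paths are evaluated relative to the root element (html), and text() is not available, so the final step of W1 is approximated by reading the .text attribute of the matched node.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with a structure similar to the one W1 expects;
# the actual contents of tree 100 are not reproduced here.
PAGE = (
    "<html><body>"
    "<div>Site banner</div>"
    "<div><table><td>Director:</td><td>Jane Doe</td></table></div>"
    "</body></html>"
)

# Element part of W1, written relative to the root element (html).
W1_ELEMENT_PATH = "body/div[2]/table/td[2]"


def apply_wrapper(page_source, element_path=W1_ELEMENT_PATH):
    """Return the text reached by the wrapper path, or None if the path fails."""
    root = ET.fromstring(page_source)
    node = root.find(element_path)
    return None if node is None else node.text


director = apply_wrapper(PAGE)
print(director)
```

A full XPath engine could evaluate expression (1) directly, including the leading /html step and the trailing text() step; the split used here is an artifact of the chosen library rather than part of the wrapper concept.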
While the conventional use of wrappers is an effective way to extract information, it suffers from a fundamental problem: the underlying web pages change frequently, often only slightly, and such changes may cause the wrapper to “break,” i.e., the path of the wrapper no longer leads to the desired data item in the web page. As a result, a new wrapper must be learned to accommodate the changes in the web page. For instance, consider the wrapper W1 above. It breaks if the structure of tree 100 changes in any of the following ways: a new div section is added before the content section; the first div is deleted or merged with the second div; a new table or tr is added under the second div; and so on. Because websites constantly undergo small edits of this kind, wrappers break often and must be relearned repeatedly.
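The breakage described above can be illustrated with a minimal Python sketch. The three page variants, the director name "Jane Doe", and the helper name extract are hypothetical. The sketch uses the standard-library xml.etree.ElementTree module, whose path syntax is a subset of XPath evaluated relative to the root element, so the element portion of the wrapper is written as body/div[2]/table/td[2] and the text is read from the matched node.

```python
import xml.etree.ElementTree as ET

# Element part of the wrapper, relative to the root element (html).
ELEMENT_PATH = "body/div[2]/table/td[2]"

# Hypothetical original page: the wrapper path reaches the director name.
ORIGINAL = (
    "<html><body><div>Banner</div>"
    "<div><table><td>Director:</td><td>Jane Doe</td></table></div>"
    "</body></html>"
)

# Small edit 1: a new div is added before the content section.
NEW_DIV = (
    "<html><body><div>Banner</div><div>Promotion</div>"
    "<div><table><td>Director:</td><td>Jane Doe</td></table></div>"
    "</body></html>"
)

# Small edit 2: a tr is added under the table in the second div.
NEW_TR = (
    "<html><body><div>Banner</div>"
    "<div><table><tr><td>Director:</td><td>Jane Doe</td></tr></table></div>"
    "</body></html>"
)


def extract(page_source):
    """Follow the wrapper path; return None when it no longer leads anywhere."""
    node = ET.fromstring(page_source).find(ELEMENT_PATH)
    return None if node is None else node.text


results = {
    "original": extract(ORIGINAL),
    "new_div": extract(NEW_DIV),
    "new_tr": extract(NEW_TR),
}
print(results)
```

In both edited variants the director name is still present on the page, yet the fixed path no longer reaches it, which is the sense in which the wrapper is said to break.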