A wrapper is a software component or interface, tied to a particular data source, that encapsulates and hides the intricacies of that information source in accordance with a set of rules. Each wrapper is associated with a particular information source and its data type. For example, HTTP wrappers interact with HTTP servers and HTML documents; JDBC wrappers work with ODBC-compliant databases; and DMA wrappers work with DMA-compliant document management systems.
The World Wide Web (Web) is a rich source of information across many domains of human activity, and integrating Web data into user applications has become common practice. These applications use wrappers to encapsulate access to Web information sources and to allow the applications to query the sources like a database. Wrappers fetch HTML pages, whether static or generated dynamically in response to user requests, extract the relevant information, and deliver it to the application, often in XML format. Web wrappers include a set of extraction rules that instruct an HTML parser how to extract and label the content of a Web page. These extraction rules are specific to a given Web provider and therefore may be tightly linked to the layout and structure of the provider's pages.
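The role of extraction rules can be illustrated with a minimal Python sketch. It is not the implementation of any particular wrapper: the `ExtractionRule` class, the `wrap` function, the landmark strings, and the sample page are all hypothetical, and the page is hard-coded rather than fetched from an HTTP server. Each rule is assumed to locate a value between a left and a right landmark, and each label is assumed to be a valid XML element name.

```python
import re
from dataclasses import dataclass
from typing import List, Optional
from xml.sax.saxutils import escape


@dataclass
class ExtractionRule:
    """Hypothetical rule: the value lies between two landmark strings."""
    label: str  # name given to the extracted value in the XML output
    left: str   # mark-up expected immediately before the value
    right: str  # mark-up expected immediately after the value

    def apply(self, html: str) -> Optional[str]:
        # Non-greedy match of whatever sits between the two landmarks.
        pattern = re.escape(self.left) + r"(.*?)" + re.escape(self.right)
        match = re.search(pattern, html, flags=re.DOTALL)
        return match.group(1).strip() if match else None


def wrap(html: str, rules: List[ExtractionRule]) -> str:
    """Apply every rule to the page and deliver the result as XML."""
    parts = []
    for rule in rules:
        value = rule.apply(html)
        if value is None:
            raise LookupError(f"landmark for '{rule.label}' not found")
        parts.append(f"<{rule.label}>{escape(value)}</{rule.label}>")
    return "<record>" + "".join(parts) + "</record>"


# Sample provider page and the rules tied to its layout (both invented).
PAGE = (
    '<html><body><h1 class="title">Blue Widget</h1>'
    '<span class="price">$9.99</span></body></html>'
)
RULES = [
    ExtractionRule("title", '<h1 class="title">', "</h1>"),
    ExtractionRule("price", '<span class="price">', "</span>"),
]

print(wrap(PAGE, RULES))
# prints <record><title>Blue Widget</title><price>$9.99</price></record>
```

Because each rule names literal mark-up from the provider page, the rules are only as stable as that mark-up.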
When a wrapper is generated, it is assumed that the layout and structure of the document pages do not change. However, Web page owners frequently update and revise their pages, which often involves changing the layout and structure. Wrappers are brittle with respect to such changes: when the page mark-up, layout, or structure is changed, the wrapper may fail to find specific “landmarks” in the page and may therefore fail to apply the corresponding extraction rules, becoming inoperable and incapable of completing the task of information extraction. A broken wrapper must be repaired. Users often find it easier to relearn or regenerate a broken wrapper than to repair it, but relearning requires user intervention that is not always available. Moreover, regenerating the wrapper does not scale if changes occur frequently.
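The brittleness described above can be shown with a short Python sketch. The function name `extract_price` and both page fragments are hypothetical; the sketch assumes a rule that relies on one literal mark-up landmark.

```python
import re
from typing import Optional


def extract_price(html: str) -> Optional[str]:
    """Hypothetical rule: the price sits inside <span class="price">."""
    match = re.search(r'<span class="price">(.*?)</span>', html)
    return match.group(1) if match else None


# The page as it looked when the wrapper was generated.
ORIGINAL = '<span class="price">$9.99</span>'
# The same content after a small mark-up revision by the page owner.
REVISED = '<span class="cost"><b>$9.99</b></span>'

print(extract_price(ORIGINAL))  # prints $9.99
print(extract_price(REVISED))   # prints None
```

The content is unchanged in the revised fragment, yet the rule no longer finds its landmark and the wrapper returns nothing.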
Wrapper maintenance is challenging when provider pages undergo massive and sweeping modifications, due to, for example, a complete site re-design. A re-designed site will usually require regenerating the wrapper. However, most changes to Web pages are small and localized in nature, including small changes in the page mark-up, small changes in the content information, and possibly the addition or deletion of a label. It would be desirable to have a method of generating a wrapper with integrated maintenance components capable of recovering, automatically when possible, from small changes.
One solution to the problem of wrapper maintenance detects page changes within a defined level of accuracy. When a change is detected, the designer is notified so that the wrapper can be regenerated from samples of the changed pages. This solution requires user intervention. Another solution for wrapper repair finds the most frequent patterns (such as starting or ending words) in the content of labeled strings and then searches for these patterns in a page when the wrapper is broken. It would be desirable to have a method for wrapper repair that accurately and automatically repairs wrappers in a large number of situations.
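The second approach, searching for frequent content patterns, might be sketched in Python as follows. This is an illustration under simplifying assumptions, not the published method: the only patterns learned are the single most frequent starting word and ending word, tags are stripped with a regular expression, and the names `learn_patterns`, `text_segments`, and `recover`, together with the sample data, are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def learn_patterns(examples: List[str]) -> Tuple[str, str]:
    """Return the most frequent first and last word of the labeled strings."""
    tokens = [value.split() for value in examples if value.split()]
    if not tokens:
        raise ValueError("at least one non-empty labeled string is required")
    starts = Counter(words[0] for words in tokens)
    ends = Counter(words[-1] for words in tokens)
    return starts.most_common(1)[0][0], ends.most_common(1)[0][0]


def text_segments(html: str) -> List[str]:
    """Strip tags and return the non-empty text segments of the page."""
    return [s.strip() for s in re.split(r"<[^>]+>", html) if s.strip()]


def recover(html: str, examples: List[str]) -> Optional[str]:
    """Find the first text segment matching the learned content patterns."""
    start, end = learn_patterns(examples)
    for segment in text_segments(html):
        words = segment.split()
        if words[0] == start and words[-1] == end:
            return segment
    return None


# Values the wrapper extracted for one label before it broke (invented).
LABELED = ["Ships in 2 days", "Ships in 5 days", "Ships in 7 days"]
# A page whose mark-up changed, so the landmark-based rule no longer applies.
CHANGED_PAGE = '<div class="new"><em>Blue Widget</em><p>Ships in 3 days</p></div>'

print(recover(CHANGED_PAGE, LABELED))  # prints Ships in 3 days
```

The search ignores mark-up entirely, so it survives layout changes, but it can only recover values whose content resembles the strings seen earlier.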