Several methods have been proposed for extracting data from semi-structured documents—documents which do not have a completely regular and static structure. For example, methods are known for extracting required data from Internet Web sites using “wrappers” (specialized computer program routines that automatically extract data from the Web sites). According to some estimates, over 80% of the published information available via the Web (i.e. the World Wide Web Internet service) is based on databases that run in the background. The structure of the underlying database is lost in the process of generating HTML pages. Wrappers try to reverse this process by extracting relevant data from HTML pages and reconstructing the structure—mapping the HTML source to a set of semi-structured (or structured) database objects that can be queried and manipulated by applications.
Most wrapper-based methods represent Web pages as a sequence of tokens that include strings and HTML tags. The methods then involve constructing a representative label for the desired data elements. The representative label provides a way to identify desired data within a given document based on the structure of the document. These representative labels can be created either manually or semi-automatically via a graphical user interface.
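By way of illustration only, the token-sequence representation and a structure-based representative label might be sketched as follows. The sample page, the `tokenize` and `extract_after` helpers, and the particular label are hypothetical and are not taken from any of the cited methods; the sketch uses only the Python standard library.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flattens an HTML page into a sequence of tag tokens and string tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only strings
            self.tokens.append(text)


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def extract_after(tokens, label):
    """Return the token immediately following the first occurrence of the
    label (itself a short token sequence), or None if the label is absent."""
    n = len(label)
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == label:
            return tokens[i + n]
    return None


# Hypothetical page and label, for illustration only.
page = "<html><body><h1>Widget</h1><p>Price:</p><b>$9.99</b></body></html>"
tokens = tokenize(page)
label = ["Price:", "</p>", "<b>"]  # structural context preceding the desired data
price = extract_after(tokens, label)
```

Here the label identifies the desired data purely by the tokens that precede it, which is the sense in which such labels depend on the structure of the document.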
The representative labels of relevant data fields can also be used as characteristic features of a document, and a classification algorithm can then classify the documents in a given document collection based on such features.
The Web is extremely dynamic and continually evolving, such that there are frequent changes in the structure and content of Web sites and documents. A commercial Web site may be updated to apply new Web page design techniques, to add a description of new product features, to change the page layout, or to correct errors. Consequently, representative labels that use specific structural information relating to a document (such as specifying the location of information within a page) must be updated regularly in order to maintain the desired functionality of conventional wrappers. However, updating the labels is a cumbersome and time-consuming process.
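The fragility of location-based labels can be illustrated with a minimal sketch. The two page versions, the regular-expression tokenizer, and the fixed token position used as a label are all assumptions made for this example.

```python
import re


def tokens_of(html):
    # Split the page into tag tokens and the text runs between them.
    return re.findall(r"<[^>]+>|[^<]+", html)


# Hypothetical label: "the desired data is the token at position 6".
PRICE_POSITION = 6

old_page = "<html><body><h1>Widget</h1><b>$9.99</b></body></html>"
# The same page after a layout change inserts a promotional paragraph.
new_page = "<html><body><h1>Widget</h1><p>New!</p><b>$9.99</b></body></html>"

old_value = tokens_of(old_page)[PRICE_POSITION]
new_value = tokens_of(new_page)[PRICE_POSITION]
```

With these inputs the label returns the price from the original page but returns the inserted promotional text from the updated page, so the label would have to be revised by hand after the change.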
Davulcu et al in “Computational Aspects of Resilient Data Extraction from Semi-structured Sources”, Proceedings of 19th ACM SIGMOD Symposium on Principles of Database Systems (PODS), 2000, Dallas, Tex., US, pages 136-144, present a formal framework for creating resilient data extraction wrappers for semi-structured data. They propose the notion of extraction expressions, which are tag-marked regular expressions used to identify the desired data. Davulcu et al use the following two-stage strategy to find a resilient extraction expression for desired data in a document. In the first stage, several perturbations of the given document are made and extraction expressions for the desired data in all of the perturbations are determined. In the second stage, Davulcu et al try to generalize these extraction expressions into a single extraction expression that matches all the perturbed instances of the document. Davulcu et al further introduce the notion of “unambiguity” as a consistency requirement for the generalized extraction expression.
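The general idea of testing a candidate expression against perturbed versions of a page can be sketched as follows. This is a simplified illustration and not the algorithm of Davulcu et al: the perturbed pages are written by hand, the candidate expressions are ordinary Python regular expressions, and “unambiguity” is approximated here as the expression matching exactly once on each page and yielding the desired value. All names and sample data are hypothetical.

```python
import re

TARGET = "$9.99"  # the desired data in every perturbed page

# Hand-written perturbations of one hypothetical page (assumed for this sketch).
perturbed_pages = [
    "<body><h1>Widget</h1><b>$9.99</b></body>",
    "<body><h1>Widget</h1><p>New!</p><b>$9.99</b></body>",
    "<body><div><h1>Widget</h1></div><b>$9.99</b><i>In stock</i></body>",
]

# Two candidate expressions: one tied to the exact page layout,
# one that relies only on the tags marking the desired data.
candidates = {
    "positional": r"^<body><h1>[^<]*</h1><b>([^<]*)</b>",
    "tag-marked": r"<b>([^<]*)</b>",
}


def is_resilient(pattern, pages, target):
    """True if the pattern extracts exactly the target, and nothing else,
    from every perturbed page (a simplified stand-in for unambiguity)."""
    for page in pages:
        if re.findall(pattern, page) != [target]:
            return False
    return True


resilient = [
    name
    for name, pattern in candidates.items()
    if is_resilient(pattern, perturbed_pages, TARGET)
]
```

In this sketch only the expression anchored on the marking tags survives all three perturbations. The result holds only for the perturbations that were actually checked, and says nothing about changes outside that set.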
The method disclosed by Davulcu et al considers a specific set of perturbed pages, apparently relying on the assumption that checking “unambiguity” of a generalized extraction expression for a specific class of perturbations will provide an acceptable resilient extraction expression. This assumption does not always hold in practice, because a page may change in ways that fall outside the class of perturbations considered. Davulcu et al mention other limitations of their techniques, and express uncertainty regarding whether maximisation of resilience can be determined.