1. Field of Invention
This disclosure teaches techniques related in general to the field of information processing. More particularly, the teachings relate to methods, systems and computer-program products for information extraction from Web pages, to the construction of wrappers (i.e. extraction programs), based on example Web pages, and to the transformation of “relevant” parts of HTML documents into XML.
2. Basic Concepts, Terminology, and Introduction
The World Wide Web (abbreviated as Web) is the world's largest data repository and information retrieval system. In this environment, client machines carry out transactions with Web servers using the Hypertext Transfer Protocol (HTTP), an application protocol that usually provides user access to files formatted in a standard page description language known as Hypertext Markup Language (HTML). HTML provides basic document formatting (and some logical markup) and allows the developer to specify “links” to other servers and documents. In the Internet paradigm, a network location reference to a server or to a specific Web resource at a server (for example a Web page) is identified by a so-called Uniform Resource Locator (URL) having a well-defined syntax for describing such a network location. The use of an (HTML-compatible) browser (e.g. Netscape Navigator, Microsoft Internet Explorer, Amaya or Opera) at a client machine involves the specification of a link by means of a URL. The client then makes a request to the server (also referred to as “Web site”) identified by the link and receives in return an HTML document or some other object of a known file type. Simple, less sophisticated browsers can be written in a short time and with little effort in (object-oriented) programming languages such as Java, where powerful program libraries are available that already contain modules or classes providing the main functionalities of browsers (for example, the JEditorPane class of the javax.swing package).
Browsers or other applications working with HTML documents internally represent an HTML document in the form of a tree data structure that basically corresponds to a parse tree of the document. A model for representing and manipulating documents in the form of trees is referred to as a Document Object Model (DOM). Several DOMs for HTML documents have been defined and are used by different programming environments and applications, but the differences among these DOMs are rather inessential. An example of such a DOM is the so-called “Swing DOM”, which is part of the javax.swing package, a programming package containing useful libraries of Java classes for manipulating HTML documents. A DOM tree of an HTML document represents its hierarchical structure. In particular, the root of a DOM tree of an HTML document represents the entire document, while intermediate nodes of the tree represent intermediate elements such as tables, table rows, and so on. The leaves usually represent terminal (i.e., structurally indecomposable) data such as atomic text items or images. Each node of an HTML DOM tree can be associated with certain attributes that describe further features of the represented element (such as style, font size, color, indentation, and so on).
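The tree structure described above can be illustrated with a minimal sketch. It is written in Python using only the standard `html.parser` module and is not the Swing DOM; the `Node`, `TreeBuilder` and `dump` names and the sample markup are hypothetical, and the sketch assumes well-formed input in which every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class Node:
    """One node of a simplified DOM-like tree (hypothetical helper)."""

    def __init__(self, name, attrs=None, text=None):
        self.name = name              # element name, or "#text" for a leaf
        self.attrs = dict(attrs or [])  # attributes such as color or style
        self.text = text              # text content for "#text" leaves
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree from well-formed HTML with explicit closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # the root represents the whole document
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes properly nested input; real-world HTML needs error recovery.
        if len(self.stack) > 1 and self.stack[-1].name == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def dump(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    label = node.text if node.name == "#text" else node.name
    if node.attrs:
        label += " " + str(node.attrs)
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


SAMPLE = ('<html><body><table><tr><td bgcolor="yellow">Notebook</td>'
          '<td>950</td></tr></table></body></html>')

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
print("\n".join(dump(builder.root)))
```

In the printed tree, the table and table-row elements appear as intermediate nodes, the two text items appear as leaves, and the `bgcolor` attribute is attached to its table cell.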
One important disadvantage of HTML is that it is mainly oriented toward formatting and layout, not toward data description. In fact, the nodes of an HTML DOM tree are predefined elements that basically correspond to HTML formatting tags. Therefore it is difficult and very cumbersome (if at all possible) to query an HTML document using query languages in order to automatically extract useful and hierarchically structured information. Given that HTML provides neither data description nor any tagging or labeling of data except for formatting purposes, it is often difficult and sometimes impossible to formulate a query that allows a system to distinguish, say, a first name from a family name or from an address appearing in the same HTML document. For this reason, Web documents which are intended to be queried or processed by software applications are hierarchically organized using display-independent markup. Such so-called semistructured documents are often more suitably formatted in markup languages such as XML (eXtensible Markup Language).
XML is a standard for data exchange adopted as a Recommendation by the World Wide Web Consortium (W3C) in 1998. The main advantage of XML is that it allows a designer of a document to label data elements using freely definable tags. The data elements can be organized in a hierarchy with arbitrarily deep nesting. Optionally, an XML document can contain a description of its grammar, the so-called Document Type Definition (DTD). An XML document or a set of such documents can be regarded as a database and can be directly processed by a database application or queried via one of the new XML query languages such as XSL, XSLT, XPath, XPointer, XQL, XML-QL, XML-GL, and XQuery. Moreover, powerful languages such as XSLT do not just serve for defining queries but can transform their output into a format suitable for further processing, e.g. into an email message or a piece of plain text to be sent to a cellular phone.
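As a minimal sketch of how freely definable tags make data directly queryable, the following Python fragment distinguishes family names from first names and addresses. The document content and tag names are invented for illustration, and the query uses the limited XPath subset of the standard `xml.etree.ElementTree` module rather than a full implementation of any of the query languages named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; every tag name is chosen by the document designer.
DOC = """
<contacts>
  <person>
    <name><first>Ada</first><family>Lovelace</family></name>
    <address><city>London</city></address>
  </person>
  <person>
    <name><first>Alan</first><family>Turing</family></name>
    <address><city>Wilmslow</city></address>
  </person>
</contacts>
""".strip()

root = ET.fromstring(DOC)

# A path expression selects family names only, never first names or addresses.
family_names = [e.text for e in root.findall("./person/name/family")]

# Elements can also be located by tag name at any nesting depth.
cities = [e.text for e in root.iter("city")]

print(family_names)
print(cities)
```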
Note that most Web pages are still formatted in HTML. This is not expected to change soon, even though XML has been attracting a lot of attention. One reason for this may be that, due to the limited syntax of HTML, this language is somewhat easier to learn and to use than XML. Moreover, HTML documents are very often designed by laypersons, i.e., non-programmers, who are not suitably trained in the logical skills to systematically define data structures as required by XML and who therefore feel more comfortable using widely available editors and tools such as Dreamweaver, Frontpage or HotMetal in order to create HTML Web pages in a “what you see is what you get” manner. Furthermore, document designers often do not anticipate the need of others to process their documents automatically but mainly have a human beholder of their Web pages in mind. Finally, many companies deliberately refrain from offering data in XML format in order to obstruct automated processing of the published data by others.
On the other hand, there is a tremendous need for automating Web data processing and monitoring tasks. In the Business to Business (B2B) context it is often of crucial importance to a company to be immediately informed about price changes on the Web site of a competitor, about new public offerings or tenders appearing on a Web site of some corporate or government institution, or about changes in exchange rates, share quotes, and so on. Similarly, individuals can profit greatly from automated Web monitoring. For example, imagine one would like to monitor interesting notebook offers at electronic auctions such as eBay (http://www.ebay.com). A notebook offer is considered interesting if, say, its price is below GBP 3000 (Great Britain Pounds), and if it has already received at least two offers from others. The eBay site allows one to make a keyword search for “notebook” and to specify a price range in USD (US Dollars) only. More complex queries such as the desired one cannot be formulated. Similar sites do not even offer restricted query possibilities and leave the user with a large number of result records organized in a huge table split over dozens of Web pages. One has to wade through all these records manually, because there is no way to restrict the result further.
All these problems could be solved efficiently if the relevant parts of the respective source data were made available in XML format.
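To make this concrete, the notebook query described above becomes a few lines of code once the offers are available as XML. The following Python sketch assumes a hypothetical XML layout (`offers`, `offer`, `title`, `price`, `bids`) and invented sample data; it is not the output format of any real auction site.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of auction offers; all values are invented.
OFFERS_XML = """<offers>
  <offer><title>Notebook A</title><price currency="GBP">2500</price><bids>3</bids></offer>
  <offer><title>Notebook B</title><price currency="GBP">3400</price><bids>5</bids></offer>
  <offer><title>Notebook C</title><price currency="GBP">1200</price><bids>1</bids></offer>
</offers>"""


def interesting_offers(xml_text, max_price=3000.0, min_bids=2):
    """Return titles of offers priced below max_price with at least min_bids bids."""
    root = ET.fromstring(xml_text)
    result = []
    for offer in root.findall("offer"):
        price = float(offer.findtext("price"))
        bids = int(offer.findtext("bids"))
        if price < max_price and bids >= min_bids:
            result.append(offer.findtext("title"))
    return result


print(interesting_offers(OFFERS_XML))
```

With this sample data only the first offer satisfies both conditions: the second is too expensive and the third has received only one bid.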
Thus, there is a significant need for methods and systems that are able to perform some or all of the following four tasks:
1. Identify and isolate relevant parts or elements of (possibly remote) Web pages.
2. Automatically extract the relevant parts of Web documents even though the respective documents may continually change contents and even (to a certain extent) structure.
3. Suitably transform the extracted parts into XML to make them available for querying and further processing.
4. Assist a developer or application programmer in creating and using programs or systems able to perform tasks (1), (2), and (3). A subtask of central importance is supporting the developer in the definition of relevant extraction patterns. Extraction patterns serve to identify information of one particular kind.
Tasks (1) and (2) together are often referred to as “Web information extraction” or as “data extraction from the Web”. Task (3) is referred to as “translation into XML”. Note that a useful and meaningful translation into XML consists not merely of reformatting an HTML document according to the XML standard, but also of enriching the document with structural information and data description tags. The translated document will thus contain some structural and descriptive information that is not present in the original document.
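Tasks (1) to (3) can be sketched, under strong simplifying assumptions, as a small hand-written program in Python. The sketch assumes that the relevant data sits in the `td` cells of table rows and that the designer supplies the descriptive field names; `RowExtractor`, `wrap`, the tag names and the sample page are hypothetical and do not represent the method taught in this disclosure.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class RowExtractor(HTMLParser):
    """Tasks (1) and (2): collect the cell texts of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per data row
        self._row = None        # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # header rows contain no td cells and are dropped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data   # formatting tags inside a cell are ignored


def wrap(html_text, field_names, record_tag="offer", root_tag="offers"):
    """Task (3): emit XML whose tags describe the data, not its layout."""
    extractor = RowExtractor()
    extractor.feed(html_text)
    extractor.close()
    root = ET.Element(root_tag)
    for row in extractor.rows:
        if len(row) != len(field_names):
            continue            # skip rows that do not match the expected pattern
        record = ET.SubElement(root, record_tag)
        for name, value in zip(field_names, row):
            ET.SubElement(record, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")


PAGE = """<html><body><h1>Offers</h1><table>
<tr><th>Item</th><th>Price</th></tr>
<tr><td><b>Notebook A</b></td><td>2500</td></tr>
<tr><td>Notebook B</td><td>3400</td></tr>
</table></body></html>"""

xml_out = wrap(PAGE, ["title", "price"])
print(xml_out)
```

The `title` and `price` tags in the output carry descriptive information that is absent from the HTML source, which only marks the same values as table cells and bold text.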
A program specifying how the above tasks (1), (2), and (3) are to be performed is referred to as “wrapper” or “wrapper program”. Wrappers may be written in a publicly available multi-purpose (procedural) programming language with primitives able to manipulate web resources (such as Java, C++, or Perl) in which case they can be compiled (or interpreted) and executed in a regular fashion using standard software resources (just as other programs in that language). Alternatively, wrappers can be formulated in some dedicated or proprietary high-level declarative language that needs a specially constructed interpreter or compiler.
A program or system that automatically or semi-automatically generates wrappers is referred to as a “wrapper generator”. A software tool that merely assists a human in manually programming and testing a wrapper is referred to as a “wrapper programming assistant”. Task (4) can be solved by means of a wrapper generator, by means of a wrapper programming assistant, or by some hybrid tool.
3. Desirable Properties of Methods and Systems for Wrapper Generation and Web Information Extraction
It is desirable to enable a very large number of computer users, including laypersons having no programming skills or expertise in HTML or similar formats, to create robust wrappers using a small number of sample pages, such that these wrappers are then able to automatically extract relevant and complex parts of Web pages and to translate the extracted information automatically into XML. With respect to this goal, a method or system for wrapper generation, Web data extraction, and translation into XML should fulfill at least some of the following properties:
High expressive power. The system should enable the definition of complex, structurally organized patterns from Web pages and translate the corresponding data (the so-called pattern instances) into a corresponding hierarchically structured XML document.
User friendliness. It should allow a human wrapper designer to design, program, or specify wrappers in a very short time. The user interaction should be efficient and suitable for constructing wrappers and specifying the XML translation.
Good learnability. The learning effort for being able to understand the method or use the system should be as small as possible. The method or system should be accessible to, and usable by, a layperson who is not a programmer or a computer scientist and has no programming experience. In the best case, it should not even require knowledge of HTML or XML, which means that a designer is never directly confronted with HTML or XML code (even the XML output can be displayed using nested tables).
Good visual support. It should offer the wrapper designer a GUI (graphical user interface) for specifying wrappers or XML translations. Ideally, the visual user interface allows a wrapper designer to work directly on displayed sample source documents (e.g. on HTML Web pages) and supports a purely visual way of defining extraction patterns.
Ease of accessibility and installation. The system should be widely accessible and should not require particular installation efforts. Ideally, the system provides an interface so that it can be used through a standard Web browser such as Netscape or Internet Explorer.
Parsimony of samples. In case the method or system uses sample pages as a basis for constructing wrappers, it should require only very few of these (a single one at best) for most applications. The reason is that, in many cases, a wrapper designer has only one or very few sample pages at hand. For example, if we decide to construct a wrapper to translate the homepage of the United States Patent and Trademark Office (USPTO), available at http://patents.uspto.gov/, into XML (e.g. in order to monitor upcoming new information and press releases and new federal register notes), then, at the time of wrapper construction, only one instance of this page will be at hand, namely the current page. It should be possible to construct a wrapper based on this single instance which works well for future versions of this page.
Robustness. Wrappers are generally aimed at extracting information from similarly structured Web pages of changing content. It is obvious that wrappers risk failing to deliver a correct result if the structure of the source documents changes. However, a good wrapper is expected to have a certain degree of robustness, i.e., insensitivity to minor structural changes. The method or system should allow the generation of fairly robust wrappers.
Runtime efficiency. The method should provide efficient algorithms and the system should implement these algorithms efficiently, such that the system becomes usable in practice and is highly scalable. (This is, of course, a general requirement to be fulfilled by almost all software methods and systems.)
Smooth XML interface. The method or system should provide a smooth and user-friendly way of translating the extracted data into XML in order to make it accessible to further processing, e.g. via XML query engines or well-known transformation languages such as XSLT. Ideally, the translation to XML is done automatically on the basis of the information gathered from the designer during the process of defining extraction patterns.
Clearly, a method and system fulfilling all these requirements is highly desirable and useful. The paper “Content Integration for E-Business” (M. Stonebraker and J. M. Hellerstein, Proceedings of SIGMOD 2001) presents some of the challenges of content integration: “In short, powerful, easy-to-use tools are needed to address the broad challenges of cleaning, transforming, combining and editing content. These tools must be targeted at typical, non-technical content managers. In order to be useable the tools must be graphical and interactive, so that content managers can see the data as it is mapped. Any automated techniques must be made clearly visible, so that domain experts can edit and adjust the results. The development of semi-automatic content mapping and integration tools represents a new class of systems challenges, at the nexus of query processing, statistical and logical mapping techniques, and data visualization.” The disclosed teachings are aimed at realizing some of the advantages and overcoming some of the disadvantages noted herein.
4. References
The following documents provide background information helpful in understanding this disclosure, and to that extent, they are incorporated herein by reference. They are referred to, using the abbreviated notations shown below, in subsequent discussions to indicate specific relevance wherever necessary.
(1) U.S. Patent Documents
[U1] U.S. Pat. No. 5,826,258, Gupta et al., 1998
[U2] U.S. Pat. No. 5,841,895, Huffmann, 1998
[U3] U.S. Pat. No. 5,860,071, Ball et al., 1999
[U4] U.S. Pat. No. 5,898,836, Freivald et al., 1999
[U5] U.S. Pat. No. 5,913,214, Madnick et al., 1999
[U6] U.S. Pat. No. 5,983,268, Freivald et al., 1999
[U7] U.S. Pat. No. 6,102,969, Christianson et al., 2000
[U8] U.S. Pat. No. 6,128,655, Fields et al., 2000