Given the rapid growth of public information sources on the World Wide Web, it has become increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Data extracted from Web sites can serve a variety of tasks, including information retrieval for business intelligence, event monitoring (e.g., news and stock market monitoring), and electronic commerce (e.g., shopping comparisons). For example, a company may extract performance specifications from the corporate Web sites of purveyors in order to choose components for its products.
Extracting semi-structured data from Web sites is not a simple task. Most of the information on the Web today is in the form of Hypertext Markup Language (HTML) or Portable Document Format (PDF) documents, which are displayed by a browser or viewer. Because the format of HTML documents is designed for presentation rather than automated extraction, and because some of the HTML content on the Web is ill-formed due to incorrect coding, extracting data from such documents can be very difficult. PDF documents, while not malformed, contain low-level coordinate information that is suitable for display but makes automated extraction even more difficult. For example, identifying a table of data from its coding can be difficult: the browser may display data that lines up in rows and columns even though nothing in the coding indicates that a table exists. Table identification is thus complicated by the lack of an exact correlation between what the Web browser displays and the coding that generated the display.
The most common way of extracting information from the Web is by generating a wrapper program. A wrapper program is usually handwritten code for extracting information from a specific document type. In other words, one type of wrapper program is written for HTML documents, while another type is needed for PDF documents, and so on. Handcrafting wrappers has many disadvantages: it is tedious and time consuming, and maintaining a wrapper once it has been created requires extensive resources.
Wrapper programs are usually written with a priori knowledge of the structure of the Web page and the location of the data being extracted from the Web page. Some wrapper languages require the use of absolute HTML paths that point to the data item to be extracted. An absolute path describes the navigation down an HTML tree, starting from the top of the tree (the <HTML> tag) and proceeding towards the child nodes that contain the data to be extracted. The path is absolute because it spells out every step to the data, listing the tag names expected to be seen in the tree and their positions. For instance, an absolute path to the third table, first row, and second column in an HTML document could be expressed as:
/HTML/BODY/TABLE[3]/TR[1]/TD[2].
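The navigation that such a path describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular wrapper language: the sample page, its contents, and the helper name resolve_absolute_path are hypothetical, and the markup is assumed to be well-formed so that the standard-library XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML is often ill-formed).
PAGE = """<HTML><BODY>
<TABLE><TR><TD>navigation</TD></TR></TABLE>
<TABLE><TR><TD>banner</TD></TR></TABLE>
<TABLE><TR><TD>Clock speed</TD><TD>2.4 GHz</TD></TR></TABLE>
</BODY></HTML>"""


def resolve_absolute_path(root, path):
    """Walk from the root tag down to the node named by an absolute path.

    Each step is a tag name with an optional 1-based position, e.g. TABLE[3].
    Returns the matching element, or None if any step cannot be followed.
    """
    steps = path.strip("/").split("/")
    if root.tag != steps[0]:
        return None
    node = root
    for step in steps[1:]:
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", step)
        if match is None:
            raise ValueError(f"unsupported path step: {step!r}")
        tag = match.group(1)
        position = int(match.group(2) or 1)
        # Only children with the expected tag name are counted.
        candidates = [child for child in node if child.tag == tag]
        if position < 1 or position > len(candidates):
            return None
        node = candidates[position - 1]
    return node


root = ET.fromstring(PAGE)
cell = resolve_absolute_path(root, "/HTML/BODY/TABLE[3]/TR[1]/TD[2]")
cell_text = cell.text if cell is not None else None
print(cell_text)  # 2.4 GHz
```

Every step depends on both a tag name and a position, so the path encodes the exact layout of the page at the time the wrapper was written.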
However, the absolute path approach is likely to fail when the target HTML page changes. The most common change performed during Web site maintenance is changing the positioning of items on the page. New content (e.g., advertising) is frequently added to a page, or existing content is moved to a new location on the page. This changes the absolute location of tags and renders the established absolute HTML path useless. For this reason, it is important to establish the location of data items independently of their absolute paths. However, a wrapper program written with absolute paths retains none of the formatting information of the document, which makes this impossible. As a result, the wrapper's absolute path for a particular Web site must be updated each time the target Web page changes, a process which is both costly and time consuming.
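The brittleness can be illustrated with a short Python sketch. The two page versions, their contents, and the function name extract are hypothetical, and the markup is assumed to be well-formed so that the standard-library xml.etree.ElementTree module can parse it. Because ElementTree paths are evaluated relative to the root element, the leading /HTML step is omitted from the path.

```python
import xml.etree.ElementTree as ET

# The path /HTML/BODY/TABLE[3]/TR[1]/TD[2], written relative to the <HTML> root.
ABSOLUTE_PATH = "BODY/TABLE[3]/TR[1]/TD[2]"

# Hypothetical page as it looked when the wrapper was written.
ORIGINAL = (
    "<HTML><BODY>"
    "<TABLE><TR><TD>navigation</TD></TR></TABLE>"
    "<TABLE><TR><TD>Links</TD><TD>Contact us</TD></TR></TABLE>"
    "<TABLE><TR><TD>Clock speed</TD><TD>2.4 GHz</TD></TR></TABLE>"
    "</BODY></HTML>"
)

# The same page after maintenance: an advertising table was added at the top.
REVISED = ORIGINAL.replace(
    "<BODY>",
    "<BODY><TABLE><TR><TD>Advertisement</TD></TR></TABLE>",
)


def extract(page):
    """Return the text at ABSOLUTE_PATH, or None if the path no longer resolves."""
    cell = ET.fromstring(page).find(ABSOLUTE_PATH)
    return None if cell is None else cell.text


before = extract(ORIGINAL)
after = extract(REVISED)
print(before)  # 2.4 GHz
print(after)   # Contact us
```

In this example the path still resolves after the change, but it now points at the wrong table, so the wrapper returns incorrect data without reporting any error.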
Several approaches have attempted to work around some of these problems. Gupta et al. (5,826,258) attempted to organize disparate semistructured resources by providing a wrapper that extracts information and supplies structured information to a mapper coupled to a standard relational database engine. The occurrences of patterns in the semistructured information are cataloged by name and position in a nested structure. While this approach did not utilize a priori information, it still generated a wrapper program to access the attributes in the semistructured information as tuples for a relational database.
A paper titled “Learning Information Extraction Rules for Semi-structured and Free Text” (University of Washington) describes an information extraction system which utilizes training sets to teach an information extractor what information is to be extracted. To create the training set, examples are provided both of “good” information to be extracted and of “bad” information which should not be extracted. The information extractor develops patterns based upon these examples and applies the patterns to new documents. This approach has two disadvantages: building the training set is time consuming, and the resulting extractor is inflexible when it encounters data which falls outside the spectrum of the training set.
Another approach, described in a paper titled “Conceptual-model-based Data Extraction from Multiple-record Web Pages” (Brigham Young University), uses HTML tags to detect record boundaries or sections. The HTML markup is then discarded in the actual data extraction phase. Unfortunately, this data extraction scheme focuses on unstructured documents that are data rich but narrow in ontological breadth. In other words, the data extraction works only upon documents within a narrowly defined domain. As described in the paper, the method was developed for extracting information from obituary articles. This is of little value for most needs, where the scope of the information being extracted is not so narrowly defined.