1. Field of the Invention
This invention relates in general to data extraction processes and, in particular, to extracting data from structured documents.
2. Description of Related Art
With the fast-growing popularity of the Internet and the World Wide Web (“WWW”), there is a growing demand for techniques for extracting information from different web sites and storing the extracted information in a standard format. For example, a computer user may be interested in gathering information from the WWW about cars, luggage, or travel destinations. The user may wish to store this information in a user-defined format that allows the user to compare the attributes of each subject.
To illustrate, assume that a user is interested in gathering information about cars from several car-related web sites. More specifically, assume that the user is interested in gathering information on individual cars, including the manufacturer, the model, the year, the color, and the price.
Traditional techniques for solving this information gathering problem are typically based on knowledge of the structure used to arrange data within each specific web site. (The structure used to arrange the data within a page is commonly referred to as the syntax of the page.) These techniques require prior determination of the syntax of each page and storage of syntax information about each page in a data storage device, such as a database.
When gathering information about a subject from a particular page, the traditional techniques identify the attributes of the subject by comparing the structure of the page with the stored syntax information. When there is a match, the traditional techniques return the corresponding attribute values to the user.
These traditional techniques are limited because they can only gather attribute values from a page whose syntax has been previously determined and stored. Accordingly, traditional techniques are generally incapable of gathering information from redesigned or restructured web pages or from new web pages, because they lack syntax information about these pages. For both the redesigned or restructured web pages and the new web pages, the traditional techniques require effort and resources to determine and store information about their syntax before gathering attribute values. Determining the syntax can be time-consuming, and a large amount of storage space may be needed to store the syntax information.
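The traditional approach and its limitation can be illustrated with a minimal sketch. Everything in it is assumed for illustration only: the site names, the HTML layout, the regular-expression form of the stored syntax, and the `extract` helper are hypothetical and are not taken from any particular prior system.

```python
import re
from typing import Dict, Optional

# Hypothetical stored syntax information: one set of patterns per known site.
# A real system would keep this in a data storage device such as a database.
STORED_SYNTAX: Dict[str, Dict[str, str]] = {
    "cars-a.example": {
        "manufacturer": r'<td class="make">(.*?)</td>',
        "model": r'<td class="model">(.*?)</td>',
        "year": r'<td class="year">(.*?)</td>',
        "color": r'<td class="color">(.*?)</td>',
        "price": r'<td class="price">(.*?)</td>',
    },
}


def extract(site: str, page: str) -> Optional[Dict[str, str]]:
    """Return attribute values only if the page matches the stored syntax."""
    syntax = STORED_SYNTAX.get(site)
    if syntax is None:
        return None  # new site: no syntax has been determined and stored
    record: Dict[str, str] = {}
    for attribute, pattern in syntax.items():
        match = re.search(pattern, page)
        if match is None:
            return None  # page no longer matches the stored syntax
        record[attribute] = match.group(1)
    return record


# A page laid out the way the stored syntax expects.
known_page = (
    '<tr><td class="make">Honda</td><td class="model">Civic</td>'
    '<td class="year">1998</td><td class="color">red</td>'
    '<td class="price">$9,500</td></tr>'
)
# The same data after a hypothetical redesign of the site.
redesigned_page = "<div>Honda Civic, 1998, red, $9,500</div>"

print(extract("cars-a.example", known_page))       # attribute values found
print(extract("cars-a.example", redesigned_page))  # None: syntax changed
print(extract("cars-b.example", known_page))       # None: unknown site
```

In this sketch the redesigned page and the unknown site both yield nothing until new syntax information is determined and added to `STORED_SYNTAX`, which is the effort and storage cost described above.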
Thus, there is a need in the art for an improved technique of extracting information from any web page and any other structured document.