This invention relates generally to data extraction from structured documents. More particularly, this invention relates to the use of clustering and alignment algorithms in data extraction in order to minimize the need for operator input.
A need exists to extract only the data from documents that combine data and presentation elements. Such documents may include Internet documents such as Internet pages. These documents may contain the data for data fields, described in more detail below, and the data may be structured in HTML (HyperText Markup Language), a language that combines the data and the presentation information.
Many Internet pages having data may be included in a single web site. Nevertheless, the Internet pages may have similar, albeit slightly different, structures. The goal of a typical content aggregator is to retrieve, normalize and format the data for later use. The normalization and formatting of the data allow for greater control over the presentation of the retrieved data. Such normalization and formatting may include storing the data in a form-field table.
A form-field table for storing items for an Internet shopping site may include fields such as name of item, description of item, and price of item. It should be noted that while the examples of this patent application deal primarily with data extraction for a content aggregator related to Internet shopping, the principles described and claimed herein may relate to any suitable content aggregator, such as an intelligence system, a search engine, etc.
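By way of illustration only, one row of such a form-field table might be sketched as follows. The names `ItemRecord` and `normalize_item`, and the particular normalization rules (collapsing whitespace, removing a leading dollar sign and thousands separators), are assumptions made for this sketch and are not prescribed by this application.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ItemRecord:
    """One row of a hypothetical form-field table for a shopping site."""
    name: str
    description: str
    price: Decimal


def normalize_item(raw_name: str, raw_description: str, raw_price: str) -> ItemRecord:
    """Normalize raw extracted strings into a single form-field row.

    Assumed rules: runs of whitespace collapse to one space, and the price
    is a US-style string such as "$1,019.99".
    """
    price_text = raw_price.strip().lstrip("$").replace(",", "")
    return ItemRecord(
        name=" ".join(raw_name.split()),
        description=" ".join(raw_description.split()),
        price=Decimal(price_text),
    )


record = normalize_item("  Desk   lamp ", "Adjustable\nLED lamp", " $1,019.99 ")
```

Storing the price as a `Decimal` rather than as display text is one possible design choice; it keeps the stored value independent of how any particular source page happened to format it.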
A key element in any data extraction method is the process of identifying the location in source documents of the elements from which to extract the data. Most prior art systems for extracting data from homogeneous (i.e., substantially similarly formatted) structured documents, such as homogeneous HTML documents, rely on regular expressions, PERL (Practical Extraction and Report Language, a scripting language well suited to pattern matching) or other scripting methods in order to identify those elements. A conventional scripting method may require writing a script. A script is a set of instructions for accessing the information in a particular document or group of documents. For example, a script may be instructions to extract a particular piece of information by jumping to the fourth cell in the third column of a table on a given page. With respect to such a system, each site, and, in many cases, each page, requires a separate script.
Those methods have several drawbacks. First, the process of defining the script is time-consuming and labor-intensive. Furthermore, such methods may require experienced personnel to define the scripts. Finally, the scripts are very sensitive to small changes in the source documents and cannot accommodate changes made to pages after the script has been written. Therefore, each introduction of a new page on a website, or alternatively, each introduction of a new structure for an existing page, requires that a new script be written, or at least adapted, to conform to the new page.
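The positional script described above, and its sensitivity to small structural changes, may be sketched as follows. This is an illustrative example only: the helper names `TableCellParser` and `extract_cell`, the sample pages, and the use of Python's standard `html.parser` module are assumptions of the sketch and do not represent any particular prior art system.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows; each row is a list of cell texts
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_cell(html: str, row: int, column: int) -> str:
    """Hypothetical 'script': return the cell at a fixed 1-based position."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.rows[row - 1][column - 1]


ORIGINAL_PAGE = """
<table>
  <tr><td>Lamp</td><td>Desk lamp</td><td>$19.99</td></tr>
  <tr><td>Mug</td><td>Ceramic mug</td><td>$7.50</td></tr>
  <tr><td>Pen</td><td>Gel pen</td><td>$2.25</td></tr>
  <tr><td>Clip</td><td>Binder clip</td><td>$0.99</td></tr>
</table>
"""

# A small redesign: the site adds one leading cell to every row.
REDESIGNED_PAGE = ORIGINAL_PAGE.replace("<tr>", "<tr><td>NEW</td>")

# The script was written to read the price: fourth cell of the third column.
print(extract_cell(ORIGINAL_PAGE, row=4, column=3))    # $0.99
print(extract_cell(REDESIGNED_PAGE, row=4, column=3))  # Binder clip
```

The same script that returned a price from the original page returns a description from the redesigned page without raising any error, which is why each structural change requires the script to be rewritten or adapted.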
It would be desirable to provide systems and methods that extract data from documents in a way that is more efficient than conventional scripting methods.
It would also be desirable if such systems and methods could be adapted to be substantially automated in order to reduce the labor-intensive nature of data extraction.