1. Field of the Invention
The present invention relates to the technical field of data extraction and, more particularly, to an example-based concept-oriented data extraction method that is adapted to free text, such as the source code of web pages or general articles.
2. Description of Related Art
With the development of information technology and the explosive growth of the World Wide Web (WWW), more and more information is available online. Many web pages provide useful information, such as weather forecasts, stock quotes, book prices, and so on. However, these web pages are human-readable but not machine-understandable, so the information on them is difficult for a machine to manipulate. One way to handle web information more effectively is to extract data from web pages to populate databases for further manipulation.
The conventional approach for extracting data from web pages is to write application-specific programs, named wrappers, which are able to locate data of interest on particular web pages. However, writing wrappers is a tedious, error-prone, and time-consuming process. Furthermore, wrapper programs usually cannot be reused once the formatting convention of the targeted web pages changes. In such a case, the painful wrapper writing process has to be repeated.
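By way of illustration only, the following minimal Python sketch shows why a handcrafted wrapper is tied to one formatting convention. The page fragments, the book title, and the names `PRICE_PATTERN` and `extract_price` are hypothetical and are not taken from any particular prior-art wrapper.

```python
import re

# Hypothetical page fragments: the same book price under two formatting conventions.
OLD_PAGE = ('<tr><td>Title</td><td><b>Data Mining</b></td></tr>'
            '<tr><td>Price</td><td><b>$45.00</b></td></tr>')
NEW_PAGE = ('<div class="title">Data Mining</div>'
            '<div class="price">USD 45.00</div>')

# A handcrafted wrapper: the pattern hard-codes the table layout of OLD_PAGE.
PRICE_PATTERN = re.compile(r'<td>Price</td><td><b>\$([0-9]+\.[0-9]{2})</b></td>')


def extract_price(html):
    """Return the price as a string, or None if the hard-coded layout is absent."""
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


old_price = extract_price(OLD_PAGE)  # the layout matches the hard-coded pattern
new_price = extract_price(NEW_PAGE)  # the layout changed, so nothing is found
```

In this sketch the wrapper finds the price on the original layout but returns nothing once the page is reformatted, so the pattern would have to be rewritten by hand.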
Many methods have been proposed to generate wrappers automatically or semi-automatically, so as to avoid the laborious and error-prone process of handcrafting wrappers. These methods can be classified into two approaches. The first approach is to develop languages specially designed to assist users in constructing wrappers. The other approach is to use labeled examples to generate wrappers.
Although using specially designed languages to build wrappers can reduce the effort to some extent, this approach still inherits the drawbacks of manually building wrappers with general-purpose languages, such as Perl and Java. The example-based approach, in contrast, consists of two phases: rule induction and data extraction. In the rule induction phase, possible contextual rules are generated to specify the local contextual patterns around the labeled data. In the data extraction phase, these contextual rules are used to locate and extract the targeted data on new web pages. This approach is based on the assumption that the induced contextual rules are able to precisely locate the targeted data. However, due to imperfect rule induction or insufficient examples, the induced rules sometimes also locate undesired data. This kind of error may propagate and cause the data extractor to fail to grab the targeted data, even though the contexts of the targeted data satisfy the contextual rules. In addition, in the prior art, the representation form of the contextual rules is predefined and the induced rules must be strictly obeyed. As a result, the user must label a large number of examples so that all possible contexts around the targeted data can be taken into account.
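By way of illustration only, the two phases of the example-based approach can be sketched in Python as follows. The sketch assumes a deliberately simple rule form, namely the longest context shared by all labeled examples immediately to the left and to the right of the labeled data. The prior-art systems are not limited to this form, and the function names and page fragments are hypothetical.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_rule(examples):
    """Rule induction: derive (left, right) context from (page, target) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)            # position of the labeled data
        lefts.append(page[:start])            # everything before it
        rights.append(page[start + len(target):])  # everything after it
    return common_suffix(lefts), os.path.commonprefix(rights)


def extract(page, rule):
    """Data extraction: return every string enclosed by the induced contexts."""
    left, right = rule
    pattern = re.escape(left) + r'(.*?)' + re.escape(right)
    return re.findall(pattern, page)


# Two labeled examples (hypothetical pages with the targeted price marked).
examples = [
    ('<p>Price: <b>$45.00</b></p>', '$45.00'),
    ('<p>Our price: <b>$12.50</b></p>', '$12.50'),
]
rule = induce_rule(examples)

# A new page on which only the sale price is wanted.
new_page = ('<p>List price: <b>$30.00</b></p>'
            '<p>Sale price: <b>$19.99</b></p>')
found = extract(new_page, rule)
```

With only two examples, the induced left context is too general to tell the list price from the sale price, so `found` contains both values. This is the kind of undesired match described above, and under this rule form it can only be removed by labeling more examples.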