The present invention is directed to a method and apparatus for extracting data from data sources on a network and, more particularly, to a method and apparatus for learning general data extraction heuristics from known data extraction programs for respective data sources to obtain a general data extraction procedure.
Computer networks are widely used to facilitate the exchange of information. A network may be a local area network (LAN), a wide-area network (WAN), a corporate Intranet, or the Internet.
The Internet is a series of interconnected networks. Users connected to the Internet have access to the vast amount of information found on these networks. Online servers and Internet providers allow users to search the World Wide Web (Web), a globally connected network on the Internet, using software programs known as search engines. The Web is a collection of Hypertext Markup Language (HTML) documents on computers that are distributed over the Internet. The collection of Web pages represents one of the largest databases in the world. However, accessing information on individual Web pages is difficult because Web pages are not a structured source of data. Unlike a traditional database, a Web page has no standard organization for the information it provides.
Attempts have been made to address the problem of accessing data from Web pages. For example, information integration systems have been developed to allow a user to query structured information that has been extracted from the Web and stored in a knowledge base. In such systems, information is extracted from Web pages using special-purpose programs or "wrappers". These special-purpose programs convert Web pages into an appropriate format for the knowledge base. In order to extract data from a particular Web page, a user must write a wrapper, which is specific to the format of that Web page. Therefore, a different wrapper must be written for the format of each Web page that is accessed. Because data can be presented in many different formats, and because Web pages frequently change, building and maintaining wrappers and information integration systems is time-consuming and tedious.
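As an illustration only, the following minimal Python sketch shows what a hand-written, format-specific wrapper might look like and why it does not carry over to a page with a different layout. The page layout (items marked up as `<li><b>...</b></li>`), the class name `BoldListItemWrapper`, the function `wrap`, and both sample pages are hypothetical and are not taken from any particular system.

```python
from html.parser import HTMLParser


class BoldListItemWrapper(HTMLParser):
    """Hypothetical wrapper for pages that mark each item as <li><b>item</b></li>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.items = []   # extracted item strings

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text only when it sits directly inside <li><b>.
        if self.stack[-2:] == ["li", "b"] and data.strip():
            self.items.append(data.strip())


def wrap(html):
    """Run the format-specific wrapper over one page and return its items."""
    parser = BoldListItemWrapper()
    parser.feed(html)
    parser.close()
    return parser.items


# Two hypothetical pages presenting similar data in different formats.
page_a = "<ul><li><b>Alice</b> x101</li><li><b>Bob</b> x102</li></ul>"
page_b = "<table><tr><td>Alice</td><td>x101</td></tr></table>"

items_a = wrap(page_a)  # the wrapper matches this format
items_b = wrap(page_b)  # the same wrapper finds nothing in the table format
```

In this sketch the wrapper extracts both names from the first page and nothing from the second, which is the situation that forces a separate wrapper to be written for each page format.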
A number of proposals have been made for reducing the cost of building wrappers. Data exchange standards such as the Extensible Markup Language (XML) have promise, but such standards are not yet widely used. In addition, Web information sources using legacy formats, like HTML, will be common for some time, and therefore, extraction methods must be able to extract information from these legacy formats. Special languages for writing wrappers and semi-automated tools for wrapper construction have been proposed, as well as systems that allow wrappers to be trained from examples. However, none of these proposals eliminates the human effort involved in creating a wrapper for a Web page. Moreover, the training methods are directed to learning a wrapper for Web pages with a single, specific format. Consequently, a new training process is required for each Web page format.
More particularly, when a learning system is used, a person must label the samples given to the learning algorithm. Specifically, a user must label the first few items that should be extracted from the particular Web page, starting from the top of the page. The labeled items are assumed to be a complete list of the items to be extracted up to that point. That is, it is assumed that any unmarked text preceding the last marked item should not be extracted. The learning system then learns a wrapper from these examples, and uses it to extract data from the remainder of the Web page. The learned wrapper can be used for other Web pages with the same format as the page used in training. Therefore, in the learning system, human input is required to determine the page-specific wrapper.
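The labeling workflow described above can be sketched as follows. The learning algorithm of any particular prior system is not specified here, so this sketch substitutes a deliberately simple delimiter-based learner: it takes the longest text that consistently precedes and follows the labeled items and treats those two strings as the wrapper. The function names, the delimiter approach, and the sample pages are all assumptions made for illustration.

```python
import os
import re


def apply_wrapper(wrapper, page):
    """Extract every span that sits between the learned left/right delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.DOTALL)


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(page, labeled_items):
    """Learn (left, right) delimiters from the first few items labeled top-down."""
    lefts, rights = [], []
    pos = 0
    for item in labeled_items:
        start = page.index(item, pos)   # raises ValueError if a label is absent
        end = start + len(item)
        lefts.append(page[:start])      # everything before the labeled item
        rights.append(page[end:])       # everything after the labeled item
        pos = end
    wrapper = (common_suffix(lefts), os.path.commonprefix(rights))
    if not all(wrapper):
        raise ValueError("no consistent delimiters around the labeled items")
    # Completeness assumption: up to the last label, nothing unlabeled may match.
    labeled_region = page[:pos + len(wrapper[1])]
    if apply_wrapper(wrapper, labeled_region) != list(labeled_items):
        raise ValueError("wrapper is inconsistent with the labeled prefix")
    return wrapper


# Hypothetical training page; the user labels only the first two items.
train_page = ("<h1>Staff</h1><ul><li><b>Alice</b>, sales</li>"
              "<li><b>Bob</b>, billing</li><li><b>Carol</b>, legal</li></ul>")
wrapper = learn_wrapper(train_page, ["Alice", "Bob"])

names_train = apply_wrapper(wrapper, train_page)  # also picks up the unlabeled item

# A different hypothetical page that shares the training page's format.
other_page = "<h1>Staff</h1><ul><li><b>Dan</b>, ops</li><li><b>Eve</b>, legal</li></ul>"
names_other = apply_wrapper(wrapper, other_page)
```

Even in this toy form, the wrapper is usable only on pages that share the training page's format, and a person still has to supply the labeled items for each new format.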
These problems are not limited to retrieving data from HTML documents. These problems exist for documents found on any network.
Therefore, a general, page-independent data extraction procedure was needed to enable a user to easily and accurately extract data from data sources having many different formats. Additionally, an improved format-specific data extraction procedure was needed to accurately extract data from data sources. A procedure was also needed for determining a ranked list of possible data extraction procedures available for accurately extracting data from a data source. The present invention was developed to accomplish these and other objectives.
In view of the foregoing, it is a principal object of the present invention to provide a method and apparatus that eliminate the deficiencies of the prior art.
It is a further object of the present invention to provide a method and apparatus for learning general data extraction heuristics to generate a general data extraction procedure to enable a user to extract data from a data source on a network, regardless of the format of the data source.
It is another object of the present invention to provide a method and apparatus for learning a general data extraction procedure and for using this procedure to learn a format-specific wrapper.
It is yet a further object of the present invention to provide a method and apparatus for generating a ranked list of wrappers available for accurately extracting data for a particular data source on a network.
These and other objects are achieved by the present invention, which according to one aspect, provides a method and apparatus for learning a general data extraction procedure from a set of working wrappers and the data sources they correctly wrap. New data sources that are correctly wrapped by the learned procedure can be incorporated into a knowledge base.
According to another aspect of the present invention, a method and apparatus are provided for using the learned general data extraction heuristics of the general procedure to learn a specific data extraction procedure for each respective data source.
According to yet another aspect of the present invention, a list of possible wrappers for a data source is generated, where the wrappers in the list are ranked according to performance level.
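For illustration, a ranked list of candidate wrappers could be produced along the following lines. This passage does not state how the performance level is measured, so the sketch assumes an F1 score computed against a small set of known-correct items; the candidate wrappers, their names, and the sample page are likewise hypothetical.

```python
import re


def f1_score(extracted, expected):
    """Harmonic mean of precision and recall over the sets of extracted items."""
    extracted, expected = set(extracted), set(expected)
    hits = len(extracted & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def rank_wrappers(candidates, page, expected):
    """Return (score, name) pairs for all candidate wrappers, best first."""
    scored = [(f1_score(extract(page), expected), name)
              for name, extract in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical candidate wrappers, each a function from page text to items.
candidates = {
    "any_bold": lambda page: re.findall(r"<b>(.*?)</b>", page),
    "bold_list_item": lambda page: re.findall(r"<li><b>(.*?)</b></li>", page),
    "table_cell": lambda page: re.findall(r"<td>(.*?)</td>", page),
}

page = "<ul><li><b>Alice</b></li><li><b>Bob</b></li></ul><p><b>Note</b></p>"
expected = ["Alice", "Bob"]  # assumed known-correct items for this page

ranking = rank_wrappers(candidates, page, expected)
ranked_names = [name for _, name in ranking]
```

Here the wrapper that matches the list markup exactly ranks first, the over-general bold-text wrapper ranks second because it also extracts the unrelated note, and the table wrapper ranks last because it extracts nothing.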