1. Field of the Invention
This invention relates to information wrapper generating methods and more particularly to a machine learning method for wrapper construction that enables easier generation of wrappers.
2. Description of the Related Art
With the expansion of the Web, computer users have gained access to a large variety of comprehensive information repositories. However, the Web is based on a browsing paradigm that makes it difficult to retrieve and integrate data from multiple sources. The most recent generation of information agents (e.g., WHIRL, Ariadne, and Information Manifold) addresses this problem by enabling information from pre-specified sets of Web sites to be accessed via database-like queries. For instance, consider the query "What seafood restaurants in L.A. have prices below $20 and accept the Visa credit card?" Assume that there are two information sources that provide information about L.A. restaurants: the Zagat Guide and L.A. Weekly (see FIG. 1). To answer this query, an information agent could use Zagat to identify seafood restaurants under $20 and then use L.A. Weekly to check which of these accept Visa.
Information agents generally rely on "wrappers" to extract information from semistructured Web pages. A page is semistructured if the desired information can be located using a concise, formal grammar. Each wrapper consists of a set of extraction rules and the code required to apply those rules to the semistructured Web pages. Some systems, such as TSIMMIS and ARANEUS, depend on humans to write the necessary grammar rules. However, there are several reasons why this is undesirable. Writing extraction rules is tedious, time consuming, and requires a high level of expertise. These difficulties are multiplied when an application domain involves a large number of existing sources or the format of the source documents changes over time. It would be much more advantageous to have information agents that could accommodate the flexibility and spontaneous nature of the Web so that desired information can be gleaned from any format of presentation.
Research on learning extraction rules has occurred mainly in two contexts: creating wrappers for information agents and developing general purpose information extraction systems for natural language text. The former are primarily used for semistructured information sources, and their extraction rules rely heavily on the regularities in the structure of the documents; the latter are applied to free text documents and use extraction patterns that are based on syntactic and semantic information.
With the increasing interest in accessing Web-based information sources, a significant number of research projects depend on wrappers to retrieve the relevant data. A wide variety of languages have been developed for manually writing wrappers (i.e., where the extraction rules are written by a human expert), from procedural languages and Perl scripts to pattern matching and LL(k) grammars. Even though these systems offer fairly expressive extraction languages, manual wrapper generation is a tedious, time consuming task that requires a high level of expertise. Furthermore, the rules for the wrappers have to be rewritten whenever the sources suffer format changes. In order to help the users cope with these difficulties, Ashish and Knoblock proposed an expert system approach that uses a fixed set of heuristics of the type "look for bold or italicized strings".
The wrapper induction techniques introduced in WIEN (Kushmerick, 1997) are better suited to frequent format changes because they rely on learning techniques to generate the extraction rules. Compared to manual wrapper generation, Kushmerick's approach has the advantage of dramatically reducing both the time and the effort required to wrap a source; however, his extraction language is significantly less expressive than the ones provided by the manual approaches. In fact, the WIEN extraction language is a 1-disjunctive LA (landmark automaton, below) that is interpreted as a SkipTo( ) and does not allow the use of wildcards. There are several other important differences between STALKER (the present invention) and WIEN. First, as WIEN learns the landmarks by searching common prefixes at the character level, it needs more training examples than STALKER. Second, WIEN cannot wrap sources in which some items are missing or appear in various orders. Last but not least, STALKER can handle EC (embedded catalog) trees of arbitrary depths, while WIEN's approach to nested documents turns out to be prohibitive in terms of CPU time.
SoftMealy (Hsu and Dung) uses a wrapper induction algorithm that generates extraction rules expressed as finite transducers. The SoftMealy rules are more general than the WIEN ones because they use wildcards and they can handle both missing items and items appearing in various orders. The SoftMealy extraction language is a k-disjunctive LA, where each disjunct is either a SkipTo( )NextLandmark( ) or a single SkipTo( ). As SoftMealy uses neither multiple SkipTo( )s nor SkipUntil( )s, it follows that its extraction rules are strictly less expressive than STALKER's. Finally, SoftMealy has one additional drawback: in order to deal with missing items and various orderings of items, SoftMealy has to see training examples that include each possible ordering of the items.
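The following Python sketch illustrates, in a hedged way, how a rule built from several SkipTo( ) landmarks with wildcards might be applied to a tokenized page. It assumes that SkipTo(landmark) consumes tokens up to and including the first occurrence of the landmark; the tokenizer, the wildcard names (_Number_, _HtmlTag_), the function names, and the sample page are all hypothetical and are not taken from the specification.

```python
import re

# Hypothetical wildcard classes; names are illustrative assumptions only.
WILDCARDS = {
    "_Number_": lambda tok: tok.isdigit(),
    "_HtmlTag_": lambda tok: tok.startswith("<") and tok.endswith(">"),
}

def tokenize(text):
    """Split text into HTML tags, words/numbers, and single punctuation marks."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", text)

def token_matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    check = WILDCARDS.get(pattern)
    return check(token) if check else pattern == token

def skip_to(tokens, start, landmark):
    """Return the index just past the first occurrence of `landmark`
    (a list of patterns) at or after `start`, or None if it never occurs."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(token_matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None

def apply_rule(tokens, rule):
    """Apply a sequence of SkipTo landmarks in order; None means the rule fails."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos

# Made-up sample page fragment.
tokens = tokenize("<p> Name: <b>Yala</b><p> Phone: (800) 555-1212")

# Two chained SkipTo()s: SkipTo(Phone :) then SkipTo( ( ).
area_code_rule = [["Phone", ":"], ["("]]
area_code_start = apply_rule(tokens, area_code_rule)

# A single SkipTo() whose landmark consists of two wildcards.
after_two_tags = apply_rule(tokens, [["_HtmlTag_", "_HtmlTag_"]])
```

Here the first rule stops at the token that begins the area code, and the second stops just after the first pair of adjacent HTML tags. A rule whose landmark never occurs returns None, which is one simple way to signal a missing item.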
In contrast to information agents, most general purpose information extraction systems are based on unstructured text, and therefore their extraction techniques are based on linguistic constraints. However, there are three such systems that are somewhat related to STALKER: WHISK, Rapier, and SRV. The extraction rules induced by Rapier and SRV can use the landmarks that immediately precede and/or follow the item to be extracted, while WHISK is capable of using multiple landmarks. But, similarly to STALKER and unlike WHISK, Rapier and SRV extract a particular item independently of the other relevant items. It follows that WHISK has the same drawback as SoftMealy: in order to correctly handle missing items and items that appear in various orders, WHISK must see training examples for each possible ordering of the items. None of these three can handle embedded data, though all use powerful linguistic constraints that are beyond STALKER's capabilities.
The present invention provides means by which extraction rules for wrappers may be automatically generated when correct examples have been provided previously. Using a graphical user interface, a user marks or indicates information that is desired from a realm of similar data collections. For example, if one set of Web pages is marked for addresses, the graphical user interface (GUI) transmits or passes the relevant token sequences identifying the borders, perimeters, and/or prefix/suffix of the indicated portion to a rule-generating program/system denominated herein as STALKER. STALKER then takes these collections of token sequences in the context that they identify certain data fields of interest, in this case addresses. STALKER then takes the examples and generates rules by means of the token sequences and derivatives thereof in order to determine extraction rules for wrappers.
This process of generating extraction rules for wrappers is highly advantageous as the wrappers are then able to go out to other data collections, such as other Web pages, and extract the address or other desired information. This makes possible the coherent, controlled, predictable and facile generation and operation of information agents. Such agents can be unleashed upon data collections to extract the desired information. An information automaton is then achievable that may allow the user to gather information from an identified and semi-structured source. While suffering some limitations, the present invention may provide a stepping stone to an ultimate goal of harvesting information from unpredictable, but stable, information sources such as the Internet itself. The user does not need to know the range or extent of the information base, just that information is present and that a utility can be achieved by which information of interest can be extracted and returned to the user. The user then has control over the information he or she wants and can choose almost any kind or type of information for return from the vast information reservoir that is the Internet, or other data collection.
The primary contribution of the present invention is to turn a potentially hard problem (learning extraction rules) into a problem that is extremely easy in practice (i.e., typically very few examples are required). The number of required examples is small because the EC (embedded catalog) description of a page simplifies the problem tremendously: as the Web pages are intended to be human readable, the EC structure is generally reflected by actual landmarks on the page. STALKER merely has to find the landmarks, which are generally in close proximity to the items to be extracted. In other words, given the SLG (simple landmark grammar) formalism, the extraction rules are typically very small, and, consequently, they are easy to induce.
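To illustrate why landmarks close to the item make rules easy to induce, the following Python sketch derives a candidate landmark from a few marked examples by taking the longest token sequence that immediately precedes the item in every example. This is a deliberately simplified illustration and is not the induction procedure of the present invention; the function name and the sample token lists are hypothetical.

```python
def common_preceding_landmark(examples):
    """examples: list of (tokens, item_start) pairs, where item_start is the
    index of the first token of the marked item.
    Return the longest token sequence that immediately precedes the item
    in every example (an empty list if there is none)."""
    prefixes = [tokens[:start] for tokens, start in examples]
    landmark = []
    offset = 1
    while all(len(p) >= offset for p in prefixes):
        # Tokens found `offset` positions before the item in each example.
        candidates = {p[-offset] for p in prefixes}
        if len(candidates) != 1:
            break  # the examples disagree at this distance, so stop growing
        landmark.insert(0, candidates.pop())
        offset += 1
    return landmark

# Made-up training examples: the marked item is a restaurant name.
examples = [
    (["<p>", "Name", ":", "<b>", "Yala", "</b>", "Cuisine", ":", "Thai"], 4),
    (["<p>", "Name", ":", "<b>", "Killer", "Shrimp", "</b>"], 4),
    (["<i>", "New", "</i>", "Name", ":", "<b>", "Cafe", "</b>"], 6),
]

candidate = common_preceding_landmark(examples)
```

With these three examples the candidate landmark is the token sequence Name : <b>, found after only a handful of marked pages because the landmark sits directly in front of the item.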
It is an object of the present invention to provide a system for generating extraction rules for wrappers or the like.
It is another object of the present invention to provide a wrapper rule generator that is easy to use.
It is yet another object of the present invention to provide a wrapper rule generator that is reliable.
It is yet another object of the present invention to provide a wrapper rule generator that can tolerate exceptions or irregularities in data patterns.
It is yet another object of the present invention to provide a wrapper rule generator that has a high percentage of correct extractions.
These and other objects and advantages of the present invention will be apparent from a review of the following specification and accompanying drawings.