This invention relates generally to information retrieval and integration systems, and more particularly, to the creation and generation of wrapper grammars for extracting and regenerating information from documents stored in networked repositories.
The World Wide Web (the "web" or "WWW") is an architectural framework for accessing documents (or web pages) stored on a worldwide network of distributed servers called the Internet. An information source is any networked repository, e.g., a corporate database, a WWW site or any other processing service. Documents stored on the Internet are defined as web pages. The architectural framework of the web integrates web pages stored on the Internet using links. Web pages consist of elements that may include text, graphics, images, video and audio. All web pages or documents sent over the Web are prepared using HTML (hypertext markup language) format or structure. An HTML file includes elements describing the document's content as well as numerous markup instructions, which are used to display the document to the user on a display.
Access to online information via the Web is exploding. Search engines must integrate a huge variety of repositories storing this information in heterogeneous formats. While all files sent over the Web are prepared using HTML format, the heterogeneity issue remains both in terms of search query formats and search result formats. Search engines must provide for homogeneous access (to the underlying heterogeneity of the information) and allow for homogeneous presentation of the information found.
A wrapper is a type of interface or container that is tied to data; it encapsulates and hides the intricacies of a remote information source in accordance with a set of rules known as a grammar or a wrapper grammar, providing two functions to an information broker. First, wrappers are used to translate a client query to a corresponding one that the remote information source will understand. Wrappers are associated with the particular information source. For example, HTTP wrappers interact with HTTP servers and HTML documents; JDBC wrappers work with ODBC-compliant databases; and DMA wrappers work with DMA-compliant document management systems.
Second, wrappers are used by search engines to extract the information stored in the HTML files representing the individual web pages; the wrapper scans the HTML files returned by the search engine, drops the markup instructions and extracts the information related to the query. If an information broker is involved, the wrapper parses (or processes) the results into a form that can be interpreted and filtered by the information broker. Then the wrapper takes the search answers, either from the different document repositories or from the information broker, and puts them in a new format that can be viewed by the user. Extraction and parsing are done in accordance with the grammar or rules for the particular type of response file.
Unfortunately, the HTML response files of document repositories and providers are generated for the convenience of visualization rather than information extraction. Moreover, response files from different information providers vary widely both in structure and in format: HTML, ODBC, DMA. Even among HTML providers, the format may vary. For example, some providers may generate HTML tags to separate each attribute of the document (author, title, journal, and date of publication). Other providers may link attributes, such as author and title, together, separating them not by an HTML tag, but by a grammatical separator such as a comma or semicolon.
As a result, the analysis of response files and the creation of wrapper grammars in most search engines require human intervention. As the Web providers evolve over time and as individual documents may change over time, human intervention is also needed each time the response structure or markup is changed. This makes the process of the wrapper grammar creation and maintenance extremely time-consuming and error-prone.
Automatic induction (generation) of wrapper grammars has been studied in the literature. For example, Chidlovskii et al., "Towards Sophisticated Wrapping of Web-based Information Repositories," Proc. Int'l RIAO'97 Conference, Montreal, pp. 123-135, 1997, describe a semi-automatic approach for wrapping of Web-based information repositories involving high-level text-processing tools based on grammar rules. While this method allows processing of any regular search result by a high-level grammar, it is not HTML oriented and is thus prone to errors or to stopping mid-analysis.
N. Kushmerick, Wrapper Induction for Information Extraction, Ph.D. Dissertation, Dept. Computer Science and Eng., University of Washington, Seattle, Wash. and Wrapper Induction: Efficiency and Expressiveness, AAAI'98 Workshop on AI and Information Integration, AAAI-98, identified some subclasses of HTML wrapper grammars which can be efficiently inferred. These particular subclasses assume a tabular structure of items on the response page. The wrapper grammar inference is therefore reduced to the efficient detection of tag sequences preceding each attribute in such a tabular form.
I. Muslea et al., STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources, AAAI'98 Workshop on AI and Information Integration, 1998, considered a wider set of HTML wrapper grammars. This method goes beyond tables and also induces wrapper grammars in cases when some attributes are missing or their appearance order changes on the response page.
Despite reported success rates of about 65% (N. Kushmerick) and 75% (I. Muslea et al.) on real information providers, both approaches have obvious limitations, for example, in treating disjunction (A or B) and "list of lists" or "nested lists" cases.
In addition to the limitations of the above approaches, two main problems affect automatic wrapper grammar generation. The first problem is the ambiguous markup of response pages by some Web-based information providers, which makes automatic wrapper grammar generation difficult at best and in some cases impossible. For example, suppose a Web provider reports a list of answers, with each item containing two string-valued attributes t1 and t2, with t2 being optional. Additionally, the provider may use one and the same format for every attribute. That is, the HTML file structure of a response page from this Web provider looks as follows: (<i>String(t1)</i> (<i>String(t2)</i>)?)+. Assume that the wrapper grammar has been generated, that it has correctly guessed this format, and that it receives the following response from the provider: <i>string1</i><i>string2</i>. While the wrapper grammar uniquely assigns string1 to attribute t1, recognizing string2 is ambiguous; the wrapper grammar may assign string2 to either attribute t1 (of a second item) or attribute t2 (nondeterministic choice). Clearly, such behavior is unacceptable for correct attribute extraction, and t2 should therefore be excluded during the wrapper grammar generation.
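The ambiguity in this example can be made concrete with a short sketch. It assumes, purely for illustration, that a response page is reduced to its sequence of italic strings and that the item format is (t1 t2?)+; the helper name `labelings` is hypothetical and is not part of the described method.

```python
import re

def labelings(n):
    """Return every way to label n consecutive italic strings under (t1 t2?)+.

    labelings(0) is the empty labeling, used as the base case once all
    strings have been consumed by complete items.
    """
    if n == 0:
        return [[]]
    results = []
    # Every item starts with a mandatory t1 string ...
    for rest in labelings(n - 1):
        results.append(["t1"] + rest)
    # ... optionally followed by a t2 string in the same markup.
    if n >= 2:
        for rest in labelings(n - 2):
            results.append(["t1", "t2"] + rest)
    return results

page = "<i>string1</i><i>string2</i>"
strings = re.findall(r"<i>(.*?)</i>", page)

# Each labeling is one valid parse of the same page.
for labels in labelings(len(strings)):
    print(list(zip(strings, labels)))
```

Two parses are printed for the two-string page: one treats string2 as the t1 of a second item, the other as the optional t2 of the first item. Nothing in the markup favors either reading.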
The second main problem in automatically generated wrapper grammars is over-generalization. For example, a grammar like (<HTML> | </HTML> | <body> | </body> | . . . | String)* will accept any HTML file, but it is incapable of properly assigning tokens (or specific values) to the defined user attributes (Title, Author, etc.). Over-generalization originates from a grammar inference mechanism which detects some common fragments in the sample input strings and generalizes them by merging them into a single attribute. More precisely, over-generalization results from inadequate or missing control over merges, which produces a grammar so general that it accepts more than is allowed for correct attribute extraction.
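As a rough illustration of over-generalization, the sketch below contrasts two regular expressions. The page layout (a bold title followed by an italic author) and both patterns are assumptions made up for this example; they are not taken from any real provider.

```python
import re

# Over-general grammar: any sequence of tags and strings is accepted,
# but no token is tied to a user attribute.
OVER_GENERAL = re.compile(r"(?:<[^>]+>|[^<]+)*")

# Specific grammar for the hypothetical layout: tags delimit each attribute.
SPECIFIC = re.compile(r"<b>(?P<Title>[^<]+)</b>, <i>(?P<Author>[^<]+)</i>")

page = "<b>Wrapper Induction</b>, <i>N. Kushmerick</i>"
other = "<p>unrelated</p>"

# The over-general pattern accepts both pages and extracts nothing.
print(OVER_GENERAL.fullmatch(page) is not None)
print(OVER_GENERAL.fullmatch(other) is not None)
print(OVER_GENERAL.fullmatch(page).groupdict())

# The specific pattern rejects the unrelated page and assigns attributes.
print(SPECIFIC.fullmatch(other) is None)
print(SPECIFIC.fullmatch(page).groupdict())
```

The over-general pattern is useless for extraction precisely because it never fails: acceptance carries no information about where Title or Author values sit.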
There is a need for a method of automatic wrapper grammar generation that provides unambiguous attribute assignment. There is a further need for a method of automatic wrapper grammar generation which does not over-generalize fragments of strings into a single attribute. There is also a need for a method of automatic wrapper grammar generation that can generate wrapper grammars from disjunctive cases and list of lists cases. There is a need for a method of automatic wrapper grammar generation that can be used for any information source type format. There is a need for a method of automatic wrapper grammar generation that minimizes the need for human intervention. There is a need for a method of automatic wrapper grammar generation that provides for easy updating of the wrapper grammar when an information source revises its format.
A method for generating a wrapper grammar for a file having a structure of a particular format, according to the invention, includes providing at least one sample file of the particular format, wherein the particular format comprises a plurality of string tokens. While any particular type format file may be used (such as HTML, JDBC, DMA) with the invention, for convenience, only HTML files will be discussed hereafter. Each sample HTML file includes a plurality of tokens (data strings) which may be actual data from the document, an HTML tag or some other grammatical separator.
The sample file of the particular format is then processed by annotating attributable tokens with attributes from a set of attributes to generate an annotated sample set. A token is attributable if it can be assigned to an attribute. The tokens are, for example, annotated or labeled by assigning a user-defined attribute, such as Author, Title, etc., to those tokens for which such an attribute is defined by an appropriate user.
The annotated sample set is then evaluated to determine if wrapper grammar generation is possible, and if wrapper grammar generation is possible, a wrapper grammar for the files having a structure of the particular format is generated.
Preferably, the annotated sample set is evaluated by determining if all attributes in the annotated sample set are distinguishable from one another. Distinguishability is determined by generating a set of reverse prefixes for each attribute ti, partitioning the attribute set into equivalence classes di, where no two equivalence classes have common reverse prefixes, and if the equivalence classes are equal to the attributes, di=ti, then all attributes ti are distinguishable and automatic wrapper grammar generation is possible.
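The distinguishability test can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed procedure: an annotated sample is taken to be a list of (token, attribute) pairs, a reverse prefix is taken to be the k tokens immediately preceding an annotated token, read backward, and data strings in a prefix are normalized to the placeholder "String". All function names are hypothetical.

```python
from collections import defaultdict

def reverse_prefixes(samples, k):
    """Collect, per attribute, the set of length-k reverse prefixes."""
    prefixes = defaultdict(set)
    for sample in samples:
        # Assumption: keep tags as-is, replace every other token by "String".
        tokens = [t if t.startswith("<") else "String" for t, _ in sample]
        for pos, (_, attr) in enumerate(sample):
            if attr is not None:
                before = tokens[max(0, pos - k):pos]
                prefixes[attr].add(tuple(reversed(before)))
    return prefixes

def equivalence_classes(prefixes):
    """Group attributes so that no two classes share a reverse prefix."""
    classes = []  # pairs of (attribute set, prefix set), pairwise disjoint
    for attr, prefs in prefixes.items():
        attrs, merged = {attr}, set(prefs)
        remaining = []
        for other_attrs, other_prefs in classes:
            if other_prefs & prefs:
                attrs |= other_attrs
                merged |= other_prefs
            else:
                remaining.append((other_attrs, other_prefs))
        classes = remaining + [(attrs, merged)]
    return [attrs for attrs, _ in classes]

def is_distinguishable(samples, k):
    """True when every equivalence class holds exactly one attribute."""
    return all(len(c) == 1 for c in equivalence_classes(reverse_prefixes(samples, k)))

ambiguous = [[("<i>", None), ("string1", "t1"), ("</i>", None),
              ("<i>", None), ("string2", "t2"), ("</i>", None)]]
clear = [[("<b>", None), ("a title", "Title"), ("</b>", None),
          ("<i>", None), ("a name", "Author"), ("</i>", None)]]

print(is_distinguishable(ambiguous, k=1))
print(is_distinguishable(clear, k=1))
```

With k=1, both t1 and t2 in the first sample are preceded by the same tag and fall into one class, so generation would not proceed for them as separate attributes; Title and Author in the second sample have different preceding tags and remain separate.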
The method for generating a wrapper grammar according to the invention overcomes the problems of the manual approach to wrapper grammar generation, eliminating time-consuming and error-prone human intervention. The method for generating a wrapper grammar according to the invention uses techniques from grammatical inference and machine learning and can be used with a much larger class of wrapper grammars than those considered above.
A grammar is a set of rules that together define what may be "spoken" or displayed. In the context of the invention, the wrapper is the interface program that implements the grammar rules. For example, when the wrapper is compiled with a Java compiler compiler, which is a parser generator for use with Java applications, the Java compiler compiler converts the wrapper grammar to a Java program (i.e., the wrapper) that can recognize matches from response pages to the grammar.
A basis for the method of the invention is the assumption that a response page can be covered by a regular grammar. More precisely, a response page is assumed to be covered by a k-reversible regular grammar, where k ≥ 0. Although the class of reversible grammars is a proper subset of the regular grammar class, the difference between the two classes is minimal so the two can be considered the same for purposes of wrapper grammar generation. Moreover, it can be proven that attribute acceptors induced for the two classes (regular grammars and k-reversible grammars) have the same expressive power. Thus, when the method of the invention is applied to real Web information providers to generate wrapper grammars based on regular grammars, wrapper grammars are successfully generated nearly 100% of the time.
The method is incremental; it does not require the annotated samples to be necessarily complete, that is, representing all structural elements of the responses. The method is capable of refining the wrapper grammar each time a new HTML response contains a structural element not used in previously processed HTML responses.
The method of wrapper grammar generation according to the invention overcomes the problems of both ambiguity and over-generalization. Ambiguity and over-generalization are eliminated by detecting a proper level of commonness between different fragments of sample strings. Thus the depth of merge operations is determined in such a way that the wrapper grammar generated automatically will accept all strings similar to those in the sample set and, at the same time, will be able to recognize tokens corresponding to user-defined attributes.
Moreover, the method of wrapper grammar generation according to the invention is capable of generating a wrapper grammar in partial cases, i.e., when only some of the user-defined attributes can be distinguished by the automatic grammar generation. In this case, the method provides a partial solution; it merges attributes it cannot recognize into a newly defined joint attribute and then completes the automatic grammar generation. The automatically generated wrapper grammar will recognize the individual attributes, if present, or the joint attribute if present. For example, if user attributes Journal and Volume are not distinguished by the automatic wrapper grammar generation, they can be substituted by a joint attribute called Reference, which will be used in the automatic grammar generation. When the wrapper grammar is used to scan HTML files, if the particular file contains separate tokens for Journal and Volume, they will be found. If only a single token is found, then it will be reported as the attribute Reference.
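The relabeling step of this partial solution might look like the sketch below. It covers only the substitution of a joint attribute for a group of indistinguishable attributes in an annotated sample; how the generated grammar later recognizes individual attributes when they do appear separately is not modeled. The function names, the (token, attribute) sample layout and the default joint name built with "+" are all assumptions for illustration.

```python
def joint_attribute_map(classes, joint_names=None):
    """Map each attribute to itself, or to a joint attribute for its class.

    classes     -- iterable of attribute sets (indistinguishable groups)
    joint_names -- optional dict from frozenset of attributes to a chosen name
    """
    joint_names = joint_names or {}
    mapping = {}
    for cls in classes:
        members = frozenset(cls)
        if len(members) == 1:
            name = next(iter(members))  # distinguishable: keep its own name
        else:
            name = joint_names.get(members, "+".join(sorted(members)))
        for attr in members:
            mapping[attr] = name
    return mapping

def relabel(sample, mapping):
    """Replace attributes in an annotated sample; unannotated tokens stay None."""
    return [(token, mapping.get(attr, attr)) for token, attr in sample]

classes = [{"Title"}, {"Journal", "Volume"}]
names = {frozenset({"Journal", "Volume"}): "Reference"}
mapping = joint_attribute_map(classes, names)

sample = [("<b>", None), ("Some Title", "Title"), ("</b>", None),
          ("J. Example", "Journal"), ("12", "Volume")]
print(relabel(sample, mapping))
```

After relabeling, the Journal and Volume tokens both carry the joint attribute Reference, so grammar generation can proceed on a sample set in which all remaining attributes are distinguishable.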