The present invention relates to the field of automated information retrieval in the context of document processing. Particularly, the present invention relates to a system and associated method for discovering a majority schema from a set of related documents that share similar but not identical schemas.
The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages contain redundant information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
The authors of web pages provide information known as metadata within the body of the hypertext markup language (HTML) document that defines the web pages. A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links from page to page. The crawler indexes the pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempt to find matches in real time.
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms, and returns the search results in the form of HTML pages. Each search result includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or "hit" may include a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
In addition to the hyperlink, certain search result pages include a short summary or abstract that describes the content of the URL location. Typically, search engines generate this abstract from the file at the URL, and provide acceptable results for URLs that point to HTML format documents. For URLs that point to HTML documents or web pages, a typical abstract includes a combination of values selected from HTML tags. These values may include text from the web page's "title" tag, from what are referred to as "annotations" or "meta tag values" such as "description", "keywords", etc., from "heading" tag values (e.g., H1, H2 tags), or from some combination of the content of these tags.
Automatic programs, such as web crawlers (also known as spiders or robots), visit web sites and extract information. For example, comparison shopping search engines visit web sites describing information, such as prices, and extract semantic information from these sites. Given the format variances between topically related web pages, the retrieved data are oftentimes unhelpful, unrelated, or difficult to extract.
The present invention addresses the need to build search engines that allow users to formulate structural queries like "find a student with a Master's degree and a GPA of 3.5 or more and skills in Java." Heretofore, there has been no fully adequate mechanism that allows the extraction of structural information buried in web pages that cater to the same topic but are authored with significantly different styles.
Several attempts have been made to address this need, exemplary of which are the following references that generally describe methods of investigating the structure of documents and retrieving documents from large databases in response to user queries:
Rodrigo A. Botafogo, Ben Shneiderman, "Identifying Aggregates in Hypertext Structures," Proceedings of ACM Hypertext '91, pp. 63-74.
IBM Almaden Research Center, "All searches start at Grand Central," Network World, front page, November 1997.
Tao Guan, Kam-Fai Wong, "KPS: a Web Information Mining Algorithm," WWW8/Computer Networks 31(11-16): 1495-1507 (1999).
Seongbin Park, "Structural Properties of Hypertext," Proceedings of the Ninth ACM Conference on Hypertext, pp. 180-187, 1998.
Svetlozar Nestorov, Serge Abiteboul, Rajeev Motwani, "Inferring Structure in Semistructured Data," SIGMOD Record 26(4): 39-43 (1997).
Svetlozar Nestorov, Serge Abiteboul, Rajeev Motwani, "Extracting Schema from Semistructured Data," SIGMOD Conference 1998, pp. 295-306.
Ke Wang and H. Q. Liu, "Discovering Association of Structure from Semistructured Objects," IEEE Trans. on Knowledge and Data Engineering, 1999.
U.S. Pat. No. 5,694,592 to Driscoll describes a method of querying and retrieving documents from a database using semantic knowledge about the query string to determine document relevancy.
U.S. Pat. No. 5,848,407 to Ishikawa describes a method of presenting potentially related hypertext document summaries to a user who is using a search engine that indexes a plurality of hypertext documents.
However, the need for a system and associated method for discovering a majority schema (also referred to herein as a common schema) from a set of related documents that share similar but not identical schemas has remained unsatisfied. For example, consider HTML documents, such as resumes, that describe the same concept but are marked up differently. Some authors may describe a degree by date, name, and the institute granting the degree, while other authors may describe the degree by name, institute granting the degree, and date. Prospective employers searching for potential candidates may not pay attention to the order of description, and would rather have all degrees described in a conventional order. In addition, some candidates may include a hobbies section in their resumes, which information may be largely overlooked by employers. Briefly, prospective employers prefer to have a uniform view of the majority of the documents and to search the repository of documents under such a view.
Existing approaches do not offer a "majority schema" that is shared by most of the documents being searched, that presents a uniform and summary view of these documents, and that can be used to guide the transformation of the HTML documents to a global schema in data integration. This need has heretofore remained unsatisfied.
The present invention teaches a schema discovery system and associated method that satisfy this need. In accordance with one embodiment, the system discovers a majority schema for a set of related and similarly marked up documents, such as HTML documents, based on the assumption that, though the structure of these documents is mostly for visual purposes, the keywords used in the documents along with the structural tags provide some hints and allow a rough sketch of the underlying intended schema. It is further assumed that, although the HTML documents in the set are marked up differently due to diverse authoring skills, they are closely related in content. Therefore, it is reasonable to assume the presence of a schema that can unify these different schemas, which schema is shared by most (i.e., the majority) of these HTML documents.
The copending U.S. patent application Ser. No. 09/531,019 generally describes a process that uses visual clues and structural tags to extract basic schematic structures of HTML documents. The present invention describes a method that reconstructs a majority schema from these schematic structures. It also proposes a constraint-based mechanism for domain experts to specify domain knowledge, if any, that can help the reconstruction process. The algorithm used by the present system may be summarized by the following process:
1. Extract Schematic Structures
The schematic structures of markup documents are extracted and represented as sets of ordered trees with nodes labeled by a set of keywords supplied by the user. Keywords identify important concepts in these documents. Reordering rules are used to reconfigure the trees so that their structure more closely resembles the semantic structures of a predefined template, such as an HTML document.
2. Convert XML to Label Paths
The ordered trees are mapped to sets of paths, ignoring ordering and repetitive information. The assumption is that choosing an imprecise representation helps reveal common patterns.
3. Discover Frequent Label Paths
Prevalent patterns among the trees are presumed to be label paths that occur frequently among all the documents. A constraint mechanism is introduced for users to specify restrictions on the forms of schematic structures in the majority schema. This helps to reduce the search space and to filter out noise. The set of frequent label paths satisfying the constraints is then discovered.
4. Unify Similar Structures
Since the documents share similar but inexact schematic structures, there are repetitive structures among the discovered common tree structures. These repetitive structures are discovered by a clustering approach based on a simple intuitive notion of tree distance. The repetitive structures are then unified.
5. Convert Label Paths to DTD
The set of label paths is converted to a predefined structure schema, such as an XML DTD. Information lost in the inexact representation of trees can be recovered by heuristics.
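By way of illustration only, steps 2, 3, and 5 of the foregoing process may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present system: the tuple-based tree encoding, the function names label_paths, frequent_paths, and paths_to_dtd, the min_support threshold, the allowed constraint callback, and the permissive "(a | b)*" content model are all hypothetical choices made for the example. The clustering of step 4 and the heuristics that recover ordering are omitted because the summary above does not define them.

```python
from collections import Counter

# Assumed encoding: a tree node is (label, [child nodes]); labels stand in
# for the user-supplied keywords.

def label_paths(tree, prefix=()):
    """Step 2: map an ordered tree to its set of root-to-node label paths.
    Using a set discards sibling order and repetition."""
    label, children = tree
    path = prefix + (label,)
    paths = {path}
    for child in children:
        paths |= label_paths(child, path)
    return paths

def frequent_paths(docs, min_support, allowed=lambda path: True):
    """Step 3: keep label paths that satisfy the (hypothetical) constraint
    and occur in at least a min_support fraction of the documents."""
    counts = Counter()
    for tree in docs:
        counts.update(p for p in label_paths(tree) if allowed(p))
    threshold = min_support * len(docs)
    return {p for p, n in counts.items() if n >= threshold}

def paths_to_dtd(paths):
    """Step 5: emit one element declaration per label. Ordering was lost in
    step 2, so a permissive (a | b)* content model is assumed here."""
    children = {}
    for path in paths:
        children.setdefault(path[-1], set())
        if len(path) > 1:
            children.setdefault(path[-2], set()).add(path[-1])
    lines = []
    for label in sorted(children):
        kids = sorted(children[label])
        model = "(" + " | ".join(kids) + ")*" if kids else "(#PCDATA)"
        lines.append(f"<!ELEMENT {label} {model}>")
    return "\n".join(lines)

# Three hypothetical resumes marked up differently.
resumes = [
    ("resume", [("education", [("degree", []), ("date", [])]), ("skills", [])]),
    ("resume", [("education", [("date", []), ("degree", [])]), ("hobbies", [])]),
    ("resume", [("education", [("degree", [])]), ("skills", [])]),
]

majority = frequent_paths(resumes, min_support=0.6)
print(paths_to_dtd(majority))
```

In this example the "hobbies" path appears in only one of the three resumes and therefore falls below the assumed 60% support threshold, so it is left out of the resulting declarations, while the differently ordered "degree" and "date" entries collapse into the same label paths.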