1. Field of Invention
The present invention generally relates to document processing. The method and apparatus of the present invention have particular application to extracting schematic information from a set of documents.
2. Discussion of Prior Art
The world wide web (throughout this specification, web, www, and world wide web are used interchangeably) is presently growing at an average of 1 million pages per day and is an amazing source of information. All of this information is buried in HTML documents authored by a wide variety of people with differing skills, cultures, and purposes, using a wide variety of authoring tools. HTML does give structure to the authored documents, but mainly for viewing purposes. HTML has a fixed set of tags which are mostly used to enhance the visual appeal of the documents. Thus, it often happens that HTML pages written for the same purpose use different sets of tags. For instance, people often mark up their resumes in HTML on their home pages. Depending on the styles used, these resume documents look significantly different from each other. This difference is acceptable for viewing purposes, but presents great difficulties to automatic programs which try to extract pertinent information from them.
The web is no longer used just for browsing; automatic programs, such as web crawlers, visit web sites and extract information to serve search engines or push engines. Comparison shopping engines visit web sites describing similar information, such as prices, and extract semantic information from these sites. Given the format variances possible between topically related web pages, retrieved data is often unhelpful, unrelated, or difficult to extract.
There is therefore a need to build search engines that allow users to formulate structural queries like "find a student with a Master's degree and a GPA of 3.5 or more and skills in Java." The present invention addresses this need by allowing the extraction of structural information buried in HTML pages which cater to the same topic but are authored in significantly different styles.
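By way of illustration only, the following sketch shows how a structural query of the kind quoted above might be evaluated once resume documents share a common XML structure. The element names (resume, education, degree, gpa, skills, skill), the sample data, and the function find_candidates are all hypothetical and are not taken from this specification; the sketch uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical resumes that already follow one common schema.
# Element names and values are illustrative assumptions only.
RESUMES = """
<resumes>
  <resume>
    <name>A. Student</name>
    <education><degree>Master's</degree><gpa>3.7</gpa></education>
    <skills><skill>Java</skill><skill>SQL</skill></skills>
  </resume>
  <resume>
    <name>B. Student</name>
    <education><degree>Bachelor's</degree><gpa>3.9</gpa></education>
    <skills><skill>Java</skill></skills>
  </resume>
  <resume>
    <name>C. Student</name>
    <education><degree>Master's</degree><gpa>3.2</gpa></education>
    <skills><skill>Java</skill></skills>
  </resume>
</resumes>
"""


def find_candidates(xml_text, degree, min_gpa, skill):
    """Return names of resumes matching degree, minimum GPA, and skill."""
    root = ET.fromstring(xml_text.strip())
    matches = []
    for resume in root.findall("resume"):
        # The degree and GPA must be satisfied by the same education entry.
        has_degree = any(
            edu.findtext("degree") == degree
            and float(edu.findtext("gpa", default="0")) >= min_gpa
            for edu in resume.findall("education")
        )
        has_skill = skill in [s.text for s in resume.findall("skills/skill")]
        if has_degree and has_skill:
            matches.append(resume.findtext("name"))
    return matches


if __name__ == "__main__":
    print(find_candidates(RESUMES, "Master's", 3.5, "Java"))
```

A plain keyword search for "Master's", "3.5" and "Java" could not express the numeric comparison or require that the degree and the GPA belong to the same education entry; the common structure is what makes such a query possible.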
Some specific prior art related to the present invention is discussed below. These references describe methods of investigating the structure of documents and retrieving documents from large databases in response to user queries.
Two articles which describe attempts to discover structure from semi-structured data are "Identifying Aggregates in Hypertext Structures", Proceedings of ACM Hypertext '91, pp. 63-74, and "Structural Properties of Hypertext", Proceedings of the Ninth ACM Conference on Hypertext, pp. 180-187, 1998. These attempts, however, focus on the organization of a set of hypertext documents by following their links rather than considering the schematic nature of the individual documents.
Three other articles describing related investigations are: "Inferring Structure in Semistructured Data", Workshop on Management of Semistructured Data, 1997; "Extracting Schema from Semistructured Data", SIGMOD98, pp. 295-306; and "Discovering Association of Structure from Semistructured Objects", IEEE Trans. on Knowledge and Data Engineering, 1999. However, these articles do not consider the schematic structure of individual documents or documents which have different schemas.
The patent to Driscoll (U.S. Pat. No. 5,694,592) teaches a method of querying and retrieving documents from a database using semantic knowledge about the query string to determine document relevancy.
The patent to Ishikawa (U.S. Pat. No. 5,848,407) describes a method of presenting potentially related hypertext document summaries to a user who is using a search engine that indexes a plurality of hypertext documents.
Whatever the precise merits and features of the prior art in this field, the earlier art does not achieve or fulfill the purposes of the present invention. The prior art does not provide for automatically identifying schematic structural and tag information from HTML documents and then converting these documents according to the extracted information.
The present invention describes a system and method that extracts keywords and structural information from hypertext or mark-up language documents (e.g., HTML) and then reformulates them as documents with a common structure and a common set of tags. One underlying goal is to convert a collection of HTML documents, written in different styles, into XML documents following a common schema. XML, the Extensible Markup Language, defines a web standard for describing schemas for different domains. For instance, one domain might be resumes, and a schema can be defined for describing all resumes. All resume documents are then written using the structure and tags described by this schema. Thereafter, keyword-based search engines will be able to support queries and retrieve documents that are schematically and semantically closer to the information users are looking for. Using a five-stage process, the common schematic structures are discovered for the set of HTML documents authored in various styles. Prior domain knowledge regarding punctuation, keywords, synonyms and HTML tags is used to 1) break a document up into separate objects, 2) identify the objects corresponding to keywords, 3) regroup objects into hierarchical layers of abstraction, 4) logically order objects at the same level of abstraction, and finally 5) remove any non-keyword related information from the document's discovered schematic structure. The discovered schema supports structural queries from search engines that locate data that are more semantically related to the requested information than data located by simple keyword searching.
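By way of illustration only, the five stages summarized above might be sketched as follows. This is a simplified sketch under stated assumptions and is not the claimed implementation: the domain knowledge tables (SYNONYMS, PARENT, ORDER), the choice of punctuation marks, the exact-match rule for keywords, and all function names are hypothetical and chosen only to make the example self-contained.

```python
import re

# Hypothetical prior domain knowledge for a "resume" domain (assumptions).
SYNONYMS = {  # canonical keyword -> surface forms seen in documents
    "education": ("education", "academics"),
    "degree": ("degree",),
    "gpa": ("gpa", "grade point average"),
    "skills": ("skills", "expertise"),
}
PARENT = {"degree": "education", "gpa": "education"}  # child -> parent
ORDER = ["education", "degree", "gpa", "skills"]      # preferred ordering


def split_objects(html):
    """Stage 1: treat HTML tags and punctuation as object boundaries."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [p.strip() for p in re.split(r"[\n;:]+", text) if p.strip()]


def identify_keyword(obj):
    """Stage 2: map an object to a canonical keyword, or None."""
    lowered = obj.lower()
    for canonical, forms in SYNONYMS.items():
        if lowered in forms:
            return canonical
    return None


def regroup(labelled):
    """Stage 3: nest keyword objects under their parent keywords."""
    root = {"label": "root", "children": []}
    by_label = {"root": root}
    current = root
    for label, text in labelled:
        node = {"label": label, "text": text, "children": []}
        if label is None:
            # Non-keyword content stays with the most recent keyword.
            current["children"].append(node)
            continue
        parent = by_label.get(PARENT.get(label, "root"), root)
        parent["children"].append(node)
        by_label[label] = node
        current = node
    return root


def rank(node):
    """Position of a node's keyword in ORDER; unknown labels sort last."""
    return ORDER.index(node["label"]) if node["label"] in ORDER else len(ORDER)


def order(node):
    """Stage 4: order objects that sit at the same level of abstraction."""
    node["children"] = sorted((order(c) for c in node["children"]), key=rank)
    return node


def prune(node):
    """Stage 5: keep only keyword-related nodes in the schema."""
    return {c["label"]: prune(c)
            for c in node["children"] if c["label"] is not None}


def discover_schema(html):
    objects = split_objects(html)
    labelled = [(identify_keyword(o), o) for o in objects]
    return prune(order(regroup(labelled)))


SAMPLE = ("<h2>Expertise</h2><p>Java, SQL</p><h2>Academics</h2>"
          "<p>GPA: 3.8; Degree: Master of Science</p><p>Visitors: 1024</p>")

if __name__ == "__main__":
    print(discover_schema(SAMPLE))
```

In this sketch the sample page lists skills before education and uses the synonyms "Expertise" and "Academics"; the discovered schema nonetheless nests degree and gpa under education, places education before skills, and drops the unrelated "Visitors" counter. Repeated keywords and conflicting hierarchies are not handled here.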