1. Field of the Invention
The present invention relates to a system and method of identifying web page semantic structures.
2. Introduction
In spite of recent progress on the semantic web and interchange formats like XML, most available web pages today are still written in HTML and designed mainly for humans, not machines, to read. Information on HTML pages is conveyed not only by the stream of text but also by the layout of the page. For instance, the web page in FIG. 2 consists of a form and a horizontal menu at the top, a heading-content visual block and a vertical menu on the left, and several heading-content and normal-content descriptions in the center. Humans can easily recognize this structure by following visual and language clues. A variety of web-based applications have begun to exploit web page semantic structures. For example, web page layout extraction is a fundamental component of AT&T's WebTalk, a framework for automatically constructing dialog systems using company websites. Others have used web page semantic structures to adaptively display web pages on small devices or to build domain-specific product extraction systems such as DataRover, which is based on a web page segmentation algorithm.
However, automatically recognizing web page semantic structures is by no means an easy task. An HTML developer can choose to use templates, white space, images, tables, dozens of HTML tags, hundreds of HTML attributes, or a combination of them to artistically lay out a page. The HTML source code for rendering the same web page can differ dramatically from one developer to another.
The Document Object Model (DOM) is widely used as the representation model of HTML documents. FIG. 3 shows a DOM tree fragment for the web page in FIG. 2. Several DOM-based heuristic algorithms have been developed for discovering the semantic structures of web pages. These algorithms are motivated by two key observations:
First, contiguous leaf nodes on the DOM tree are semantically related if they have similar root-to-leaf tag paths. Based on this observation, researchers have developed a web page segmentation algorithm that takes the DOM tree as input and collects the root-to-leaf tag path for each leaf node on the tree. A segment boundary is found when the tag-path similarity between two contiguous leaf nodes is below a predefined threshold δ. Based on the same observation, a more complex algorithm has been proposed to group the leaf nodes in the DOM into a semantic partition tree. See Saikat Mukerjee, GuiZhen Yang, WenFang Tan, I. V. Ramakrishman, “Automatic Discovery of Semantic Structures in HTML Documents”, ICDAR 2003, incorporated herein by reference.
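The tag-path segmentation idea described above can be illustrated with a minimal sketch. It is not the cited researchers' implementation: the similarity measure (shared prefix length divided by the longer path length), the default threshold of 0.75, and the names LeafPathCollector, path_similarity, and segment are all assumptions introduced here for illustration.

```python
from html.parser import HTMLParser

# Tags that never enclose content; they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafPathCollector(HTMLParser):
    """Collect (root-to-leaf tag path, text) for each non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.stack), text))


def path_similarity(a, b):
    """Assumed measure: shared prefix length over the longer path length."""
    if not a and not b:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def segment(html, threshold=0.75):
    """Start a new segment when contiguous leaves fall below the threshold."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    segments = []
    prev_path = None
    for path, text in collector.leaves:
        if prev_path is None or path_similarity(prev_path, path) < threshold:
            segments.append([])
        segments[-1].append(text)
        prev_path = path
    return segments
```

For a page whose menu items sit under html/body/ul/li and whose paragraphs sit under html/body/div/p, the two paths share only two of four tags, so a boundary is placed between the menu and the paragraphs. The choice of threshold strongly affects the result, which is one reason such heuristics are hard to tune across differently authored pages.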
Second, semantic blocks on a web page are often separated by visual separators such as lines, blank areas, images, font sizes, colors, etc. A Vision-based Page Segmentation (VIPS) algorithm has been proposed to detect the semantic content structure in a web page. VIPS makes use of the DOM structure as well as visual cues of DOM tree nodes, including position, color, font size, font weight, etc. A list of heuristic rules is implemented to determine the visual blocks. An example of such a rule is to divide a DOM node if its background color differs from that of one of its children.
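The background-color rule mentioned above can be sketched as follows. This is a simplified illustration of a single heuristic, not the VIPS algorithm itself. The Node class, the function names, and the treatment of an unset background as showing the parent's background are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified DOM node carrying one visual cue."""
    tag: str
    bgcolor: Optional[str] = None  # None: not set, parent's background shows
    children: List["Node"] = field(default_factory=list)


def effective_bg(node, parent_bg):
    """Background actually seen for a node (assumption: unset means parent's)."""
    return node.bgcolor.lower() if node.bgcolor else parent_bg


def should_divide(node, parent_bg="#ffffff"):
    """Rule: divide a node if any child's background differs from its own."""
    bg = effective_bg(node, parent_bg)
    return any(effective_bg(child, bg) != bg for child in node.children)


def visual_blocks(node, parent_bg="#ffffff"):
    """Recursively apply the rule; nodes that are not divided become blocks."""
    bg = effective_bg(node, parent_bg)
    if not should_divide(node, parent_bg):
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(visual_blocks(child, bg))
    return blocks
```

A white body containing a dark navigation div and an uncolored content div would be divided into those two blocks, while the navigation div itself stays whole because its links share its background. A practical system would combine many such rules, each with its own exceptions.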
These observations and the algorithms discussed above address part of web page authors' intent for layout presentation. However, many richer and more complex cues, such as language features, geometric cues, and miscellaneous HTML attributes, are heavily encoded by web page authors and used by web page readers but remain unexploited. As a result, these algorithms are effective only in limited circumstances.
An HTML document is encoded through HTML tags (such as “<font>”), attributes (such as “color”), attribute values (such as “color=#003355”), as well as text (such as “ZOOM VARplus Program” in FIG. 2). The HTML 4.01 Specification, which is incorporated herein by reference along with updates to the HTML protocol, specifies 91 HTML tags and 119 attributes. These govern the structure and presentation of the rendered web page as well as interactivity with it. Given this complexity, it is difficult, if not impossible, to develop a heuristic algorithm that appropriately takes into account the large number of factors contributing to the web page semantic structure.
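To make the four kinds of encoding concrete, the following minimal sketch separates a fragment into tags, attribute name and value pairs, and text using Python's standard html.parser module. The class name TokenLister and the function list_tokens are hypothetical, and the fragment in the usage note combines the tag, attribute value, and text examples given above purely for illustration.

```python
from html.parser import HTMLParser


class TokenLister(HTMLParser):
    """Record tags, (attribute, value) pairs, and text seen in a fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.attributes = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.attributes.extend(attrs)  # attrs is a list of (name, value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def list_tokens(html):
    """Return (tags, attributes, texts) for an HTML fragment."""
    parser = TokenLister()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.attributes, parser.texts
```

Calling list_tokens('<font color="#003355">ZOOM VARplus Program</font>') yields one tag, one attribute pair, and one text string. Even this tiny fragment contributes several distinct signals, which suggests how quickly the number of factors grows on a full page.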