Information in web pages and formatted documents is designed for human consumption and, hence, exhibits some visual pattern. The document author communicates this abstract visual pattern to the web browser using a specification language (for example, hypertext markup language (HTML), cascading style sheets (CSS), JavaScript code, etc.). Humans typically do not look at the specification language to understand the data. Rather, they look at the rendered version of the page through a browser. However, existing rule-based information extraction (IE) frameworks do not deal with the visual representation of a page. Instead, existing approaches look for patterns in the specification language. Thus, any rules that intend to exploit the visual cues in the layout need to be translated into equivalent rules based on the source code of the page.
As such, existing IE approaches have serious limitations, including, for example, the following. An abstract visual pattern can be implemented in many different ways by the web designer. For example, a tabular structure can be implemented using any of <table>, <div> and <li> tags, and only a fraction of tables are implemented using the <table> tag. Source-based rules that use layout cues need to cover all possible ways in which the layout can be achieved. A rule that relies on a specific implementation will fail on pages that use a different implementation, even if these pages exhibit the same visual pattern.
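By way of illustration only, the following minimal Python sketch shows how a source-based rule tied to one implementation can miss the same visual pattern implemented differently. The class TableTagRule, the function extract_cells, and both markup snippets are hypothetical names and data introduced for this example; the sketch also assumes that a style sheet (not shown) lays out the <div> version as a grid, so that both snippets look alike when rendered.

```python
from html.parser import HTMLParser


class TableTagRule(HTMLParser):
    """Hypothetical source-based rule: collect text found inside <td> tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a <td> element.
        if self._in_cell and data.strip():
            self.cells.append(data.strip())


def extract_cells(html):
    """Apply the <td>-based rule to a markup string and return the cell texts."""
    rule = TableTagRule()
    rule.feed(html)
    rule.close()
    return rule.cells


# Same visual pattern (assumed), two different implementations.
TABLE_VERSION = (
    "<table><tr><td>Name</td><td>Price</td></tr>"
    "<tr><td>Widget</td><td>$5</td></tr></table>"
)
DIV_VERSION = (
    '<div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>'
    '<div class="row"><div class="cell">Widget</div><div class="cell">$5</div></div>'
)

print(extract_cells(TABLE_VERSION))  # the rule finds all four cells
print(extract_cells(DIV_VERSION))    # the rule finds nothing
```

The rule succeeds on the <table> version and returns an empty list for the <div> version, even though a reader of the rendered page would see a table in both cases. Covering the second case requires a further source-based rule, and so on for every other implementation.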
Also, with existing approaches, the proximity of two entities in the HTML source code does not necessarily imply visual proximity, and so it may not be possible to encode visual proximity cues using simple source-based rules. Additionally, rules based on HTML tags and document object model (DOM) trees are often sensitive to even minor modifications of the web page, which makes rule maintenance difficult.
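The gap between source proximity and visual proximity can be sketched as follows. This is an illustrative example under stated assumptions, not a description of any particular system: the Box type, the pixel coordinates, and the two helper functions are hypothetical, and the layout assumed is a two-column page whose markup lists every label first and every value afterward.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    source_index: int  # order of the element in the markup
    x: float           # left edge of the rendered box, in pixels (hypothetical)
    y: float           # top edge of the rendered box, in pixels (hypothetical)


# Assumed two-column layout: labels in the left column, values in the right.
# In the markup, all labels come before all values.
boxes = [
    Box("Name:", 0, 0, 0),
    Box("Price:", 1, 0, 100),
    Box("Color:", 2, 0, 200),
    Box("Widget", 3, 60, 0),
    Box("$5", 4, 60, 100),
    Box("Red", 5, 60, 200),
]


def nearest_in_source(target, candidates):
    """Return the element closest to target in markup order."""
    others = [b for b in candidates if b is not target]
    return min(others, key=lambda b: abs(b.source_index - target.source_index))


def nearest_on_screen(target, candidates):
    """Return the element closest to target by rendered position."""
    others = [b for b in candidates if b is not target]
    return min(others, key=lambda b: math.hypot(b.x - target.x, b.y - target.y))


label = boxes[0]  # "Name:"
print(nearest_in_source(label, boxes).text)  # Price:
print(nearest_on_screen(label, boxes).text)  # Widget
```

In this assumed layout, the element nearest to the "Name:" label in the source is another label, whereas the element nearest to it on the rendered page is its value. A rule expressed over rendered positions captures the pairing directly; a rule expressed over source order does not.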
Further, challenges exist in pure text-based information extraction systems. For example, specification languages are becoming more complex and difficult to analyze. Also, visualization logic in JavaScript and CSS prevents text-based analysis. Further, there can be errors in the markup code, but browsers can still render the page accurately in most cases, so rules based on spatial layout would be more robust to these kinds of errors.