Tables and other spatially structured information on web pages contain a huge amount of visually explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition. Such web tables are easily discernable by human users by just looking at a rendered web page.
In contrast, the task of automatically extracting such information from web pages is difficult, because HTML was designed to convey visual rather than semantic information. HTML does not explicitly contain the information in a way that is understandable to programs. Also, the multitude of different HTML implementations of web tables makes it difficult to develop accurate and exhaustive rules to detect arbitrary web tables reliably.
Table extraction and interpretation are required by users who are interested in understanding the contents of a document. Other approaches have analyzed images of scanned documents by approximately calculating the bounding boxes of objects, grouping them into different classes and reconstructing the original intention of the author. Approaches to table extraction can be divided into two categories, top-down such as [Nagy and Seth, 1984] and bottom-up such as [Kieninger, 1998], depending on where the algorithms start. These approaches face the difficulty that the positional coordinates of individual boxes in the visual representation of the document are neither deterministic nor uniquely defined.
Known methods for extracting tables from web pages have focused on analyzing the source code of web pages. Penn et al. [Penn et al., 2001] defined genuine uses of HTML tables as document entities where the 2-D grid is semantically significant and described a couple of heuristics to distinguish genuine from non-genuine leaf <table> elements on web pages. Wang and Hu [Wang and Hu, 2002] trained a classifier on content features of individual cells and non-text layout features from the HTML source to perform the same task of table location. Chen et al. [Chen et al., 2000] employed heuristic rules to filter out non-genuine tables from their test set and made assumptions about cell content similarity for table recognition and interpretation. The method relied on the hierarchical HTML tag structure of the documents, most notably that of <table> tags. Yang and Luk [Yang and Luk, 2002] described how they extracted attribute-value pairs from 1-D or 2-D tables. Yoshida et al. [Yoshida et al., 2001] based their work on a general knowledge ontology and employed an expectation maximization algorithm to distinguish between attribute and value cells. They assumed that tables do not contain any spanned cells. Tengli et al. [Tengli et al., 2004] presented an algorithm that extracts tables and differentiates between label and data cells.
All these approaches have in common that they assume that relevant tables only appear inside leaf tables, that is, <table> tags that do not contain other nested <table> tags. In contrast, Lerman et al. [Lerman et al., 2004] mentioned that just a fraction of tables are actually created with <table> tags. In their algorithm, they leveraged the list page-detail page structure present in some websites to find boundaries between records in what the current inventor would classify as a substructured 1-D list. They also mentioned that layout is important for table extraction, but went on to say that this means that records are separated by HTML tags.
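The leaf-table notion used by these source-code approaches can be illustrated with a minimal sketch. It is not taken from any of the cited works; the class name LeafTableFinder and the function leaf_table_flags are hypothetical, and the sketch assumes reasonably well-formed markup parsed with Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LeafTableFinder(HTMLParser):
    """Flags each <table> as a leaf (no nested <table>) or non-leaf.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        # One entry per currently open <table>: [index, has_nested_table]
        self._stack = []
        # One flag per <table>, in order of the opening tags;
        # None means the table was never closed in the input.
        self.tables = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                # The enclosing table now contains a nested table.
                self._stack[-1][1] = True
            self._stack.append([len(self.tables), False])
            self.tables.append(None)

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            index, has_nested = self._stack.pop()
            self.tables[index] = not has_nested


def leaf_table_flags(html):
    """Return a list of booleans, True where the table is a leaf table."""
    finder = LeafTableFinder()
    finder.feed(html)
    finder.close()
    return finder.tables


page = (
    "<table><tr><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
    "<table><tr><td>standalone</td></tr></table>"
)
flags = leaf_table_flags(page)
```

A check of this kind only sees tables that are marked up with <table> tags, which is exactly the limitation noted by Lerman et al.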
However, none of the existing approaches provide a way to locate, extract and interpret tables from arbitrarily formatted web pages. What is needed in the arts is a way to recognize tables on web pages similar to the way human observers do, by looking at the visual representation. In contrast, we base our information extraction on positional information that is independent of the HTML tag structure and do not rely on particular HTML structures being present.
Others have explored analyzing the visual representation of web pages for web page segmentation, web form understanding and as an additional source for web information extraction.
Yang and Zhang [Yang and Zhang, 2001] described an approach which derives features directly from the layout of web pages. By using a “pseudo rendering process” they tried to detect “visual similarities” of HTML content objects. Gu et al. [Gu et al., 2002] described a top-down approach to segment a web page and detect its content structure by dividing and merging blocks. Kovacevic et al. [Kovacevic et al, 2002a/Kovacevic et al, 2002b/Kovacevic et al, 2003/Kovacevic et al, 2004] used visual information to build up an “M-tree”, a concept similar to the DOM tree enhanced with screen coordinates. They then used further defined heuristics to recognize common page areas such as header, left and right menu, footer and center of a page. Cai et al. [Cai et al, 2003/Yu et al, 2003/Cai et al, 2003a/Cai et al, 2003b/US RPA 2005-0028077/US RPA 2006-0106798] described a web page segmentation process that uses visual information from Internet Explorer. Their VIPS algorithm segments a DOM tree based on visual cues retrieved from the browser's rendition. Cosulschi et al. [Cosulschi et al., 2004] described an approach that uses positional information of DOM tree elements to calculate block correspondence between web pages.
In the information extraction literature, Zhao et al. [Zhao et al., 2005], Zhai and Liu [Zhai and Liu, 2005] and Simon and Lausen [Simon and Lausen, 2005] independently described approaches for detecting repetitive patterns (record boundary detection) on web pages. All three of these approaches are predominantly source-code based and enhanced with visual cues. In contrast, Rosenfeld et al. [Rosenfeld et al, 2002/Rosenfeld et al, 2002/Aumann et al, 2006] described a system that works only on a hierarchical structure of the visual representation and learns to recognize text fields such as author or title from manually tagged training sets of documents. Our approach, by comparison, does not attempt to find individual text fields but rather larger structures, does not require training sets, and does not impose a hierarchical tree structure on the overall web page.
Cohen et al. [Cohen et al., 2002] mentioned “rendering” HTML code and using the results for detecting relational information in web tables. Their approach, however, does not actually render web pages, but rather infers relative positional information of table nodes in an abstract table model with relative positional information deduced from the source code. Nor does it mention the idea of using the calculated metadata information from rendering for interpretation. Nor does it observe that much metadata information is contained in word or text boxes, which do not physically exist as separate boxes in the DOM tree. In contrast, Kruepl et al. [Kruepl et al., 2005] described a top-down web table location mechanism working exclusively on visual information obtained from the Mozilla web browser. The approach worked on word bounding boxes after manipulation of the DOM tree. Tables were detected by first determining these visualized words and then grouping them together with the help of space density graphs and recursive application of the existing X-Y cut algorithm. This approach was later adapted in [Kruepl and Herzog, 2006] to a bottom-up clustering algorithm, likewise starting with word bounding boxes. The problem with this approach is that it has difficulty deducing the individual logical cells of tables and their logical relation to each other (the logical table model). Also, visual metadata information visible to the human observer, such as background colors, which is relevant for interpreting tables, is lost in the process.
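For orientation, recursive X-Y cutting over word bounding boxes can be sketched as follows. This is a generic, simplified illustration and not the implementation of the cited work, which additionally used space density graphs. The box format (x0, y0, x1, y1), the function names, and the min_gap threshold (assumed to be positive) are all assumptions made for this example.

```python
def _widest_gap(boxes, lo, hi, min_gap):
    """Find the widest empty interval in the projection onto one axis.

    lo/hi are the tuple indices of the box's start and end on that axis.
    Returns (gap_width, cut_position) or None if no gap reaches min_gap.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best = None
    reach = intervals[0][1]  # furthest end covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + start) / 2)  # cut in the middle of the gap
        reach = max(reach, end)
    return best


def xy_cut(boxes, min_gap=10):
    """Recursively split boxes (x0, y0, x1, y1) at the widest whitespace gap.

    Returns a list of blocks, each block being a list of boxes that cannot
    be separated by a horizontal or vertical gap of at least min_gap.
    """
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    x_gap = _widest_gap(boxes, 0, 2, min_gap)  # gap between columns
    y_gap = _widest_gap(boxes, 1, 3, min_gap)  # gap between rows
    if x_gap is None and y_gap is None:
        return [list(boxes)]
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        axis, cut = 0, x_gap[1]
    else:
        axis, cut = 1, y_gap[1]
    # The cut lies inside an empty gap, so every box falls on exactly one side.
    first = [b for b in boxes if b[axis + 2] <= cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Four word boxes arranged as a 2 x 2 grid (hypothetical coordinates).
words = [(0, 0, 40, 10), (100, 0, 140, 10), (0, 30, 40, 40), (100, 30, 140, 40)]
cells = xy_cut(words, min_gap=10)    # splits columns first, then rows
columns = xy_cut(words, min_gap=30)  # row gap of 20 is below the threshold
```

The result is a flat list of blocks. It says nothing about which block is a header or how cells relate logically, which reflects the limitation described above.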
None of the existing approaches provide a way to locate, extract and interpret tables from arbitrarily formatted web pages.
None of the existing approaches eliminates the difficulties of clearly recognizing the individual units of tables and their respective relation to each other. At the same time, none of the existing approaches provides a way to retrieve metadata information of text (such as bold or size 14), which allows interpreting the reading order and thus the information contained in web tables.