Most Web content is made available through documents whose internal formats were conceived for presenting content on screen to human users (such as Web and PDF documents). In such documents, content is arranged to provide visual patterns that help human readers make sense of it. A human reader can therefore look at an arbitrary document, intuitively recognize its logical structure, and understand the layout conventions and complex visual patterns used in its presentation. This is particularly evident, for instance, in deep web pages, where Web designers always arrange data records and data items with visual regularity, and in tables, where the meaning of a cell entry is most easily defined by the leftmost cell of the same row and the topmost cell of the same column (in Western languages). Documents that are conceived for on-screen presentation to users are referred to as presentation-oriented documents (PODs).
Approaches have been proposed to automatically access data from PODs for purposes such as automatic information extraction from web and PDF documents. Existing automatic information extraction approaches can be classified into two main groups: (i) approaches that mainly use the internal representation of deep web pages, and (ii) approaches that exploit the visual appearance of deep web pages.
Approaches based on the internal document representation depend on the HTML structure of deep web pages. Such HTML-based approaches can be further classified as manual, supervised, and unsupervised. In manual approaches, a programmer finds patterns in the page, expressed for example in XPath, and then writes a program (wrapper) that identifies and extracts all the data records along with their data items (fields). Manual approaches are not scalable and are not usable on the current Web because of the very large number of different arrangements of data records in available deep web pages.
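A manual wrapper of the kind described above can be sketched as follows. This is only an illustration under stated assumptions: the page fragments, the class names, and the `extract_records` helper are hypothetical, and Python's standard `xml.etree.ElementTree` module (which supports a limited XPath subset and requires well-formed markup) stands in for a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real deep web pages are rarely this clean.
PAGE = """
<html><body>
  <div id="results">
    <div class="record"><span class="title">Book A</span><span class="price">10.00</span></div>
    <div class="record"><span class="title">Book B</span><span class="price">12.50</span></div>
  </div>
</body></html>
"""

# The same records, visually similar, but encoded as a table (also hypothetical).
PAGE_AS_TABLE = """
<html><body>
  <table id="results">
    <tr><td>Book A</td><td>10.00</td></tr>
    <tr><td>Book B</td><td>12.50</td></tr>
  </table>
</body></html>
"""

# Patterns a programmer would write by hand after inspecting PAGE.
RECORD_PATH = ".//div[@class='record']"
FIELD_PATHS = {"title": "span[@class='title']", "price": "span[@class='price']"}


def extract_records(markup):
    """Apply the hand-written patterns and return one dict per data record."""
    root = ET.fromstring(markup.strip())
    records = []
    for node in root.findall(RECORD_PATH):
        record = {}
        for name, path in FIELD_PATHS.items():
            field = node.find(path)
            record[name] = field.text if field is not None else None
        records.append(record)
    return records


records_from_divs = extract_records(PAGE)            # two records found
records_from_table = extract_records(PAGE_AS_TABLE)  # nothing matches
```

The wrapper recovers both records from the first fragment but none from the second, although the two would look much the same to a reader. Each new arrangement needs its own hand-written patterns, which is the scalability problem noted above.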
In supervised approaches based on the HTML internal structure, extraction rules are learned by supervised machine learning algorithms from a set of pages that have been manually labelled through a graphical user interface. The learned rules are then used for extracting data records from similar pages. Approaches of this kind still require significant manual effort for selecting and labelling information in the training set.
Unsupervised approaches based on the HTML internal structure exploit two main types of algorithms: instance learning and wrapper learning. Instance learning approaches exploit regularities in the DOM structures of deep web pages for detecting data records and their data items. These approaches use unsupervised machine learning algorithms based on tree alignment techniques, hierarchical clustering, etc. Approaches falling in this category are strongly dependent on the HTML structure of deep web pages.
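The idea of detecting records from DOM regularities can be illustrated with a deliberately crude stand-in. The sketch below does not implement tree alignment or hierarchical clustering; it assumes that sibling subtrees belong to the same record type only when their tag structure matches exactly, whereas alignment-based techniques tolerate partial differences. The markup and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical data region: three records plus one navigation item.
REGION = """
<ul>
  <li><b>Book A</b><i>10.00</i></li>
  <li><b>Book B</b><i>12.50</i></li>
  <li><a>Next page</a></li>
  <li><b>Book C</b><i>8.00</i></li>
</ul>
"""


def signature(node):
    """Tag structure of a subtree, ignoring text and attributes."""
    return (node.tag, tuple(signature(child) for child in node))


def repeated_structures(parent, min_repeats=2):
    """Group sibling subtrees whose tag structure is identical."""
    groups = defaultdict(list)
    for child in parent:
        groups[signature(child)].append(child)
    return [nodes for nodes in groups.values() if len(nodes) >= min_repeats]


region = ET.fromstring(REGION.strip())
record_groups = repeated_structures(region)
titles = [node.find("b").text for node in record_groups[0]]
```

Because the grouping looks only at tags, any change in the HTML encoding of a record changes its signature, which illustrates the dependence on HTML structure noted above.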
In unsupervised wrapper learning approaches, patterns or grammars are learned from a set of pages containing similar data records. In these approaches, the pages used for generating or learning wrappers have to be found manually or by another system; a set of heuristic rules based on highest-count tags, repeating tags, ontology matching, etc. is then used for identifying record boundaries. Furthermore, many approaches falling in this category need two or more Web pages for generating the wrapper.
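As an illustration of one such rule, the following sketch gives a deliberately simplified reading of the highest-count tag heuristic: within a data region that is assumed to be already known, the most frequent direct-child tag is taken to delimit the records. The markup and the function names are hypothetical, and the other rules mentioned above (repeating tags, ontology matching) are not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical data region containing a heading, three records, and a footer link.
REGION = """
<div id="results">
  <h2>3 results</h2>
  <div><b>Book A</b> 10.00</div>
  <div><b>Book B</b> 12.50</div>
  <div><b>Book C</b> 8.00</div>
  <p>Next page</p>
</div>
"""


def highest_count_tag(region):
    """Return the most frequent tag among the direct children of a region."""
    counts = Counter(child.tag for child in region)
    tag, _ = counts.most_common(1)[0]
    return tag


def split_records(region):
    """Treat every direct child carrying the highest-count tag as one record."""
    record_tag = highest_count_tag(region)
    return [child for child in region if child.tag == record_tag]


region = ET.fromstring(REGION.strip())
record_tag = highest_count_tag(region)
record_titles = [record.find("b").text for record in split_records(region)]
```

A single rule of this kind is easily misled, for example when two tags occur equally often, which is one reason such approaches combine several heuristics.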
An analysis of many deep web pages reveals the following: (i) HTML is continuously evolving. When new versions of HTML or new tags are introduced, approaches based on previous versions have to be updated. (ii) Web designers use presentation features and the spatial arrangement of data items to help human users identify data records. They do not take into account the complexity of the underlying HTML encoding. Thus, (iii) the complexity of the source code of Web pages is ever-increasing. In fact, the final appearance of a deep web page depends on a complex combination of HTML, XML (XHTML), scripts (JavaScript), XSLT, and CSS. (iv) Data records and pages are laid out either as lists or as matrices, in which data items may be organized either vertically or horizontally. (v) Data records can be contained in non-contiguous portions of a Web page (multiple data regions). All of these aspects make it very difficult for existing approaches to learn instances and generate wrappers from the internal encoding of Web pages, and such approaches therefore have strong limitations.
Visual-based approaches, such as LixTo, ViNTS, ViPER, and ViDE, exploit some visual features of deep web pages for defining wrappers. In LixTo, a graphical user interface showing a browser helps in manually designing the wrapper. In this case, the programmer does not have to write code and can design the wrapper using only mouse clicks on the target deep web page. The user visually selects data items and records; the system then computes the HTML patterns associated with the visual areas selected by the user and writes a wrapper that applies those patterns to similar pages. LixTo is therefore essentially a supervised approach based on the HTML encoding of Web pages, in which examples are labelled by using a graphical user interface.
ViNTS uses visual features in order to construct wrappers that extract answers to queries on search engines. The approach detects visual regularities, i.e., content lines, in search engine answers, and then uses the HTML tag structure to combine content lines into records. ViPER incorporates visual information on a web page for identifying and extracting data records by using a global multiple sequence alignment technique. Both of these approaches are strongly dependent on the HTML structure of the Web page, whereas visual information plays only a small role, which is a limitation. Furthermore, ViPER is able to identify and extract only the main data region.
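The notion of a content line can be illustrated with a small sketch. It is not the ViNTS procedure: it assumes that a layout engine has already produced text boxes with page coordinates (the `Box` type, the coordinates, and the tolerance value are hypothetical), and it covers only the visual step of grouping boxes that share a horizontal line, not the later combination of lines into records using the HTML tag structure.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A rendered text fragment; coordinates are assumed to come from a layout engine."""
    text: str
    x: float  # left edge
    y: float  # top edge


def content_lines(boxes, tolerance=2.0):
    """Group boxes whose top edges lie within `tolerance` of the first box of a line."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if lines and abs(lines[-1][0].y - box.y) <= tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Read each line left to right.
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


boxes = [
    Box("12.50", 200, 130),
    Box("Book A", 10, 100),
    Box("10.00", 200, 101),
    Box("Book B", 10, 130),
]
lines = content_lines(boxes)
```

The grouping depends only on where the fragments are drawn, not on the tags that produced them.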
ViDE is the most recent visual-based approach. It makes use of the page segmentation algorithm ViPS. This algorithm takes a web page as input and returns a visual block tree, i.e., a hierarchical visual segmentation of the web page in which child blocks are spatially contained in their ancestor blocks. The algorithm exploits some heuristics in order to identify similar groups of blocks that constitute data records in which constituent blocks represent data items.
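A visual block tree of the kind described above can be sketched as a simple data structure. The `Block` class, the coordinates, and the size-based grouping are hypothetical: the grouping is only a stand-in for the heuristics mentioned above, which are not specified here, and the tree is built by hand rather than by a segmentation algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A rectangular page region with its nested child regions."""
    x: float
    y: float
    width: float
    height: float
    children: List["Block"] = field(default_factory=list)

    def contains(self, other: "Block") -> bool:
        """True if `other` lies entirely inside this block."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)


def is_valid_block_tree(block: Block) -> bool:
    """Check that every child is spatially contained in its parent, recursively."""
    return all(block.contains(c) and is_valid_block_tree(c) for c in block.children)


def similar_children(block: Block, tolerance: float = 5.0) -> List[List[Block]]:
    """Group child blocks of near-identical size (stand-in similarity heuristic)."""
    groups: List[List[Block]] = []
    for child in block.children:
        for group in groups:
            ref = group[0]
            if (abs(ref.width - child.width) <= tolerance
                    and abs(ref.height - child.height) <= tolerance):
                group.append(child)
                break
        else:
            groups.append([child])
    return groups


# Hand-built example: a page with a header and three equally sized record blocks.
header = Block(0, 0, 800, 80)
record_blocks = [Block(20, 100 + i * 110, 760, 100) for i in range(3)]
page = Block(0, 0, 800, 600, [header] + record_blocks)
groups = similar_children(page)
```

Here the three equally sized blocks fall into one group, which would be the candidate set of data records; the header forms a group of its own.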
The ViDE approach suffers from several limitations. First, the approach strongly depends on the page segmentation algorithm ViPS, which in turn depends on the HTML encoding of Web pages and on the set of assumptions made for segmenting them. The ViPS algorithm attempts to compute a spatial representation of a Web page in terms of visual blocks by considering the document object model (DOM) structure and the visual information produced by the layout engine of a Web browser. In particular, the page segmentation algorithm relies heavily on the concept of a separator. In ViPS, separators are identified by heuristic rules that make use of experimentally set weights. The ViPS algorithm and the ViDE approach suffer when data records are spread over multiple data regions, each contained in a different page segment, and also when data records are arranged as a matrix.
An example of an existing patent related to this approach is U.S. Pat. No. 8,205,153 B2. This patent describes techniques for extracting information from formatted documents. Such techniques combine visual, mark-up, text-based, and layout-based rules for identifying information to extract from formatted documents. The method (i) makes use of geographical spatial databases typically adopted in geographical information systems, (ii) works only on web pages and not on other presentation-oriented formats, and (iii) does not make use of any semantic processing of document elements.
Some of the main problems that the existing approaches fail to address are: (i) the internal encoding of a POD may change frequently while the presentation remains essentially the same, and (ii) the same kind of object can be presented by using different layout conventions and presentation arrangements in different kinds of PODs.
Accordingly, it would be desirable to provide an improved method and system for automatically extracting objects from generic PODs. The method should allow a high-level abstract description of the objects to be extracted and the generation of wrappers independently of the internal encodings and presentation arrangements of the PODs.