1. Field of the Invention
The present invention relates to a document segmentation apparatus and method for dividing a document into segments according to content, and more particularly to a document segmentation apparatus and method for dividing a document that includes one or more tables.
2. Related Background Art
Conventionally, information on the web has been presented in units termed “pages”, and the layout and dimensions of each page can be set freely by the information presenter. Of course, the information presenter composes the pages according to his or her intention in conveying the information, but such pages do not necessarily meet the requirements of a reader.
Accordingly, even when a series of topics or subjects which the presenter judges to be closely related are gathered in one page, the reader may not regard them as related, and, if only one of the plural subjects is useful, information about the other subjects may be an obstacle when the required information is retrieved. Particularly, in mobile equipment having a limited information presenting space, a function for displaying only the required information is important.
Thus, it is important that documents to be displayed are divided in advance into segments based on content (segmentation) and that only the portion requested by the reader is presented. Almost all web pages are written in Hyper Text Markup Language (HTML), a language for composing web pages. Although HTML is a language for describing the structure of a document, it is difficult to describe the details of the logical structure in HTML, and in practice the main role of HTML is to designate the layout in the browser.
However, it is considered that the viewpoint of the information presenter is reflected in the layout of the page. Thus, there has been proposed a technique in which the page is divided on the basis of HTML tags in order to generate segments which reflect the intention of the information presenter.
In such a technique, a table, in the sense of the portion between the <TABLE> tag and the </TABLE> tag, is judged to be one meaningful group and is formed as one segment. However, such a table frequently includes a plurality of sets of information and occupies a relatively large amount of space.
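The tag-based division described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system; the names TableSegmenter and segment_by_table are hypothetical, and the sketch assumes only Python's standard html.parser module. Each outermost <TABLE>...</TABLE> span becomes one segment, and text outside tables becomes separate segments.

```python
from html.parser import HTMLParser


class TableSegmenter(HTMLParser):
    """Hypothetical helper: each outermost <table> span is one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []  # list of (kind, text) pairs, in document order
        self._depth = 0     # nesting depth of currently open <table> tags
        self._buffer = []   # text collected for the segment being built

    def _flush(self, kind):
        # Normalize whitespace and store the buffered text, if any.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.segments.append((kind, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            if self._depth == 0:
                self._flush("text")  # end the text run before the table
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._flush("table")  # the whole table is one segment

    def handle_data(self, data):
        self._buffer.append(data)


def segment_by_table(html):
    """Return (kind, text) segments, kind being "table" or "text"."""
    parser = TableSegmenter()
    parser.feed(html)
    parser.close()
    # Flush whatever remains; an unclosed table is still treated as a table.
    parser._flush("table" if parser._depth else "text")
    return parser.segments


page = ("<p>Intro</p>"
        "<TABLE><TR><TD>News A</TD><TD>News B</TD></TR></TABLE>"
        "<p>Footer</p>")
segments = segment_by_table(page)
```

In this example the two cells "News A" and "News B" end up in the same segment even if they are unrelated, which is the shortcoming of treating a whole table as one meaningful group.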
Further, such “tables” can be categorized into tables in the ordinary sense of the word and table formatting used to designate the layout of images or text. In the two cases, the tags are used in quite different ways.
Furthermore, even when the table formatting describes an actual table, the table may take various styles: a set of data may be represented in a column or in a row, and a column (or row) of item names may or may not be present.