In recent years, an enormous number of electronic documents have been generated. These documents circulate over computer networks or are captured by an input device such as a scanner. Under these circumstances, a way to make use of these documents is desirable. One method for utilizing such accumulated documents is text mining (that is, a kind of document search processing that automatically grasps the gist of documents, as well as trends in their contents and changes over time). These documents may also be used as source data for machine translation.
To make use of these accumulated documents, their layout must be analyzed. Typically circulated documents, such as a document posted on a web page, are laid out so that a human can easily comprehend them visually. Documents computerized by a scanner, on the other hand, originate from paper manuscripts laid out according to a typical print format. Such laid-out documents include a title, a header, lists, tables, etc., in addition to the paragraphs that form the body of the document, and paragraphs are often arranged in multiple columns, such as a double column. Furthermore, a table may include elements written vertically as well as elements written horizontally. For this reason, it is difficult to analyze documents automatically and satisfactorily without considering the original document layout.
One method for analyzing layouts is to focus on spatial features. For example, when focusing on blanks, the text following a blank line is estimated to be the start of a paragraph.
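The blank-based approach can be illustrated with a minimal sketch. It assumes plain-text input in which blocks are separated by one or more blank lines; the function name `split_on_blank_lines` is hypothetical and is not taken from any particular prior-art system.

```python
def split_on_blank_lines(text: str) -> list[str]:
    """Group consecutive non-blank lines into blocks.

    Illustrative assumption: every run of non-blank lines is treated
    as one paragraph candidate, and blank lines act only as separators.
    """
    blocks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the block being collected.
            current.append(line)
        elif current:
            # Blank line: close the block collected so far, if any.
            blocks.append("\n".join(current))
            current = []
    if current:
        # Close the final block when the text does not end with a blank line.
        blocks.append("\n".join(current))
    return blocks
```

Under this assumption, a title, a table row, and a body paragraph are all returned as the same kind of block, since only the spacing is examined and not the content.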
Problems to be Solved by the Invention
However, there is a limit to how well significant text blocks can be extracted by relying on spatial features. For example, blanks are used differently in paragraph elements (i.e., text in which sentences are set as a solid block) than in text within a table. That is, a blank character (or a blank produced by a tab) at the head of a line is recognized as the beginning of a paragraph, whereas blanks within a table are not usually arranged in this way. Furthermore, in list-style text such as itemized lists, an indent may be inserted at the head of a line, or a blank line may be inserted between lines. It is difficult to analyze such diversely laid-out text uniformly by relying only on the presence of blanks.
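The limitation described above can be made concrete with a small sketch. It assumes a deliberately simple rule, namely that any non-blank line beginning with a space or a tab marks the beginning of a paragraph; the function names and the sample lines are hypothetical and serve only as an illustration.

```python
def starts_with_blank(line: str) -> bool:
    """Return True when the line head is a space or a tab."""
    return line[:1] in (" ", "\t")


def paragraph_starts(lines: list[str]) -> list[int]:
    """Return the indices of lines that the line-head rule marks as paragraph starts."""
    return [
        i
        for i, line in enumerate(lines)
        if line.strip() and starts_with_blank(line)
    ]


sample = [
    "  A new paragraph begins here",  # genuine paragraph start
    "and continues on this line.",
    "  - first list item",            # indented list item, not a paragraph
    "  - second list item",           # indented list item, not a paragraph
    "Name    Price",                  # table row: blanks appear inside the line
]

print(paragraph_starts(sample))  # prints [0, 2, 3]
```

In this sample the rule marks both list items as paragraph beginnings and gives no indication that the last line is a table row, which is the kind of non-uniform result that arises when only the presence of blanks is examined.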
In addition, even if a block of text is extracted based on the layout, no semantic evaluation of the sentences (or runs of words) in that block is performed. Accordingly, elements such as tables, titles, and lists, which are not displayed as a solid block of text like a paragraph, may be split into separate blocks, so that their meaning may not be perceived correctly.
For advanced utilization of accumulated documents (e.g., text mining), the contents of documents must be discriminated automatically. Conventionally, layout analysis based on spatial features has abandoned the analysis of elements such as tables and lists (itemized lists) because they are difficult to analyze, or has segmented them so that their later utilization was difficult. However, important messages tend to be included in tables and lists rather than in paragraph elements. It is therefore desirable to extract these elements in a form that is applicable to later semantic analysis.