In terms of how a formatted document is generated, a document is a collection of data and structures, specifically content data, a physical structure and a logical structure. Content data refers to data such as text, images and graphs. The physical structure describes how the content data is laid out and combined on a page, and includes, for example, text lines, text blocks and charts. The logical structure describes the information conveyed by the content data and the relationships among that information. It includes not only the logical attribute of each page element, such as text paragraph, abstract, title or table, but also the hierarchy of the document and the logical relations between document elements, such as the correlation between an image and its caption (cutline).
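The three-part view of a document described above can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names (`ContentItem`, `PhysicalBlock`, `LogicalElement`), the field names and the file name `fig1.png` are hypothetical and do not come from any particular document format or library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Bounding box as (x0, y0, x1, y1) in page units (assumed convention).
BBox = Tuple[float, float, float, float]


@dataclass
class ContentItem:
    """Content data: a piece of text, an image or a graph."""
    kind: str     # e.g. "text", "image", "graph"
    payload: str  # the text itself, or a reference to image/graph data


@dataclass
class PhysicalBlock:
    """Physical structure: where content data sits on the page."""
    kind: str     # e.g. "text_line", "text_block", "chart"
    bbox: BBox
    content: List[ContentItem] = field(default_factory=list)


@dataclass
class LogicalElement:
    """Logical structure: the role of a block and its relations."""
    role: str     # e.g. "title", "paragraph", "image", "caption"
    block: PhysicalBlock
    children: List["LogicalElement"] = field(default_factory=list)  # hierarchy
    related_to: Optional["LogicalElement"] = None  # e.g. caption -> image


# An image block and the caption that belongs to it.
image_block = PhysicalBlock("chart", (0.0, 0.0, 100.0, 80.0),
                            [ContentItem("image", "fig1.png")])
caption_block = PhysicalBlock("text_line", (0.0, 82.0, 100.0, 90.0),
                              [ContentItem("text", "Figure 1: example")])

figure = LogicalElement("image", image_block)
caption = LogicalElement("caption", caption_block, related_to=figure)
```

In this sketch the `related_to` field carries the logical relation between an image and its caption, while `children` carries the document hierarchy; both are kept separate from the purely geometric `bbox`.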
Document analysis extracts the physical structure of a document, while document understanding establishes the mapping between the physical structure and the logical structure. For the document analysis task, the only available input is the final form of the document, in which neither the physical structure nor the logical structure is explicitly represented. The logical model and physical model used to generate the document therefore have to be deduced in reverse, so as to recover the physical and logical structure of the document as fully as possible. In practical applications, the readability required on mobile devices makes recovering the physical and logical structure a priority.
When recovering the physical and logical structure, the logical structure information of the document can be extracted at the page level, and each physical structure block extracted from the page can be labeled according to its logical function. Page logical structure analysis for traditional image documents currently benefits from developments in the field of artificial intelligence, and logical structure analysis is shifting from methods based on a priori rules to methods based on machine learning. Unlike traditional image document analysis, the analysis of a formatted document can draw on the information the document itself provides to assist layout understanding. A fixed-layout formatted document, however, contains a large number of spliced elements as well as graphic layers superimposed on one another. Such data cannot be used directly to construct the logical structure of the document; it must first be processed, for example spliced or superimposed according to spatial relationships, before the content it presents can be determined. Classifying, recognizing and labeling the non-text objects in a page is one of the focal points of document understanding, and analyzing and understanding composite graphs in a layout that mixes graphics and text is particularly challenging.
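As a rough illustration of what splicing elements "according to spatial relationship" can mean, the sketch below groups fragment bounding boxes that overlap or touch and replaces each group with its enclosing box. It is a minimal example, not the technique this document goes on to propose: the function names `boxes_touch` and `merge_fragments`, the `(x0, y0, x1, y1)` box convention and the tolerance `tol` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Bounding box as (x0, y0, x1, y1) in page units (assumed convention).
BBox = Tuple[float, float, float, float]


def boxes_touch(a: BBox, b: BBox, tol: float = 1.0) -> bool:
    """Return True if two boxes overlap or lie within `tol` of each other."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def merge_fragments(boxes: List[BBox], tol: float = 1.0) -> List[BBox]:
    """Group touching fragments (union-find) and return one box per group."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of fragments that are spatially adjacent.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j], tol):
                parent[find(i)] = find(j)

    # Compute the enclosing box of each group.
    groups: Dict[int, BBox] = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in groups:
            gx0, gy0, gx1, gy1 = groups[root]
            groups[root] = (min(gx0, x0), min(gy0, y0),
                            max(gx1, x1), max(gy1, y1))
        else:
            groups[root] = (x0, y0, x1, y1)
    return sorted(groups.values())


# Two image strips that share an edge, plus one distant fragment.
fragments = [(0, 0, 10, 10), (10, 0, 20, 10), (50, 50, 60, 60)]
merged = merge_fragments(fragments)
# merged == [(0, 0, 20, 10), (50, 50, 60, 60)]
```

The pairwise comparison is quadratic in the number of fragments, which is acceptable for a single page in a sketch like this; a real system would also need to account for layer order and element type, which this example ignores.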
Therefore, a new logic processing technique for composite graphs in a formatted document is needed, one that can apply appropriate logic processing to the composite graphs split from a formatted document, so that layout understanding of a composite graph in a mixed graphic-text layout becomes easier and logic errors are avoided.