1. Field of the Invention
The present invention relates to the field of electronic document format conversion technologies, and in particular to a table recognition method and a table recognition system based on Probabilistic Graphical Models (PGM).
2. Description of the Prior Art
In terms of how a typeset document is generated, a document is a set of data and structures, specifically comprising content data, a physical structure, and a logical structure. Document analysis refers to extraction of the physical structure of the document, whereas document understanding refers to construction of a mapping relationship between the physical structure and the logical structure. In practice, given the readability requirements of mobile terminals, recovery of the physical and logical structures is particularly important. Table detection and recognition on a document page is one of the critical issues in document understanding. A table, which has an independent logical function, needs to be subjected to physical segmentation and logical labeling. In a fixed-layout document, a table object may be constituted by numerous text elements (primitives) and operations, or may consist entirely of a single image graphic element.
Tables are an important part of a document. Accurately recognizing tables and their contents is therefore especially important when analyzing a fixed-layout document. In the prior art, some methods for recognizing and converting tables in a fixed-layout document are available. For example, a table in a PDF document may be converted into an Excel table. Specifically, the border coordinates of the text blocks of the table in the PDF document are first recognized; row segmentation and column segmentation are then performed on the table according to the border coordinates of the text blocks to acquire a plurality of segmented areas; the segmented area to which each text block belongs is determined; and the text blocks in the segmented areas are stored into the corresponding cells of the Excel table. In this way, a table with no border lines or with incomplete border lines in the PDF document can be converted into an Excel table without depending on the border lines of the table. This solution is defective in that it is a traditional rule-based table segmentation method: when the border coordinates of the text blocks in the table are recognized, no other text is allowed outside the table; otherwise, the text outside the table may be mistakenly recognized as text in the table. However, in a practical fixed-layout document, a large number of logical blocks, for example, photographs, titles, and text body, may exist outside the table. Visually, most tables are not obviously differentiated from the text body paragraphs, and tables have diversified styles. As a result, a rule-based method may fail to recognize the boundary between the table and the other logical blocks, and those other logical blocks tend to be mistakenly recognized as part of the table. It is therefore difficult to locate the actual table as a whole, and the recognition result fails to satisfy actual requirements.
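As a rough illustration of the prior-art approach described above, the following Python sketch segments rows and columns purely from the border coordinates of text blocks and then assigns each block to a segmented area. It is a minimal sketch under stated assumptions, not the implementation of any particular prior-art product: the names Box, merge_intervals, band_index, and segment_table, the (x0, y0, x1, y1) coordinate convention, and the sample data are all hypothetical and introduced only for this example.

```python
from typing import List, Tuple

# Assumed convention: a box is (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping 1-D intervals; each merged interval is one band."""
    merged: List[List[float]] = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps (or touches) the previous band: extend it.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [(lo, hi) for lo, hi in merged]


def band_index(bands: List[Interval], lo: float, hi: float) -> int:
    """Return the index of the band that fully contains [lo, hi]."""
    for i, (b_lo, b_hi) in enumerate(bands):
        if b_lo <= lo and hi <= b_hi:
            return i
    raise ValueError("interval is not covered by any band")


def segment_table(blocks: List[Tuple[str, Box]]) -> List[List[str]]:
    """Rule-based segmentation: rows come from the vertical extents of the
    text blocks, columns from the horizontal extents."""
    rows = merge_intervals([(y0, y1) for _, (_, y0, _, y1) in blocks])
    cols = merge_intervals([(x0, x1) for _, (x0, _, x1, _) in blocks])
    grid = [["" for _ in cols] for _ in rows]
    for text, (x0, y0, x1, y1) in blocks:
        r = band_index(rows, y0, y1)
        c = band_index(cols, x0, x1)
        # Blocks that fall in the same area are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Hypothetical sample data: a borderless 2 x 2 table.
table_blocks = [
    ("Name", (10, 10, 50, 20)),
    ("Age", (80, 10, 100, 20)),
    ("Ann", (10, 30, 40, 40)),
    ("31", (80, 30, 95, 40)),
]
clean_grid = segment_table(table_blocks)

# The same table followed by a full-width body paragraph outside the table.
page_blocks = table_blocks + [("Body text below the table", (10, 60, 100, 70))]
polluted_grid = segment_table(page_blocks)

print(clean_grid)
print(polluted_grid)
```

With only the table blocks, the sketch recovers a 2 x 2 grid. Once the full-width paragraph is included, its horizontal extent bridges the gap between the two columns, so all blocks collapse into a single column and the paragraph is absorbed as an extra row. This reproduces, in miniature, the defect noted above: a purely rule-based segmentation cannot tell text outside the table from text inside it.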