The following relates to the graphical processing, document processing, information processing, and related arts. It finds example application in extracting structural layout of tables, and is described with particular reference thereto. The following finds more general application in determining structural layouts of rectangular cells of tables, grids, line art objects or representations, and so forth.
Tables are common elements in documents, and the contents of such tables typically contribute substantially to the informational content of the document. The informational content of a table is often intimately related to its layout. For example, every entry in one column of a table may store a price value, while entries in other columns may store an item number, an item name, or so forth. Accordingly, it is advantageous to determine and utilize the structural layout of the table in conjunction with extracting and interpreting the informational content of the table. For example, the content may be interpreted on a row-by-row basis, or on a column-by-column basis, or so forth.
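By way of illustrative example only, the following Python sketch shows how the same table content may be interpreted on a row-by-row basis or on a column-by-column basis once its layout is known. The sample table, the assumption that the topmost row holds column headers, and the function names `rows_as_records` and `columns_by_header` are hypothetical and are introduced here solely for illustration.

```python
# Hypothetical table whose layout is already known: the first row holds
# column headers, and each remaining row holds one entry per column.
table = [
    ["item number", "item name", "price"],
    ["1001", "widget", "2.50"],
    ["1002", "gadget", "4.00"],
]


def rows_as_records(table):
    """Row-by-row interpretation: each row becomes a record keyed by header."""
    headers, *rows = table
    return [dict(zip(headers, row)) for row in rows]


def columns_by_header(table):
    """Column-by-column interpretation: each header maps to its column entries."""
    headers, *rows = table
    return {header: [row[i] for row in rows] for i, header in enumerate(headers)}


records = rows_as_records(table)
columns = columns_by_header(table)
```

In this sketch, `records` groups the entries of each item together, while `columns` groups all price values, all item names, and so forth. Either interpretation presupposes that the rows and columns of the table have been correctly identified.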
In document conversion applications, a document is converted from a source format, such as portable document format (PDF), to a more structured format such as extensible markup language (XML), hypertext markup language (HTML), or so forth. In performing such a conversion, it is advantageous to extract and retain the logical layout of a table for use in structuring the document. Such extraction can, however, be difficult, because different tables use different spatial layouts. For example, some tables include a line- or vector-based grid containing each cell of the table, with the topmost row of grid elements containing column headers. In other tables, the column headers are above and outside of the line- or vector-based grid. Moreover, some cells may be split or merged, so that the table deviates from a canonical row-by-row and column-by-column format. Indeed, some tables deviate strongly from such a canonical format, and include sub-rows, sub-columns, or other structures.
Some tables include line- or vector-based gridlines that provide the reader with a guide for following rows and columns of the table. In some automated table reading approaches, these line- or vector-based gridlines are ignored, and a purely text-based analysis is performed. Such a text-only approach will lose the spatial layout information typically provided by the gridlines. However, extracting useful information about the logical layout of the table from the gridlines has heretofore been difficult.