The following relates to the document processing arts. It is described with example reference to applications involving the detection and delineation of cells in tables having separating gridlines. However, the following is more generally applicable to detection and segmentation of gridlines and rectangles, and to apparatuses, methods, applications, and so forth employing same.
Document analysis relates to processing of documents to extract useful information. Table or tabular processing is an important area of document analysis. Tables or tabular presentations may contain valuable information such as quantitative results, syntheses, correlations, or other presentations of factual information. Automated analysis of tabular information is difficult, however, because the information is typically grouped into table cells whose recognition depends upon spatial location in the document, relative alignment of cells with other cells, and similar layout-based considerations. In contrast, document analysis techniques tend to focus upon textual analysis that is typically relatively independent of document layout.
In some tables or tabulations, the cells are delineated by horizontal and vertical gridlines. These gridlines beneficially guide the eye of the human viewer to identify individual cells and, in some arrangements, selected groups of cells. Some automated table analysis techniques employ image analysis of such gridlines to assist in identifying table cells. For example, a gridline identified by image analysis may be taken as an indication of a boundary between table cells.
The robustness and reliability of such image analysis-based cell identification techniques have been limited by uncertainties in the image analysis, such as in the thresholding typically used to distinguish gridline pixels from surrounding pixels. Moreover, deviations of the table grid from an ideal Cartesian grid-type layout can be problematic. For example, in some cases a group of cells may be merged across two or more rows, two or more columns, or so forth so that the merged cell does not “line up” with the general layout of rows and columns of the table grid. Similar problems can arise if a cell is split into two or more rows, two or more columns, or so forth.
Moreover, the above image analysis-based approaches are typically not directly applicable to documents whose graphical content is stored in an abstract format such as portable document format (PDF) or scalable vector graphics (SVG) format. In such abstract vector-based representations, there are typically many different (that is, redundant) ways for a given table grid to be represented. For example, each minimal cell (that is, each cell that does not contain any sub-cells) may be represented by four boundary vectors, with vector redundancy at each table cell boundary. Alternatively, horizontal vectors extending across all columns of the table may represent gridlines separating table rows, and similarly vertical vectors extending across all rows of the table may represent the gridlines separating table columns. This representation has no vector redundancy, but also does not have one-to-one correlation between grid vectors and individual minimal table cells. There are many other possible grid representations with various levels of vector redundancy.
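The two representations described above can be illustrated with a minimal sketch. It is not taken from any particular PDF or SVG library. It assumes, purely for illustration, that each grid vector is an axis-aligned segment given as a tuple (x0, y0, x1, y1), and the helper names per_cell_segments, full_span_segments, and canonical are hypothetical. The sketch builds the same table grid both ways and shows that merging collinear segments reduces both to a single common form.

```python
# Illustrative sketch only: segments are assumed to be axis-aligned tuples
# (x0, y0, x1, y1); all function names here are hypothetical.

def per_cell_segments(xs, ys):
    """Redundant form: each minimal cell contributes four boundary segments."""
    segs = []
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            segs += [
                (x0, y0, x1, y0),  # edge at y0
                (x0, y1, x1, y1),  # edge at y1
                (x0, y0, x0, y1),  # edge at x0
                (x1, y0, x1, y1),  # edge at x1
            ]
    return segs


def full_span_segments(xs, ys):
    """Non-redundant form: one segment per gridline, spanning the whole table."""
    horizontal = [(xs[0], y, xs[-1], y) for y in ys]
    vertical = [(x, ys[0], x, ys[-1]) for x in xs]
    return horizontal + vertical


def canonical(segs):
    """Merge overlapping or abutting collinear segments into maximal runs."""
    lines = {}
    for x0, y0, x1, y1 in segs:
        if y0 == y1:      # horizontal segment, keyed by its y coordinate
            key, lo, hi = ("h", y0), min(x0, x1), max(x0, x1)
        elif x0 == x1:    # vertical segment, keyed by its x coordinate
            key, lo, hi = ("v", x0), min(y0, y1), max(y0, y1)
        else:
            raise ValueError("only axis-aligned segments are handled")
        lines.setdefault(key, []).append((lo, hi))

    merged = set()
    for key, spans in lines.items():
        spans.sort()
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo <= cur_hi:              # overlaps or touches the current run
                cur_hi = max(cur_hi, hi)
            else:                         # gap: close the run, start a new one
                merged.add((key, cur_lo, cur_hi))
                cur_lo, cur_hi = lo, hi
        merged.add((key, cur_lo, cur_hi))
    return merged


if __name__ == "__main__":
    xs, ys = [0, 10, 20], [0, 5, 10]      # a table of two columns and two rows
    redundant = per_cell_segments(xs, ys)
    compact = full_span_segments(xs, ys)
    print(len(redundant), len(compact))
    print(canonical(redundant) == canonical(compact))
```

For this table of two columns and two rows, the per-cell form yields 16 segments and the full-span form yields 6, yet both reduce to the same set of six maximal gridlines. The full-span form alone does not indicate which segments bound which minimal cell, which is the loss of one-to-one correlation noted above.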
One approach for processing documents stored in an abstract graphical representation such as SVG or PDF is to convert the abstract graphical content into a bitmapped representation, and then to process the bitmap using the aforementioned image analysis techniques to identify gridlines and table cells. However, this approach is computationally inefficient due to the intermediate bitmapping process, and it also inherits the aforementioned difficulties of image analysis-based techniques.