1. Field of the Invention
This invention relates to improvements in finding the columns in a tabular document. More particularly, this invention searches for column separations and processes only the line-intervals separating the word fragments in the table.
2. Description of the Related Art
A tabular document is a systematic arrangement of logically related entities that are mapped onto a layout structure based on simple linear constraints. By controlling the placement and format of each entity, these constraints provide the visual cues that help to identify the organization of a table's content, i.e., its logical structure. The primary geometrical constraint imposed on a table is a linear placement of related entities. Other constraints include alignment and the use of monospaced fonts in typesetting the table.
In documents containing a large number of similar records, entities of the same logical identity are typically placed along columns of a grid structure. To determine the locations of the columns, a conventional technique computes a histogram of the document bitmap and searches the histogram for peaks. This method requires that the characters and other artifacts first be processed into a bitmap.
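By way of illustration only, the conventional bitmap-histogram technique described above might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular prior-art system: the bitmap is assumed to be a list of rows of 0/1 pixel values, and the function names column_histogram and find_peaks and the threshold parameter are hypothetical names introduced here.

```python
from typing import List, Tuple


def column_histogram(bitmap: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel column of the bitmap."""
    if not bitmap:
        return []
    width = len(bitmap[0])
    return [sum(row[x] for row in bitmap) for x in range(width)]


def find_peaks(hist: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return (start, end) pixel ranges, end exclusive, where the
    histogram count is at or above the assumed threshold."""
    runs: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(hist):
        if count >= threshold and start is None:
            start = x                 # a peak region begins
        elif count < threshold and start is not None:
            runs.append((start, x))   # the peak region ends
            start = None
    if start is not None:
        runs.append((start, len(hist)))
    return runs


# Tiny hypothetical bitmap: two inked regions separated by a blank gap.
bitmap = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
hist = column_histogram(bitmap)
peaks = find_peaks(hist, threshold=2)
```

Note that every pixel of the bitmap is visited to build the histogram, which reflects the processing cost attributed to this conventional method above.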