1. Field of the Invention
The present invention relates to electronic documents. More specifically, the present invention relates to a method and an apparatus for identifying white space tables within a document.
2. Related Art
Documents often include numerous formatting types, ranging from plain text to complex tables to pictures and multimedia. Moreover, a single document commonly combines several of these content formats.
Hence, in order to display documents correctly to a user, the display program needs to know how the content in the document is formatted. Some documents currently use a hierarchy of tag elements to represent the structure of the document, wherein each section of the document is represented by a tag that includes information about the section's content type (for example, whether it is plain text, a table, or any other format). Knowing the content type of each section becomes especially important if the document is re-flowed to fit a different display size, or is repurposed to a different document format.
Often, documents are created in one program and then converted to a new format more suitable for distribution on the Internet. During this conversion process, the converter needs to correctly identify the formatting types within the document. White space tables, which consist of text arranged in rows and columns, where the rows and columns are separated strictly by bands of white space rather than by horizontal and vertical lines, can be relatively hard to differentiate from other formatting types. If the identification and tagging process does not recognize the white space tables correctly, the tabular information could be reordered or lost when the document is re-flowed or repurposed. For example, after the document is re-flowed, white space between columns might disappear and chunks of text might be positioned extremely close to each other without clear bounding space. This can cause various problems for visually disabled users, who might not be able to use the tabular data in a meaningful way. In addition, when the document is repurposed, the text could be rearranged and become unreadable.
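For illustration only, the notion of columns separated by bands of white space can be sketched as follows. This is a minimal sketch and not the method of the present invention: it assumes monospaced plain text supplied as one string per row, and the function name find_whitespace_bands and the parameter min_width are hypothetical names introduced here.

```python
from typing import List, Tuple


def find_whitespace_bands(lines: List[str], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character-column ranges that are blank in every row.

    Hypothetical helper for illustration: assumes monospaced plain text.
    Left and right margins are ignored; only interior bands are reported.
    The end index is exclusive.
    """
    if not lines:
        return []
    width = max(len(line) for line in lines)
    # Pad short rows so every row can be indexed at every column.
    padded = [line.ljust(width) for line in lines]
    # A column is blank when every row holds a space at that position.
    blank = [all(row[c] == " " for row in padded) for c in range(width)]

    bands: List[Tuple[int, int]] = []
    start = None
    # The trailing False is a sentinel that closes a run reaching the right edge.
    for c, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = c
        elif not is_blank and start is not None:
            is_interior = start > 0 and c < width
            if is_interior and c - start >= min_width:
                bands.append((start, c))
            start = None
    return bands


rows = [
    "Name    Qty   Price",
    "Apple   3     1.20",
    "Pear    12    0.85",
]
print(find_whitespace_bands(rows))
```

A test this naive would report the gutter between two layout columns of ordinary text just as readily as the gap between two table columns, which is one reason white space tables are hard to differentiate from other formatting types.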
Existing methods for detecting white space tables typically employ weighted learning algorithms, which adjust their weights based on detected white space tables. However, these methods typically fail to correctly identify complex white space tables, as well as the rare white space tables that are fairly sparse. In addition, existing methods often incorrectly identify layout columns as white space table columns.
Hence, what is needed is a method and an apparatus for detecting white space tables without the problems listed above.