The following relates to the document processing arts. It particularly relates to document conversion and structuring techniques, and is described with particular reference thereto. However, the following relates more generally to automated document analysis and processing techniques.
There is continuing interest in document conversion to facilitate use of legacy documents and document databases. A given document is typically generated and utilized in a format that is appropriate for that type of document. For example, a text-based document may be generated and utilized in a word processing application format, while a table may be generated and utilized in a spreadsheet format, and so forth. A document can be converted from one format to another in part or in its entirety. New application programs are continually being developed and revised, while older application programs become obsolete. The overall consequence is a large number of legacy documents in different formats, some of which may become less readily accessed as the underlying application programs, or earlier versions of such application programs, fall out of common use.
Document conversion is the process of converting current and/or legacy documents into a common format that is intended to be cross-platform compatible and less prone to obsolescence. If the common format is a structured format such as XML (that is, extensible markup language), HTML (that is, hypertext markup language), SGML (that is, standard generalized markup language), or so forth, then the document conversion also advantageously facilitates indexing, searching, structuring, or other organization of the converted documents or databases of documents. Typically, document conversion entails an initial conversion of the document to text fragments, which may be nested or otherwise organized, for example by paragraph, section, page, or so forth. The document being converted typically also contains objects such as images, figures, gridded tables, and so forth which either cannot be represented as text fragments (as is typically the case for bitmapped images, for example) or are more appropriately represented as grouped objects (as is typically the case for gridded tables, for example). During conversion, objects that cannot be represented as text fragments are suitably stored in their native format, either embedded in the converted document or separately stored and linked to a suitable location in the converted document by a pointer or other link. Objects conducive to storage as grouped objects are grouped and stored as a grouped object (such as a table) that is suitably tagged.
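By way of illustration only, the conversion described above can be sketched in a few lines of Python. The element names (`document`, `fragment`, `object`, `table`, `row`, `cell`), the `href` attribute, the `convert` function, and the sample content are all hypothetical choices made for this sketch and do not correspond to any particular conversion tool or schema.

```python
import xml.etree.ElementTree as ET

def convert(items):
    """Build a structured document from (kind, payload) pairs.

    kind is "text" (payload: str), "image" (payload: file path),
    or "table" (payload: list of rows, each a list of cell strings).
    All tag and attribute names here are illustrative assumptions.
    """
    root = ET.Element("document")
    for kind, payload in items:
        if kind == "text":
            # Ordinary content becomes a text fragment.
            ET.SubElement(root, "fragment").text = payload
        elif kind == "image":
            # Not representable as text: keep the native file and
            # link it into the converted document by a pointer.
            ET.SubElement(root, "object", {"type": "image", "href": payload})
        elif kind == "table":
            # Better represented as a tagged grouped object.
            table = ET.SubElement(root, "table")
            for row in payload:
                row_el = ET.SubElement(table, "row")
                for cell in row:
                    ET.SubElement(row_el, "cell").text = cell
    return root

doc = convert([
    ("text", "The pump housing is described first."),
    ("image", "images/pump.png"),
    ("table", [["Part", "Qty"], ["Seal", "2"]]),
])
xml_text = ET.tostring(doc, encoding="unicode")
```

In this sketch the image is linked rather than embedded; embedding the native data in the converted document would be an equally valid choice under the description above.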
Captions present a known problem for document conversion. A caption, such as a short explanation, annotation, description, legend, or so forth accompanying an image, figure, or other object, is typically converted as one or more text fragments during the initial document conversion processing. However, the caption is not a part of the general flow of text. Accordingly, if the caption is not recognized and addressed during document conversion, it causes an abrupt break in the reading flow and additionally leaves the associated object unlabeled or unidentified.
Existing techniques for identifying captions have certain drawbacks. In one approach, the text fragment immediately below (or above) an object is assumed to be the caption for that object. A drawback of this approach is that it assumes that there is in fact a caption, and it further assumes a specific geometrical relationship between the caption and the associated object (such as below or above the object). The approach fails if either assumption is incorrect. Moreover, a caption such as an annotation that includes a contiguous group of text fragments may be misidentified by this approach.
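The geometric approach and its failure modes can be illustrated with a minimal Python sketch. The `Box` and `Fragment` types, the `caption_below` function, the coordinate convention (y increasing down the page), and the sample coordinates are assumptions introduced for this illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Box:
    top: float
    bottom: float   # y increases down the page (assumed convention)
    left: float
    right: float

@dataclass
class Fragment:
    text: str
    box: Box

def caption_below(obj: Box, fragments: List[Fragment]) -> Optional[Fragment]:
    """Return the nearest fragment that lies below obj and overlaps it horizontally."""
    candidates = [
        f for f in fragments
        if f.box.top >= obj.bottom
        and f.box.left < obj.right
        and f.box.right > obj.left
    ]
    return min(candidates, key=lambda f: f.box.top - obj.bottom, default=None)

figure = Box(top=100, bottom=300, left=50, right=400)

# Failure 1: the figure has no caption, so ordinary body text is picked.
body = Fragment("The pump is then primed as follows.", Box(320, 340, 50, 400))
picked_body = caption_below(figure, [body])

# Failure 2: a two-fragment annotation is only partially captured.
line1 = Fragment("Pump housing,", Box(310, 325, 50, 400))
line2 = Fragment("shown in cross-section.", Box(326, 341, 50, 400))
picked_line = caption_below(figure, [line1, line2])
```

In the first case the heuristic labels a body paragraph as the caption because it presumes a caption exists; in the second it returns only the first of two contiguous caption fragments.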
Another approach is to use a pre-selected keyword or other pre-selected heuristic to identify captions. For example, it may be assumed that a figure caption is any text fragment of the form “Fig. $ ...” where “$” is a placeholder indicating a number or other enumerator and “...” indicates any following text. A drawback of this approach is that it may be overinclusive (for example, it would not be uncommon for a normal paragraph to begin with the aforementioned form “Fig. $ ...” if the paragraph references the figure) or underinclusive (for example, if the document uses a different format for captions than the pre-selected keyword or heuristic, such as using “Diagram $ ...” in place of “Fig. $ ...”). Again, the assumptions involved in this approach lead to limited applicability and susceptibility to errors in identifying the captions.
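The overinclusive and underinclusive behavior of a keyword heuristic can be shown with a short Python sketch. The regular expression, the `looks_like_caption` function, and the sample fragments are hypothetical and restrict the enumerator to digits for simplicity.

```python
import re

# Assumed encoding of the pre-selected form: "Fig." followed by a number.
FIG_CAPTION = re.compile(r"^Fig\.\s*\d+")

def looks_like_caption(fragment: str) -> bool:
    """Flag a text fragment as a caption if it starts with the pre-selected form."""
    return FIG_CAPTION.match(fragment.strip()) is not None

fragments = [
    # A genuine caption in the expected style.
    "Fig. 3 Cross-section of the pump housing.",
    # A body paragraph that merely references the figure (overinclusive case).
    "Fig. 3 shows the pump housing in cross-section, with the seal removed.",
    # A genuine caption in a different style (underinclusive case).
    "Diagram 3 Cross-section of the pump housing.",
]
flags = [looks_like_caption(f) for f in fragments]
```

Here `flags` evaluates to `[True, True, False]`: the body paragraph is wrongly flagged as a caption, and the caption that uses “Diagram” is missed.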
Accordingly, there remains an unfulfilled need in the art for improved and robust techniques for identifying or detecting captions.