The exemplary embodiment relates to the document processing arts and finds application in document conversion and structuring. In particular, it relates to the detection of captions having sequential features, such as numbers, and is described with particular reference thereto.
Techniques have been developed for converting documents from a format in which there is little or no document structure to a structured format such as XML (extensible markup language), HTML (hypertext markup language), or SGML (standard generalized markup language). Typically, document conversion entails an initial conversion of the document to text fragments, which may be nested or otherwise organized, for example by paragraph, section, page, or the like. The document being converted typically also contains objects such as images, figures, gridded tables, and so forth, which either cannot be represented as text fragments (as is typically the case for bitmapped images, for example) or are more appropriately represented as grouped objects (as is typically the case for gridded tables, for example). During conversion, objects that cannot be represented as text fragments are suitably stored in their native format, either embedded in the converted document or separately stored and linked to a suitable location in the converted document by a pointer or other link. Objects conducive to storage as grouped objects are grouped and stored as a grouped object (such as a table) that is suitably tagged.
To facilitate indexing, searching, structuring, or other organization of the converted documents, various automated techniques have been developed for recognizing parts of the document, such as page numbers, headers and footers, a table of contents, captions, and the like. Captions present a particular problem for document conversion. A caption is a textual element, such as a short explanation, annotation, description, or legend, accompanying an illustration, such as an image, figure, or other object, and is typically converted as one or more text fragments during the initial document conversion processing. However, the caption is not a part of the general flow of text. Accordingly, if the caption is not recognized and addressed during document conversion, it causes an abrupt break in the reading flow, and additionally leaves its associated object unlabeled or unidentified.
Existing techniques for identifying captions have certain drawbacks. In one approach, the text fragment immediately below (or above) an object is assumed to be the caption for that object. A drawback of this approach is that it assumes that there is in fact a caption, and it further assumes a specific geometrical relationship between the caption and the associated object (such as below or above the object). The approach fails if either assumption is incorrect. Moreover, a caption that includes a contiguous group of text fragments may be misidentified by this approach. Another approach for identifying captions is to use a preselected keyword or other preselected heuristic to identify captions. For example, it may be assumed that a caption for a figure is any text fragment of the form “Fig. $ . . . ” where “$” is a placeholder indicating a number or other enumerator and “ . . . ” indicates any following text. A drawback of this approach is that it may be over-inclusive or under-inclusive, and the assumptions involved lead to limited applicability and susceptibility to errors in identifying the captions. Current OCR engines fail both to recognize diagrams correctly (zoning issues) and to recognize the associated caption.
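By way of illustration only, the keyword heuristic described above can be sketched in a few lines of Python. The pattern, the function name `is_caption`, and the sample fragments are all hypothetical and are not taken from any particular prior system; the sketch merely shows how such a preselected form can be both over-inclusive and under-inclusive.

```python
import re

# Hypothetical pattern for the "Fig. $ ..." heuristic: the keyword "Fig."
# followed by a number (the "$" enumerator), then any trailing text.
CAPTION_PATTERN = re.compile(r"^Fig\.\s*(\d+)\b(.*)$")


def is_caption(fragment: str) -> bool:
    """Return True if the text fragment matches the preselected keyword form."""
    return CAPTION_PATTERN.match(fragment.strip()) is not None


# Invented sample fragments, for illustration only.
fragments = [
    "Fig. 3 Exploded view of the pump assembly",  # caption, matched
    "Fig. 3 shows that the pump is sealed.",      # body text, wrongly matched
    "Figure 4: Wiring diagram",                   # caption, missed
    "Plate IV. Cross-section of the valve",       # caption, missed
]

# Only the first two fragments are selected by the heuristic.
detected = [f for f in fragments if is_caption(f)]
```

In this sketch the second fragment is ordinary body text that happens to begin with the keyword (over-inclusion), while the third and fourth fragments are captions that use a different keyword or enumerator style (under-inclusion).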
U.S. Pat. No. 7,852,499, issued Dec. 14, 2010, entitled CAPTIONS DETECTOR, by Hervé Déjean, discloses a caption detector which is designed to recognize textual elements related to an image. The textual elements can include the caption itself, but may also include other textual elements that form a part of the image. Particularly in the case of diagrams and technical illustrations, textual elements that form the caption are difficult to recognize.
Accordingly, there remains a need in the art for improved techniques for identifying or detecting captions.