This invention relates generally to the processing of information during a document layout analysis in an optical character recognition (OCR) system, and more particularly to a method for detecting insets in the structure of a document page so as to further complement the document layout and textual information provided in an optical character recognition system.
A portion of the disclosure of this patent document contains material which is subject to copyright protection, Copyright © 1992-1997 Xerox Corporation. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention is a system for providing information on the structure of a document page so as to complement the document layout and textual information provided in an optical character recognition system. As described by A. Rahgozar and R. Cooperman in application Ser. No. 08/585,142, filed Jan. 11, 1996 (U.S. Pat. No. 5,841,900), it is known to employ OCR systems to divide or segment a digitized document based upon the boundaries of the text, graphic or pictorial regions therein. Document layout analysis is a process by which the information regarding the organization of the document content, i.e., its structure, is extracted from the document image. The structure identifies document entity types (e.g., paragraphs, figures and tables), their properties (e.g., the number of columns in a table), and their interrelations (e.g., a figure is above a caption). Although OCR is an inherent part of this process, it is not intended as an aspect of the present invention. Document layout analysis includes identifying the sections of a page, identifying regions of the document that represent captions to images or equivalent sections on a document page, identifying column boundaries, and employing such layout information to fit pages of differing sizes into a common document format. Indeed, such information is important to state-of-the-art OCR systems that attempt to produce textual output in "reading order" (e.g., appropriately reflecting the flow of text in columns, headers, footers, captions and now insets). It will be further appreciated that the methodology presented here is equivalently applicable to any of a number of document types so long as they may be represented as a digital image divided into segments or regions according to content.
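By way of illustration only, the notion of document structure described above (entity types, their properties, and their interrelations) might be represented as in the following minimal Python sketch. It is not part of the disclosed system. The names Region, is_above and interrelations, the bounding-box fields, and the convention that the y coordinate grows downward are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    """A rectangular page region (hypothetical type, for illustration only)."""
    name: str
    kind: str        # entity type, e.g. "paragraph", "figure", "table", "caption"
    left: int
    top: int         # assumed image coordinates: y grows downward
    right: int
    bottom: int
    properties: Dict[str, int] = field(default_factory=dict)


def is_above(a: Region, b: Region) -> bool:
    """True when a ends at or before the top of b and the two overlap horizontally."""
    overlaps = a.left < b.right and b.left < a.right
    return overlaps and a.bottom <= b.top


def interrelations(regions: List[Region]) -> List[Tuple[str, str, str]]:
    """List every (upper, "above", lower) pair found among the regions."""
    return [
        (a.name, "above", b.name)
        for a in regions
        for b in regions
        if a is not b and is_above(a, b)
    ]


# Example page: a figure with its caption beneath it, and a table beside them.
page = [
    Region("fig1", "figure", 0, 0, 100, 50),
    Region("cap1", "caption", 0, 55, 100, 65),
    Region("tab1", "table", 120, 0, 200, 40, properties={"columns": 3}),
]
relations = interrelations(page)
```

In this example the only interrelation found is that the figure is above its caption; the table neither overlaps them horizontally nor enters into any relation, and its column count is carried as a property.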
Heretofore, a number of commercially available OCR products have disclosed techniques that make use of page segmentation, and then preserve only the size and location of the text regions. For example, a product marketed under the mark Omnipage™ outputs text regions as frames. Other OCR products are known to be limited to identification of a single section per page of a document. The typical system does not preserve section structure and text flow. While some systems may preserve sections under specific circumstances, for example, where the columns in a section do not overlap one another, none of the systems finds insets (regions not part of the text flow) within a document page.
In accordance with the present invention, there is provided a document layout analysis method for determining document structure data from input data including the content and characteristics of regions of at least one page forming the document, the method comprising the steps of: segmenting the regions within the page to identify regions characterized as text, graphics, and rulings; within the text regions, identifying those text regions representing headers, footers, and captions and, for the remaining text regions, analyzing the text regions to identify and characterize certain text regions as insets; recomposing the text regions of the document; combining the text regions into columns; and determining boundaries of at least one column on the page.
In accordance with another aspect of the present invention, there is provided a document layout analysis method for determining document structure data from input data including the content and characteristics of regions of at least one page forming the document, the method comprising the steps of: receiving page data; segmenting the regions within the page to identify regions characterized as text, graphics, and rulings; within the text regions, identifying those text regions representing headers, footers, and captions and, for the remaining text regions, identifying those that are frame and credit insets; recomposing the text regions of the document; identifying, within the recomposed text, any center, column and stray insets; and recalculating the sections of the document so as to produce output data indicative of the reading order of text regions therein.
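To illustrate why characterizing insets matters for the reading-order output recited above, the following minimal Python sketch separates already-labeled text regions into the main text flow and the insets. It is a sketch under stated assumptions and not the disclosed detection method: the InsetKind and TextRegion names, the assumption that each region already carries a column index, and the ordering rule (column first, then top edge) are all introduced here for the example. Only the five inset names (frame, credit, center, column, stray) come from the method described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class InsetKind(Enum):
    """The inset categories named in the method above."""
    FRAME = "frame"
    CREDIT = "credit"
    CENTER = "center"
    COLUMN = "column"
    STRAY = "stray"


@dataclass
class TextRegion:
    """A labeled text region (hypothetical type, for illustration only)."""
    name: str
    column: int                          # column index, assumed already determined
    top: int                             # top edge; y assumed to grow downward
    inset: Optional[InsetKind] = None    # None means part of the main text flow


def reading_order(regions: List[TextRegion]) -> Tuple[List[str], List[str]]:
    """Return (main-flow names in reading order, inset names in input order).

    Assumed ordering rule: left-to-right by column, then top-to-bottom.
    """
    flow = sorted(
        (r for r in regions if r.inset is None),
        key=lambda r: (r.column, r.top),
    )
    insets = [r for r in regions if r.inset is not None]
    return [r.name for r in flow], [r.name for r in insets]


# Example: a two-column page with a column inset and a credit inset.
regions = [
    TextRegion("p2", column=0, top=300),
    TextRegion("pullquote", column=0, top=200, inset=InsetKind.COLUMN),
    TextRegion("p3", column=1, top=50),
    TextRegion("p1", column=0, top=40),
    TextRegion("photo credit", column=1, top=400, inset=InsetKind.CREDIT),
]
flow_names, inset_names = reading_order(regions)
```

Without the inset labels, the pull quote at top=200 would be sorted between p1 and p2 and would interrupt the text flow; with them, the main flow reads p1, p2, p3 and the insets are carried separately.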
One aspect of the invention deals with a basic problem in optical character recognition systems: that of preserving detail of the input document and, in particular, the flow of text (reading order) within the document. More specifically, the present system is directed to a layout analysis system including inset detection that can be used to extend the capability of an OCR package to more accurately recreate the document being processed. Such a system produces output data for a word processor or a reading assistance device by preserving the reading order of the document to facilitate editability and a close approximation of the original appearance of the document.
This aspect is further based on the discovery of a technique that alleviates the text flow problems of traditional OCR systems. The technique provides information to supplement that generated by OCR processes. The technique employs methods to identify and characterize insets that may be found in a document image.
The technique described above is advantageous because it can be adapted to provide or supplement document layout data for any document processed by a system employing the techniques. The techniques of the invention are advantageous because they extend the capability of an OCR system, providing the ability to closely approximate a document in a word processing or speech synthesis environment, and enabling editability and readability while preserving the document's original appearance.