Conventional algorithms exist that determine a given document's physical page layout, such as a given document's organization of distinct columns and sections of reading text. One conventional technique extracts a given document's page layout structure by analyzing the spatial configuration of word positions in the given document and graphically representing those word positions. An image of the document can be segmented by applying a recursive procedure to the graphically represented word positions. The original document's segmentation is indicated wherever a prominent gap exists in the graphically represented word positions. The recursive procedure iterates until no prominent gaps can be detected in the graphically represented word positions. Some attempts have been made to leverage information from such a recursive procedure to assist in reading-order text extraction.
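The recursive procedure described above can be illustrated with a minimal sketch. It is not the implementation of any particular conventional technique. The bounding-box representation of word positions, the names `widest_gap` and `segment`, and the fixed `min_gap` threshold used to decide what counts as a "prominent" gap are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical word bounding box: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def widest_gap(intervals: List[Tuple[float, float]],
               min_gap: float) -> Optional[Tuple[float, float]]:
    """Return the widest empty span (start, end) between the given 1-D
    intervals that is at least min_gap wide, or None if there is none."""
    best = None
    reach = None  # right-most coordinate covered so far
    for lo, hi in sorted(intervals):
        if reach is not None and lo - reach >= min_gap:
            if best is None or lo - reach > best[1] - best[0]:
                best = (reach, lo)
        reach = hi if reach is None else max(reach, hi)
    return best


def segment(boxes: List[Box], min_gap: float = 20.0) -> List[List[Box]]:
    """Recursively split word boxes at the most prominent gap until no
    gap of at least min_gap remains. min_gap is an assumed fixed value."""
    if not boxes:
        return []
    x_gap = widest_gap([(b[0], b[2]) for b in boxes], min_gap)
    y_gap = widest_gap([(b[1], b[3]) for b in boxes], min_gap)
    if x_gap is None and y_gap is None:
        return [boxes]  # no prominent gap left: this is one segment
    # Cut along whichever axis has the wider gap.
    use_x = y_gap is None or (
        x_gap is not None and x_gap[1] - x_gap[0] >= y_gap[1] - y_gap[0])
    gap, axis = (x_gap, 0) if use_x else (y_gap, 1)
    first = [b for b in boxes if b[axis + 2] <= gap[0]]
    second = [b for b in boxes if b[axis] >= gap[1]]
    return segment(first, min_gap) + segment(second, min_gap)


# Two columns separated by a 50-unit gutter; lines are only 5 units apart.
page = [
    (0, 0, 40, 10), (45, 0, 100, 10), (0, 15, 100, 25),   # left column
    (150, 0, 250, 10), (150, 15, 250, 25),                # right column
]
segments = segment(page)
```

In this sketch the 50-unit gutter exceeds the assumed threshold and produces one vertical cut, while the 5-unit spacing between lines does not, so recursion stops with two segments.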
Reading-order text extraction is the process by which text from a given document can be placed in the order the text is meant to be read (e.g. left-to-right, top-to-bottom) in lines of text that span across an entire document—as opposed to lines that only span across a column or section of text.
For example, if a given document, such as a newspaper, has two columns of text, then a human reader intuitively knows to read all the text in the left-most column before reading the text in the right-most column. The human reader thereby begins reading text in the right-most column only after reaching the last word in the bottom line of text in the left-most column. By applying reading-order text extraction to the given document, all the text from the left-most column is placed across a page, left-to-right, and text from the right-most column is then placed on the page after the text from the left-most column, so that no columns or sections appear on the page.
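The two-column example can be sketched in a few lines. This is an illustrative sketch only: the `(text, x, y)` word tuples, the function names `naive_order` and `reading_order`, and the assumption that the column boundary `split_x` is already known are hypothetical and not taken from any particular technique.

```python
from typing import List, Tuple

# Hypothetical word record: (text, x, y), with y growing downward.
Word = Tuple[str, float, float]


def naive_order(words: List[Word]) -> str:
    """Sort purely top-to-bottom, left-to-right across the whole page.
    On a two-column page this interleaves the columns."""
    return " ".join(w[0] for w in sorted(words, key=lambda w: (w[2], w[1])))


def reading_order(words: List[Word], split_x: float) -> str:
    """Emit every word of the left column before any word of the right
    column. split_x is the assumed, already-known column boundary."""
    by_line = lambda w: (w[2], w[1])
    left = sorted((w for w in words if w[1] < split_x), key=by_line)
    right = sorted((w for w in words if w[1] >= split_x), key=by_line)
    return " ".join(w[0] for w in left + right)


page = [
    ("Alpha", 0, 0), ("beta", 50, 0), ("delta", 200, 0),
    ("gamma", 0, 10), ("epsilon", 200, 10),
]
naive_text = naive_order(page)
ordered_text = reading_order(page, split_x=150)
```

Here `naive_text` mixes the two columns line by line, whereas `ordered_text` finishes the left column before starting the right one, which is the order a human reader would follow.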
The ability to accurately extract text in reading-order from segmented documents provides many advantages. Since many documents have different physical layouts, reading-order extraction allows for collecting and organizing text from all documents in a uniform layout. By maintaining the reading-order of the documents while discarding each document's various column and/or section breaks, search algorithms can better find keywords and/or semantic entities that appear in the extracted text.
Current conventional techniques suffer from a variety of deficiencies. Specifically, the recursive procedures used in current techniques fail to take into account character heights and widths that occur before and after the indications of segmentation from a given document. The failure of conventional techniques to take into account such character heights and widths is a critical deficiency as it leads to improperly characterizing line or paragraph breaks as identified segments or identified column breaks in a document.
When conventional techniques cast a paragraph break or line break that occurs within an actual column as the beginning of a new segment or another column, text from the actual column is likely not to be extracted in reading order, since the conventional techniques will behave as though they are extracting text from two different columns with unrelated text.
Furthermore, the mechanisms in current techniques for finding indications of segmentation in a given document fail to account for word starting frequencies, break alignment and/or relative straightness. Thus, conventional techniques often mischaracterize a given document's physical layout. Since conventional techniques risk finding incorrect segmentation of a document's physical layout, the accuracy of reading-order text extraction will be suspect.