As more and more users turn to computer networks such as the Internet and particularly the World Wide Web (hereinafter the “Web”) for information, content providers are increasingly converting traditional content (e.g., printed materials such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, documents, and the like) to electronic form.
For some content providers, a quick and simple way to convert printed content to an electronic form for publication is to create a digital image of the printed content, i.e., a digital image containing a representation of text. As those skilled in the art will appreciate, this type of conversion is typically performed through the use of a scanner. However, while simply generating a digital image (or images) of printed content can be accomplished quickly, the resulting digital images might not be particularly well suited for various scenarios. For example, digital images corresponding to the conversion of pages of a book into electronic form may not be well suited to some viewing scenarios. Of course, the reasons that a digital image is not always an optimal form/format of delivery are many, but include issues regarding the clarity or resolution of digital images, the large size of a digital image file and, perhaps most importantly, the rendering of the digital images on various sized displays. For example, traditional digital images may be of a fixed size and arrangement such that a computer user must frequently scroll his or her viewer to read the text. In other words, text in a digital image is not reflowable with regard to the boundaries of the viewer.
Another approach to converting printed content into a digital form relates to converting the print images into corresponding digital text. Digital text comprises values corresponding to a printable character set, including alphanumeric characters. Exemplary character sets include the ASCII, EBCDIC, and Unicode character sets. However, converting printed content into digital text requires greater effort on the part of the content provider than simply generating a digital image. More particularly, the content provider must first generate (at least temporarily) a digital image of the content and then convert the text in the digital image into digital text using optical character recognition (OCR) software. As those skilled in the art will appreciate, OCR software scans a digital image and, in so doing, identifies digital characters from the pixels in the digital image. Unfortunately, OCR software can and often does make mistakes when matching collections of pixels to corresponding characters.
One approach to converting printed content into reflow digital content relates to processing content in a digital image into identifiable segments. An example of such an approach is set forth in co-pending and commonly assigned U.S. patent application Ser. No. 11/392,213, entitled “Converting Digital Images Containing Text to Token-Based Files for Rendering,” issued on Dec. 2, 2008 as U.S. Pat. No. 7,460,710, which is incorporated herein by reference. As described therein, the content in a digital image is broken up into “glyphs,” e.g., identifiable segments of content. In turn, the glyphs can be scaled and/or reflowed within the boundaries of a viewer. Generally described, “reflow” relates to the adjustment of line segmentation and arrangement for a set of segments. Digital content that can be rearranged according to the constraints of a particular viewer and without scaling can “reflow” within the viewer, and is reflow content.
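The notion of reflow described above can be illustrated with a minimal sketch. It assumes each segment is reduced to a single pixel width and that lines are filled greedily; the function name `reflow`, the `gap` parameter, and the greedy strategy are illustrative assumptions and are not taken from the referenced patent.

```python
from typing import List


def reflow(widths: List[int], viewer_width: int, gap: int = 1) -> List[List[int]]:
    """Greedily group segment indices into lines that fit within viewer_width.

    widths       -- width of each segment (hypothetical units, e.g. pixels)
    viewer_width -- available width of the viewer
    gap          -- assumed spacing placed between adjacent segments on a line
    """
    lines: List[List[int]] = []
    current: List[int] = []
    used = 0  # width consumed on the current line, including gaps
    for index, width in enumerate(widths):
        needed = width if not current else used + gap + width
        if current and needed > viewer_width:
            # The segment does not fit: close the current line, start a new one.
            lines.append(current)
            current = [index]
            used = width
        else:
            # The segment fits, or the line is empty; an oversized segment
            # still gets a line of its own rather than being dropped.
            current.append(index)
            used = needed
    if current:
        lines.append(current)
    return lines


segment_widths = [4, 3, 5, 2, 6]
narrow = reflow(segment_widths, viewer_width=10)
wide = reflow(segment_widths, viewer_width=20)
```

The same sequence of segments yields different line breaks for the narrow and wide viewers, without any segment being scaled, which is the property the text calls reflow.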
With any automated conversion process, the accuracy and presentation of the digital content are important. This is especially true for content providers who intend to offer their converted printed content for money. Unfortunately, nearly all printed content includes regions or blocks in the content which, if included in the reflow body of content or modified from a particular spatial arrangement, could corrupt the converted reflow content or otherwise degrade the visual presentation of the converted content. Examples of these types of “non-reflow” regions/blocks include, but are not limited to, headers, footers, sidebars, graphs, graphics, mathematical equations, tables, program listings, bulleted or numbered lists, poetry, and, in general, regions in which the spatial arrangement of the content (textual or otherwise) is important to that content.
In regard to “non-reflow” blocks of content, it should be understood that this term is used generically in regard to blocks of content that, for one reason or another, should not be “reflowed,” irrespective of the reason that the block of content should not be reflowed. More particularly, the term “non-reflow blocks of content” includes both out-of-flow blocks of content (where the content is related to but falls outside of the regular flow of content, including sidebars, headers, and footers) and spatial-dependent non-reflow blocks of content (where the spatial arrangement of the content precludes it from being reflowed) such as scientific formulas, lists, tables, and the like.
Quite frequently, non-reflow blocks can include some textual content. In these circumstances, the inclusion of the textual content with the reflow body of content can corrupt the integrity of the content. To further illustrate this point, FIG. 1 is a pictorial diagram illustrating a digital image 100 of printed content that includes both reflow and non-reflow blocks of content. More particularly, digital image 100 includes two paragraphs of text, paragraphs 102 and 104, which generally represent the reflow content of the digital image 100. Additionally, digital image 100 includes various non-reflow regions/blocks, including header 106, caption 108, graphic 110, separator line 112, and footnote 114, which is referenced from the text via footnote number 116.
With regard to content from non-reflow blocks corrupting the integrity of reflow content, the first sentence of paragraph 102, including text (not shown) from the previous page of content, if converted correctly, should read as follows:

        Half the information has been used to pad and rearrange (modulate) the data in sequences and patterns designed to be accurately readable as a string of pulses.

However, if the “text” of header 106 were to be erroneously included into/with the reflow content of paragraph 102, the above sentence would read:

        Half the information has been used to pad and rearrange (modulate) the data in 180 Chapter 4 sequences and patterns designed to be accurately readable as a string of pulses.

Clearly, adding “180 Chapter 4” to the reflow content corrupts the converted content and creates a scenario that would merely confuse a reader. As can be seen from this simple example, keeping the data of non-reflow blocks (such as header 106) from corrupting the reflow content is critical to the integrity of the converted content. More generally, excluding content in non-reflow blocks from being processed in the conversion of the general reflow content of a digital image 100 is essential to the integrity of the resultant digital content.
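The header example above can be reproduced with a short sketch. It assumes each block of content has already been labeled with a kind, which is precisely the detection step the text goes on to describe as difficult to automate; the `Block` type, its `kind` labels, and the `reflow_text` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # hypothetical label, e.g. "paragraph", "header", "footnote"
    text: str


# Assumption for this sketch: only paragraphs belong to the reflow body.
REFLOW_KINDS = {"paragraph"}


def reflow_text(blocks: List[Block], exclude_non_reflow: bool = True) -> str:
    """Join block text in reading order, optionally skipping non-reflow blocks."""
    parts = [
        block.text
        for block in blocks
        if not exclude_non_reflow or block.kind in REFLOW_KINDS
    ]
    return " ".join(parts)


# Reading order across the page break: end of the previous page, the header
# of the new page, then the continuation of the sentence in paragraph 102.
blocks = [
    Block("paragraph", "Half the information has been used to pad and "
                       "rearrange (modulate) the data in"),
    Block("header", "180 Chapter 4"),
    Block("paragraph", "sequences and patterns designed to be accurately "
                       "readable as a string of pulses."),
]

correct = reflow_text(blocks)                              # header skipped
corrupted = reflow_text(blocks, exclude_non_reflow=False)  # header included
```

With the header excluded, the sentence reads as intended; with it included, “180 Chapter 4” lands in the middle of the sentence, matching the corrupted reading shown above.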
Unfortunately, creating automated procedures for detecting non-reflow blocks of content, especially when the non-reflow blocks of content include textual content that could be converted as reflow content, has proven to be elusive. As such, manual editing is currently required to edit/finalize the converted digital content before it can be presented for “consumer” use.
Aspects of the present invention are directed at identifying and processing various types of non-reflow blocks of content in a digital image 100 such that the reflow content can be converted without corruption by the content of the non-reflow blocks. Other aspects of the present invention are further directed at identifying converted content that requires manual editing, thereby focusing and reducing the amount of manual editing to be performed.