As more and more users turn to computer networks such as the Internet and World Wide Web (hereinafter the “Web”) for information, content providers are increasingly converting traditional content (e.g., printed materials such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, documents, and the like) to electronic form.
For some content providers, a quick and simple way to convert printed content to an electronic form for publication is to create a digital image of the printed content, i.e., a digital image containing a representation of text. As those skilled in the art will appreciate, this type of conversion is typically performed through the use of a scanner. However, while simply generating a digital image (or images) of printed content can be accomplished quickly, the resulting digital images might not be particularly well suited for various scenarios. For example, digital images corresponding to the conversion of pages of a book into electronic form may not be well suited to some viewing scenarios. The reasons that a digital image is not always an optimal form/format of delivery are many, and include issues regarding the clarity or resolution of digital images, the large size of a digital image file and, perhaps most importantly, the rendering of the digital images on displays of various sizes. For example, traditional digital images may be of a fixed size and arrangement such that a computer user must frequently scroll his or her viewer to read the text. In other words, the text of a digital image cannot be “reflowed” within the boundaries of the viewer. Generally described, “reflow” relates to the adjustment of line segmentation and arrangement for a set of segments. Digital content, such as digital text, that can be rearranged according to the constraints of a particular viewer, without the necessity of scaling, can “reflow” within the viewer, and is referred to herein as reflow content.
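By way of illustration only, the notion of reflow described above can be sketched in a few lines of Python. The sketch assumes that each segment is reduced to a fixed width and that lines are filled greedily from left to right; the function name `reflow` and the parameters `segment_widths`, `viewer_width`, and `gap` are hypothetical and are not drawn from the referenced application.

```python
from typing import List, Sequence


def reflow(segment_widths: Sequence[int], viewer_width: int, gap: int = 1) -> List[List[int]]:
    """Greedily group segment indices into lines that fit within viewer_width.

    Illustrative assumption: segments keep their original size (no scaling)
    and are separated by a fixed horizontal gap.
    """
    lines: List[List[int]] = []
    current: List[int] = []
    used = 0  # width consumed on the current line, including gaps
    for index, width in enumerate(segment_widths):
        needed = width if not current else used + gap + width
        if current and needed > viewer_width:
            # The segment does not fit; close the line and start a new one.
            lines.append(current)
            current = [index]
            used = width
        else:
            # The segment fits, or the line is empty and must accept it anyway.
            current.append(index)
            used = needed
    if current:
        lines.append(current)
    return lines


segments = [4, 3, 5, 2, 6]
narrow = reflow(segments, viewer_width=10)
wide = reflow(segments, viewer_width=20)
```

In this sketch the same five segments are arranged into three lines for the narrower viewer and two lines for the wider one, without any segment being scaled. A fixed-size digital image offers no comparable rearrangement.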
A novel approach to converting printed content into reflow digital content relates to processing content in a digital image into identifiable segments. An example of such an approach is set forth in co-pending and commonly assigned patent application entitled “Method and System for Converting a Digital Image Containing Text to a Token-Based File for High-Resolution Rendering,” filed Mar. 28, 2006, U.S. patent application Ser. No. 11/392,213, which is incorporated herein by reference. As described in this reference, the content in a digital image is categorized into “glyphs,” e.g., identifiable segments of content that can be scaled and/or reflowed within the boundaries of a viewer.
When presenting converted content that can be reflowed in a viewer according to viewer constraints, it is desirable to recognize the similarities in paragraph layout such that similarly formed paragraphs are reflowed in a similar manner. While a human can readily recognize patterns, context, and, therefore, similarities among the layout and flow of paragraphs on a printed page, determining the similarities via a computer is often problematic. Moreover, the level of difficulty increases when the paragraphs are organized into anything but the simplest form. For example, recognizing similarly formed paragraphs organized in a multi-column format is extremely difficult. Nevertheless, as discussed above, recognizing similarly formed paragraphs is very desirable.