As more and more users turn to computer networks such as the Internet and World Wide Web (hereinafter the “Web”) for information, content providers are increasingly converting traditional content (e.g., printed materials such as books, magazines, newspapers, newsletters, manuals, guides, references, articles, reports, documents, and the like) to electronic form.
For some content providers, a quick and simple way to convert printed content to an electronic form for publication is to create a digital image of the printed content, i.e., a digital image containing a representation of text. As those skilled in the art will appreciate, this type of conversion is typically performed through the use of a scanner. However, while simply generating a digital image (or images) of printed content can be accomplished quickly, the resulting digital images are not well suited to every viewing scenario. For example, digital images corresponding to the pages of a book may be difficult to read on some displays. The reasons that a digital image is not always an optimal form/format of delivery are many, and include the clarity or resolution of digital images, the large size of a digital image file and, perhaps most importantly, the rendering of the digital images on various sized displays. For example, traditional digital images may be of a fixed size and arrangement such that a computer user must frequently scroll his or her viewer to read the text. In other words, the text of a digital image cannot be “reflowed” within the boundaries of the viewer. Generally described, “reflow” relates to the adjustment of line segmentation and arrangement for a set of segments. Digital content, such as digital text, that can be rearranged according to the constraints of a particular viewer, without the necessity of scaling, can “reflow” within the viewer, and is reflow content.
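By way of illustration only, the adjustment of line segmentation described above can be sketched as a greedy line-breaking routine. The function name `reflow`, the `gap` parameter, and the segment widths below are assumptions introduced for this sketch and are not taken from the referenced application; an actual viewer may use a different strategy.

```python
from typing import List, Sequence


def reflow(widths: Sequence[int], viewer_width: int, gap: int = 0) -> List[List[int]]:
    """Greedily group segments into lines that fit within viewer_width.

    widths holds the width of each segment in display units; the result
    is a list of lines, each a list of segment indices in reading order.
    """
    lines: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, width in enumerate(widths):
        # Width the current line would occupy if this segment were added.
        needed = width if not current else used + gap + width
        if current and needed > viewer_width:
            # Segment does not fit: close the line and start a new one.
            lines.append(current)
            current = [index]
            used = width
        else:
            current.append(index)
            used = needed
    if current:
        lines.append(current)
    return lines


# Hypothetical segment widths, arranged for two different viewer sizes.
segment_widths = [40, 30, 50, 20, 60]
wide = reflow(segment_widths, viewer_width=200, gap=5)
narrow = reflow(segment_widths, viewer_width=80, gap=5)
```

The same segments are arranged into fewer lines in the wider viewer and more lines in the narrower one, and no segment is scaled in either case.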
A novel approach to converting printed content into reflow digital content relates to processing content in a digital image into identifiable segments. An example of such an approach is set forth in co-pending and commonly assigned patent application entitled “Method and System for Converting a Digital Image Containing Text to a Token-Based File for High-Resolution Rendering,” filed Mar. 28, 2006, U.S. patent application Ser. No. 11/392,213, which is incorporated herein by reference. As described in this reference, the content in a digital image is categorized into “glyphs,” e.g., identifiable segments of content that can be scaled and/or reflowed within the boundaries of a viewer.
One of the issues with creating documents of reflowable content is the resultant size of the document. For display purposes, it is preferable to store the reflowable content in a structured document, such as an XML document, that facilitates easy identification of structure, such as pages, paragraphs, words, etc. However, most standard document formats, including XML, are largely text based and thus include a great deal of excess data that supports the document format but is not necessary to the actual content itself. For example, FIG. 1A is a pictorial diagram illustrating portions of an exemplary XML source document 100 of reflowable content. The XML document 100 includes data tables, such as global glyph table 102, but those skilled in the art will recognize that the values stored therein are stored as textual representations of the actual values, which leads to “bloated” data areas.
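The size difference between textual and packed representations can be illustrated with a short sketch. The element names `word` and `glyph`, the `id` attribute, and the little-endian 16-bit layout are assumptions made for this example only; they are not the markup of FIG. 1A or any format defined in this document.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical glyph identifiers for a single word.
glyph_ids = [116, 117, 118, 119, 120]

# Text-based form: each value is written out as characters inside markup.
word = ET.Element("word")
for gid in glyph_ids:
    ET.SubElement(word, "glyph", id=str(gid))
xml_bytes = ET.tostring(word)

# Packed form (assumed layout): a 16-bit count followed by 16-bit values.
binary_bytes = struct.pack("<H", len(glyph_ids)) + struct.pack(
    f"<{len(glyph_ids)}H", *glyph_ids
)

# The packed form can be decoded back to the same identifiers.
(count,) = struct.unpack_from("<H", binary_bytes, 0)
decoded = list(struct.unpack_from(f"<{count}H", binary_bytes, 2))

print(len(xml_bytes), len(binary_bytes))
```

Under these assumptions the packed form occupies 12 bytes for five identifiers, while the markup form is several times larger because every value carries its tag and attribute text.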
FIG. 1A also illustrates the structural nature of printed material converted to reflowable content. For example, assuming that source document 100 represents a printed book converted to reflowable content, the document is structurally organized into pages, as indicated by pages 104 and 106. Each page is similarly segmented into one or more paragraphs, as indicated by paragraphs 108 and 110. Further structure in a reflowable document includes a list of words, such as words 112 and 114, within each paragraph. Moreover, because the words of the exemplary source document 100 are represented as glyphs, consistent with the nature of reflowable content, each word is composed of a series of glyphs, such as glyphs 116-120.
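The page, paragraph, word, and glyph hierarchy described above can be modeled, as a minimal sketch, with nested records. The class names, the placeholder glyph table entries, and the choice to have each word reference the global glyph table by index are assumptions made for illustration; FIG. 1A may organize these relationships differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Assumed: each entry is an index into the document's global glyph table.
    glyph_ids: List[int] = field(default_factory=list)


@dataclass
class Paragraph:
    words: List[Word] = field(default_factory=list)


@dataclass
class Page:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Document:
    glyph_table: List[str] = field(default_factory=list)
    pages: List[Page] = field(default_factory=list)


def count_glyph_references(doc: Document) -> int:
    """Walk pages, paragraphs, and words, counting glyph references."""
    return sum(
        len(word.glyph_ids)
        for page in doc.pages
        for paragraph in page.paragraphs
        for word in paragraph.words
    )


# A hypothetical two-page document with placeholder glyph table entries.
doc = Document(
    glyph_table=["g0", "g1", "g2"],
    pages=[
        Page(paragraphs=[Paragraph(words=[Word([0, 1, 2]), Word([2, 1])])]),
        Page(paragraphs=[Paragraph(words=[Word([0])])]),
    ],
)
total_references = count_glyph_references(doc)
```

Keeping the structure explicit in this way is what allows a viewer to locate pages, paragraphs, and words directly, independent of how the records are ultimately serialized.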
It should be understood, however, that since the page content is represented by glyphs, the definition of a “word” of reflowable content may or may not correspond to what would be considered a word in a normal textual context. More particularly, when the term “word” is used in this document with reference to glyph-represented reflowable content, it should be understood to refer to a collection or grouping of symbols and/or characters that are treated as a single unit. For example, FIG. 1B is a pictorial diagram of reflowable content that contains several textual words that may be grouped in a glyphing process as a single “word.” The textual words “Stryker Sales,” identified by box 152, may be grouped in a glyphing process as a single reflowable word, rather than as the two textual words that a human reader would likely see. Similarly, without understanding the context, a glyphing process would treat the text “bar-triggering,” identified by box 154, as a single reflowable word even though a human reader would likely not. Moreover, when italicized textual words are converted to reflowable words, the italicized reflowable version of the text will very likely be treated as a different word than the upright (normal) reflowable version. In other words, the italicized reflowable word “Stryker Sales” will be separate from the upright reflowable word “Stryker Sales.”
With regard to the term “page,” while a page of content may correspond to the textual content imaged onto a paper sheet, the present invention is not so limited. Instead, a “page” of content corresponds to a section or segment of content intended for display as a whole.
Yet another issue with regard to using standard document formats relates to security and/or control over the reflowable content. For example, as those skilled in the art will appreciate, a document of reflowable content written to an XML document may be viewed by any number of viewers, thereby resulting in the loss of control by those who converted the document.