Electronic documents are created in many different ways. For example, desktop application programs, such as Microsoft Word, QuarkXPress, and Adobe InDesign, are frequently used to create electronic documents. These electronic documents contain various types of content arranged with a particular layout and style.
Often, it is desirable to preserve the graphic appearance of an electronic document. Image-based formats, such as TIFF, GIF, JPEG, and the Portable Document Format (PDF), preserve the appearance of electronic documents. Electronic documents stored in such image-based formats, however, typically have large storage requirements. To reduce these storage requirements, many document analysis approaches have been developed for separating the structure of electronic documents that are stored in an image-based format from their contents. The structural information may be used to infer a semantic context that is associated with various contents in the electronic document or to convert the electronic document into an editable file format.
Template-based electronic document formats describe a predefined layout arrangement of fields that can accept variable content. In some approaches, the size, shape, and placement of the template fields are fixed. In another approach, an electronic document is represented as a template that contains predefined content areas whose positions and sizes may be varied within specified ranges. In particular, the content areas are defined by variables with respective value domains that define the size, position, and content of the content areas. A user specifies constraints that limit variable ranges and define relations between variables. A constraint solver then generates a final document that satisfies all of the specified constraints.
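The constraint-based template approach can be illustrated with a minimal sketch. The variable names, value domains, page height, and the brute-force search below are all assumptions made for illustration; they are not taken from any particular template format or solver, and a practical constraint solver would use a far more efficient search.

```python
from itertools import product

# Hypothetical template variables, each with a finite value domain (in points).
domains = {
    "title_height": range(40, 81, 20),    # 40, 60, 80
    "body_height": range(200, 401, 100),  # 200, 300, 400
    "gap": range(10, 31, 10),             # 10, 20, 30
}

PAGE_HEIGHT = 500  # assumed page height, for illustration only

# User-specified constraints: each one limits ranges or relates variables.
constraints = [
    # The title, the gap, and the body must fit on the page together.
    lambda v: v["title_height"] + v["gap"] + v["body_height"] <= PAGE_HEIGHT,
    # The body must be at least four times as tall as the title.
    lambda v: v["body_height"] >= 4 * v["title_height"],
]


def solve(domains, constraints):
    """Return the first assignment satisfying every constraint, or None."""
    names = list(domains)
    # Exhaustively try every combination of domain values (brute force).
    for values in product(*(domains[name] for name in names)):
        candidate = dict(zip(names, values))
        if all(check(candidate) for check in constraints):
            return candidate
    return None


layout = solve(domains, constraints)
print(layout)
```

Here each content area is reduced to a single height variable. A fuller model would give each area position, size, and content variables, as the approach describes, and would typically select among satisfying layouts rather than stopping at the first one found.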
None of the approaches described above, however, provides a way to automatically capture the graphic appearance of an electronic document in a form that can accommodate variable content. With respect to textual content in particular, it is difficult to infer a graphic designer's intended layout from the actual position of the textual content, especially on the unjustified sides of text blocks and where text lines flow around neighboring logical blocks in the electronic document.