1. Field of the Invention
The present invention relates to the conversion of data between different formats, and more particularly to such conversion between different forms of compound document architecture.
2. Data Architectures--General
In word processing, desk-top publishing, and the like, a variety of data formats have been developed. The simplest formats are designed to deal only with textual matter. Even with such an apparently simple situation, there are many aspects which may or may not be dealt with, and which, if they are dealt with, may be dealt with in a variety of ways. These include, for example, word wrapping, right justification, headers and footers, and margin control. More advanced formats include provision for things like footnotes and paragraph and section numbering, tabular information, geometric graphics, and bit image graphics.
As the nature of such formats becomes increasingly complex, it becomes convenient to distinguish between the general principles or rules of a format and the details of the implementation of such a set of principles. The set of principles is commonly termed an architecture, and an implementation is often termed an interchange format of that architecture.
With a simple architecture, such as that provided by a basic word processing system, the structural features of a document formatted under that architecture are very simple. With such an architecture, the document will normally be formatted as a stream of alphanumeric characters--the text--with whatever structural features it has being embedded within that stream. That is, features such as line returns will be represented by control codes occurring at the appropriate points within the character stream. Similarly, such matters as the starting and stopping of italics and bold-face can be represented by the inclusion of control characters in the character stream. This style of architecture can obviously be extended to more complicated situations. For example, if the architecture includes pagination, the codes for page endings can obviously be included in the character stream.
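By way of illustration only, the following minimal sketch shows how structural features embedded in a character stream might be separated from the text. The particular control code values and the name `split_stream` are assumptions made for this example; each real word processing system defines its own codes.

```python
# Hypothetical control codes, chosen only for this illustration.
ITALIC_ON = "\x01"
ITALIC_OFF = "\x02"
BOLD_ON = "\x03"
BOLD_OFF = "\x04"
LINE_RETURN = "\n"
PAGE_END = "\x0c"

CONTROL_NAMES = {
    ITALIC_ON: "italic_on",
    ITALIC_OFF: "italic_off",
    BOLD_ON: "bold_on",
    BOLD_OFF: "bold_off",
    LINE_RETURN: "line_return",
    PAGE_END: "page_end",
}


def split_stream(stream):
    """Split a character stream into ("text", run) and ("control", name) items."""
    items = []
    run = []  # characters of the text run currently being collected
    for ch in stream:
        if ch in CONTROL_NAMES:
            if run:
                # A control code ends the current text run.
                items.append(("text", "".join(run)))
                run = []
            items.append(("control", CONTROL_NAMES[ch]))
        else:
            run.append(ch)
    if run:
        items.append(("text", "".join(run)))
    return items


# A short stream: italics around one word, a line return, and a page ending.
example = "A " + ITALIC_ON + "fine" + ITALIC_OFF + " day" + LINE_RETURN + "End" + PAGE_END
items = split_stream(example)
```

The text and the structure share a single stream here, so a reader of the stream must recognise every control code in order to recover the text alone.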
The appropriate information regarding such matters as line and page length must obviously be specified. In very simple cases, this can sometimes be dealt with by the operation of whatever printer the document is sent to. However, it is more usual for such information to be included within the document. If the line and page lengths are unchanged throughout the document, then this information will normally be included at the beginning of the document, preceding the text, and forming a header block.
This technique naturally extends to the provision of similar control blocks at appropriate points within the information stream where the format (in the sense of such matters as margins and page length) of the document changes. A simple example is the inclusion of a quotation in the form of a distinct paragraph, set in a smaller typeface than the main text and with its margins inset from those of the main text. This will be preceded by a control block setting the typeface and margins, and will naturally be followed by another control block restoring the original typeface and margins.
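The header block and the in-stream control blocks just described can be sketched as follows. This is a hedged illustration, not a description of any particular format: the setting names (`typeface`, `left_margin`, `right_margin`), their values, and the function `apply_control_blocks` are all assumptions introduced for the example.

```python
def apply_control_blocks(header, body):
    """Pair each text run with the settings in force where it occurs.

    header: dict of initial settings (the header block preceding the text).
    body:   list whose items are either str (text) or dict (a control block).
    """
    current = dict(header)  # copy, so the caller's header is left unchanged
    result = []
    for item in body:
        if isinstance(item, dict):
            # A control block changes the settings from this point onward.
            current.update(item)
        else:
            result.append((dict(current), item))
    return result


# Hypothetical settings for the main text.
header = {"typeface": "roman-12", "left_margin": 10, "right_margin": 70}

# An inset quotation: one control block before it, another restoring the
# original typeface and margins after it.
body = [
    "Main text.",
    {"typeface": "roman-10", "left_margin": 15, "right_margin": 65},
    "Quoted passage.",
    {"typeface": "roman-12", "left_margin": 10, "right_margin": 70},
    "Main text resumes.",
]

runs = apply_control_blocks(header, body)
```

In this sketch the restoring control block must repeat the original values explicitly, since the stream itself carries no memory of earlier settings.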
The discussion so far has largely assumed that the document is fully formatted. However, it is often convenient to separate the informational contents of the document from its final formatting. The document in its initial state is in what is termed processable form. (The term "informational contents" is to be taken in the broad sense of including the general parameters of the document, such as page headers and footers, typeface, paragraph insets, narrowed margins, etc.)
With a processable document, the details of the actual page layout (primarily page width or line length and page length) are left undefined. When the document is finally printed as a hard copy, the printing system has to have such formatting information supplied to it, and the system has to calculate the positions of line and page endings and make appropriate adjustments (e.g. in page numbering and footnote positioning).
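The calculation of line and page endings for a processable document can be sketched as below. This is a simplified illustration assuming a fixed-width typeface, a line length measured in characters and a page length measured in lines; the function name `lay_out` is hypothetical, and matters such as page numbering and footnote positioning are omitted.

```python
import textwrap


def lay_out(paragraphs, line_length, lines_per_page):
    """Break processable text into pages, each page being a list of lines."""
    lines = []
    for para in paragraphs:
        # Line endings are calculated here, not stored in the document.
        lines.extend(textwrap.wrap(para, width=line_length))
        lines.append("")  # blank line separating paragraphs
    if lines:
        lines.pop()  # drop the blank line after the last paragraph
    # Page endings fall after every lines_per_page lines.
    return [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]


# The same processable document laid out for two different line lengths.
document = ["the quick brown fox jumps over the lazy dog"]
narrow_pages = lay_out(document, line_length=10, lines_per_page=2)
wide_pages = lay_out(document, line_length=20, lines_per_page=2)
```

The document itself is identical in both calls; only the layout parameters supplied at formatting time differ.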
A major advantage of using documents in processable form is that editing of the document is simplified, for two reasons. One is that the editing of the document (usually on a word processor) does not have to take account of the details of the layout of the document on the page. This means that the operation of the word processor is faster, particularly in situations where the operator jumps between widely separated locations in the document. (If the document were maintained in fully formatted form, the system would have to reformat the whole of the text between the relevant locations in moving forward from one to the other.)
The other reason is that the document can be transferred between different systems much more easily. Such different systems may involve different hardware, or something apparently as minor as a slight change of printer character style (which will require recalculation of line lengths), or perhaps simply a change in the size of paper. Again, if the document were fully formatted, such a change would involve reformatting the whole document, whereas with the document in processable form, no change is required to the document.
It will be assumed from here on that documents are in processable form.