Document assembly refers to the generation of an instance document from one or more source documents. In general, a source document is a generic template document, and additional information specific to the relevant circumstances is required to generate an instance document from one or more source documents. This additional information can originate from a user and/or some other data source. Document assembly software has been developed for generating documents that typically contain large amounts of common text or data with a smaller amount of varying detail text or data. Document assembly software is useful because, where a suitable source document exists, it enables instance documents to be produced more efficiently than may otherwise be the case using a standard word processor. A form letter is perhaps the simplest and most familiar example of a source document, and can be used to generate instance letters for a number of recipients. An instance letter is typically generated from a single source document and addressee information, such as the addressee's first and last names, title, and address. More complex instance documents, such as legal or financial documents, can be generated from one or more source documents, based on information specific to the parties involved and the circumstances of their relationships.
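By way of illustration only, the form letter case can be sketched in a few lines of Python. The letter text, the field names (`title`, `first_name`, `last_name`, `address`) and the function `assemble_letter` are hypothetical and chosen for this example; the sketch assumes the standard library `string.Template` placeholder syntax rather than any particular document assembly product.

```python
from string import Template

# Hypothetical source document: common text with placeholders for the
# varying detail. The wording and field names are assumptions.
SOURCE_LETTER = Template(
    "Dear $title $last_name,\n"
    "\n"
    "Thank you, $first_name, for your recent enquiry.\n"
    "We will write to you at $address.\n"
)


def assemble_letter(addressee: dict) -> str:
    """Generate one instance letter from the source letter and addressee data."""
    return SOURCE_LETTER.substitute(addressee)


# Hypothetical addressee information, one record per recipient.
addressees = [
    {"title": "Dr", "first_name": "Ada", "last_name": "Lovelace",
     "address": "12 Example Street"},
    {"title": "Mr", "first_name": "Alan", "last_name": "Turing",
     "address": "34 Sample Road"},
]

# One instance letter is produced per addressee from the single source document.
instance_letters = [assemble_letter(a) for a in addressees]
```

Here the single source document supplies the common text, and each addressee record supplies the smaller amount of varying detail.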
A source document is represented in a document assembly system in some data format. Common data formats (only some of which are commonly used for source documents for a document assembly system) include plain text, Microsoft's proprietary Microsoft Word “doc” format, the rich text format (RTF), portable document format (PDF), and hypertext markup language (HTML). A data format which is now being used for a wide variety of applications is extensible markup language (XML), as described at http://www.w3.org/XML. An XML document combines the text of a document with tags that mark up that text into logical elements. As a data format for storing documents generally, XML has a number of advantages over other data formats. In particular, XML can be used to mark up text in a way that tags it with its meaning or purpose, and applications can manipulate the text on the basis of these tags. Tools for parsing and manipulating XML data are available from a variety of vendors.
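As a minimal sketch of how an application might manipulate text on the basis of its tags, the following Python fragment parses a small XML document with the standard library `xml.etree.ElementTree` module. The element and attribute names (`lease`, `party`, `role`, `rent`) and the values are hypothetical and are not drawn from any particular grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: the tags record what each piece of text means.
LEASE_XML = """
<lease>
  <party role="landlord">Example Holdings Ltd</party>
  <party role="tenant">Jane Citizen</party>
  <rent period="year">24000</rent>
</lease>
"""

root = ET.fromstring(LEASE_XML)

# Select text by its meaning (element name and attribute), not by its position.
parties = {p.get("role"): p.text for p in root.findall("party")}
yearly_rent = int(root.findtext("rent"))
```

Because the text is selected by element name and attribute value, the same code continues to work if the elements are reordered or further elements are added.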
XML allows a document grammar to be defined which an XML document must match if it is to be said to be valid with respect to that grammar. If a document is valid, then systems that can handle documents matching that grammar can manipulate those documents taking advantage of the grammar. Such a grammar is often contained within a “document type definition” (DTD) or “XML schema”. There are many different grammars for XML documents that are designed to meet specific needs. For example, the DocBook document type definition, documented at http://www.oasis-open.org/docbook/xml/, was designed to meet very general documentation requirements.
A document assembly system preferably performs a number of basic functions. First, it determines, on the basis of data provided to it, which parts of a source document to include in or exclude from a resulting instance document. For example, a paragraph, sentence or phrase might only be included in a legal contract if there is a guarantor. Second, the system can include in the instance document text which is not present in the source document: for example, a date, an address, or, where a user of the system enters a yearly rental, the amount calculated to be payable per calendar month. In order to be able to provide these two basic functions, a document assembly system stores (i) information as to which parts of the source document may be included in or excluded from the instance document, and (ii) information as to the locations in the document in which additional text may be inserted.
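The two basic functions can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any existing product: the source document is modelled as a hypothetical list of parts, each pairing an optional inclusion condition with text whose `$` placeholders mark the insertion locations, and the names `SOURCE_PARTS`, `assemble`, `has_guarantor` and `monthly_rent` are invented for the example.

```python
from string import Template

# Hypothetical source document. Each part is (condition, text):
#   condition - name of a data item that must be true for the part to be
#               included, or None if the part is always included;
#   text      - passage whose $placeholders mark the insertion locations.
SOURCE_PARTS = [
    (None, "This lease is made on $date."),
    (None, "The rent is $yearly_rent per year, "
           "payable as $monthly_rent per calendar month."),
    ("has_guarantor",
     "The guarantor, $guarantor_name, guarantees the tenant's obligations."),
]


def assemble(parts, data):
    """Build an instance document from source parts and supplied data."""
    values = dict(data)
    # Inserted text that is in neither the source document nor the user's
    # input: the monthly amount calculated from the yearly rental.
    values["monthly_rent"] = f"{data['yearly_rent'] / 12:.2f}"
    # First function: decide which parts to include or exclude.
    included = [text for cond, text in parts if cond is None or data.get(cond)]
    # Second function: insert additional text at the marked locations.
    return "\n".join(Template(text).substitute(values) for text in included)


with_guarantor = assemble(SOURCE_PARTS, {
    "date": "1 March 2001", "yearly_rent": 24000,
    "has_guarantor": True, "guarantor_name": "John Citizen",
})
without_guarantor = assemble(SOURCE_PARTS, {
    "date": "1 March 2001", "yearly_rent": 24000, "has_guarantor": False,
})
```

In this sketch the inclusion conditions and insertion locations are stored alongside the text itself, which is the arrangement whose drawbacks for XML source documents are discussed in the following paragraphs.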
It is also desirable to be able to repeat a passage of text a specified number of times, but with different data inserted at certain points within the passage in each repetition. This requires the ability to identify the passage to be repeated, the number of times to repeat it, and the data to be inserted into each repetition.
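Repetition can be illustrated in the same manner. In the following sketch, which is an assumption-laden example rather than a definitive implementation, the passage to be repeated is a hypothetical template, the number of repetitions is taken to be the number of data rows supplied, and each row provides the data inserted into one repetition. The names `PASSAGE` and `repeat_passage` are invented for the example.

```python
from string import Template

# Hypothetical passage to be repeated once per party.
PASSAGE = Template('$index. $name of $address (the "$role").')


def repeat_passage(passage, rows):
    """Repeat the passage once per row, inserting that row's data each time."""
    return [passage.substitute(row, index=i)
            for i, row in enumerate(rows, start=1)]


# Hypothetical data: one row per repetition.
party_rows = [
    {"name": "Jane Citizen", "address": "12 Example Street", "role": "tenant"},
    {"name": "John Citizen", "address": "34 Sample Road", "role": "guarantor"},
]

party_clauses = repeat_passage(PASSAGE, party_rows)
```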
Existing document assembly products that work with source documents which are not XML based often encode the information described above directly in the source document. This is possible with XML source documents as well. The information could be stored as additional elements or attributes in the XML document itself. However, a serious difficulty with this approach is that the document will not be valid unless the grammar is altered to allow the inclusion of that information.
In an alternative approach, taught in U.S. Pat. No. 6,006,242 (Poole, et al. “Apparatus and method for dynamically creating a document”), entity references are embedded in a document instance, and a dedicated entity resolver is used during the document assembly process to replace the entity references with text particular to the instance document. One problem with this approach is that the source document will not validate against the original grammar unless the validating parser being used uses that dedicated entity resolver.
Because in each of these approaches the document no longer validates against the original grammar using a validating XML parser, the ability to manipulate the document with third-party XML-aware applications is significantly curtailed.
It is desired to provide a document assembly method and system, a method for generating a source document for a document assembly system, a source document for a document assembly system, a logic source for a document assembly system, and a grammar for a logic source for a document assembly system that ameliorate one or more of the above difficulties, or at least provide a useful alternative.