XML is now in widespread use as a storage or exchange format for documents and data. Such documents and data generally undergo changes and these changes need to be monitored and actions taken based on what changes have been made. There are many implementations of XML comparison technology including that described in EP 1325432 which is an earlier development by the same inventor, the content of which is incorporated by reference, and others including Altova Spy DiffDog™, Microsoft™ Diff and Patch, IBM Alphaworks™ XMLcmp™, Versim™. Generally these take two markup language documents and generate some representation of the differences between them, often referred to as a ‘delta’ file.
Another set of problems, specifically relating to representing changes between three or more documents, is addressed in EP 2174238, which is also an earlier development by the same inventor and the content of which is incorporated by reference. Alternative approaches to this set of problems generally involve the use of version control systems. Their focus is on compact storage of changes, and such mechanisms are typically not useful for processing changes. An earlier proposal by the inventor for a format for a multiple version document is DeltaXML Unified Delta™, described in “Russian Dolls and XML: Handling Multiple Versions of XML in XML” XML 2003, December, 2003, USA. This proposal was more suited to the processing of changes between versions, and a generic solution was proposed in “A Generalized Grammar for Three-way XML Synchronization” XML 2005, USA. However, a study of this will show that, although a generic solution is possible, the architecture and execution are complex. EP 2174238 provided an improvement on this previous work.
The approaches and systems identified above do not, though, enable structural differences between the markup language source files to be represented simply and easily in a delta file. A structural difference is a difference with respect to the element hierarchy, e.g. when a new element is introduced to wrap some existing content. Representing structural differences is a known issue and is referred to as the problem of “overlapping hierarchies”, i.e. when a single tree structure is insufficient to describe two or more ways of structuring some information.
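By way of illustration only, a structural difference of the kind described above can be sketched as follows. The two fragments, the element names and the helper functions are hypothetical and are not taken from any of the cited systems; the sketch merely shows that two versions may carry identical text while their element hierarchies differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical versions: the text is identical, but the second version
# introduces a new <emph> element that wraps some existing content.
V1 = "<para>The quick brown fox</para>"
V2 = "<para>The <emph>quick brown</emph> fox</para>"


def text_content(xml_source):
    """Concatenate all character data, ignoring the element hierarchy."""
    return "".join(ET.fromstring(xml_source).itertext())


def element_paths(xml_source):
    """Return the tag path of every element, in document order."""
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.append(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_source), "")
    return paths


same_text = text_content(V1) == text_content(V2)
same_structure = element_paths(V1) == element_paths(V2)
```

Here the text comparison succeeds while the structure comparison fails, which is the situation a delta file would ideally record as a change to structure only.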
In terms of the practical need for representing structural differences a good example is the requirement for The Stationery Office, UK to publish consolidated legislation. Jeni Tennison notes, in an article discussing these issues, the importance of consolidated legislation showing the places where ‘current’ legislation was amended over time from its original, enacted state and the fact that the authors of legislation care little for document structures, and amendments often overlap document structures such as paragraphs and list items, and each other.
In the above example of legal documents, it is changes to the actual text that are deemed to be most important, but the XML document format used to represent legislation would use the XML structure to represent the layout of the document, including sections, paragraphs, lists and tables. Although the delta format described in EP 2174238 would be capable of representing successive versions of such a legal document, it would in some cases be necessary to add and delete sections of content simply because of changes to the structure and layout. This is not ideal because the delta document will indicate changes to the legislative text when there may be no changes to the actual text, but only to the structure or layout.
The general problem of overlapping hierarchies has been discussed for some years. In an article about non-hierarchical structures (which includes overlapping hierarchies) issued by the Text Encoding Initiative (TEI) consortium it was noted that: “non-nesting information poses fundamental problems for any XML-based encoding scheme, and it must be stated at the outset that no current solution combines all the desirable attributes of formal simplicity, capacity to represent all occurring or imaginable kinds of structures, suitability for formal or mechanical validation.” (see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html)
A summary of the current approaches may be found in an article by Jeni Tennison at http://www.jenitennison.com/blog/taxonomy/term/9 and is presented below in Table 1. It should be noted, though, that recent work has focused on the problem of overlapping hierarchies within a single document, that is, overlapping hierarchies in a document where the content, i.e. the text, is unchanged. Existing work in this area has failed to develop a simple and easy way of representing changes to content in addition to overlapping hierarchies.
TABLE 1

Technique          Advantages                         Disadvantages
-----------------  ---------------------------------  ------------------------------------
Milestones         easy to see main structure         favours one main structure;
                                                      hard to identify content of
                                                      overlapping structures

Fragmentation      easy to see main structure;        favours one main structure;
                   easy to work out content of        leads to spurious containment;
                   overlapping structures             can lead to discontinuous elements

Flattened          all structures treated equally     hard to see any structure;
                                                      hard to process naturally using
                                                      XML tools

Stand-off          all structures treated equally     hard to see any structure;
                                                      hard to process naturally using
                                                      XML tools

Multiple document  easy to see individual structures  hard to edit without tools;
                                                      content gets repeated;
                                                      complex to align structures;
                                                      hard to do cross-hierarchy analysis

Milestones
The applicant has previously proposed a solution (see http://www.deltaxml.com/attachment/481-dxml/XML-change-tracking.pdf) which is geared towards change tracking. However, within this solution it is not possible to extract or process a particular version without working back from the latest version of the document through every intervening version. This is because each change is represented as an incremental ‘roll-back’ from the latest version of the document, and a roll-back can typically only be applied if all later roll-backs have been applied. Moreover, this approach suffers from the disadvantages of milestones noted in Table 1: it is good for representing tracked changes but poor for processing multiple hierarchies.
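The dependency between roll-backs may be illustrated by the following minimal sketch. It assumes, purely for illustration, that each roll-back is a simple text substitution; the data, the function name `extract_version` and the representation are hypothetical and do not reproduce the format of the cited solution.

```python
# Latest version of a hypothetical document.
LATEST = "The fee is 30 pounds."

# Hypothetical roll-backs, newest first. Each pair is
# (text in the later version, text in the earlier version).
ROLLBACKS = [
    ("30 pounds", "20 pounds"),  # latest version -> previous version
    ("20 pounds", "10 pounds"),  # previous version -> original version
]


def extract_version(latest, rollbacks, steps_back):
    """Recover an earlier version by applying roll-backs in order, newest first."""
    text = latest
    for later, earlier in rollbacks[:steps_back]:
        if later not in text:
            # The roll-back assumes that all later roll-backs were applied first.
            raise ValueError("roll-back cannot be applied: %r not found" % later)
        text = text.replace(later, earlier, 1)
    return text


previous = extract_version(LATEST, ROLLBACKS, 1)
original = extract_version(LATEST, ROLLBACKS, 2)
```

In this sketch the second roll-back cannot be applied directly to the latest version, because the text it expects only exists once the first roll-back has been applied.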
Fragmentation
In an article by Jeni Tennison (http://www.jenitennison.com/blog/node/98), labelling of elements is proposed to make extensible stylesheet language transformation (XSLT) processing simpler. However, the labelling is done as an automated process which needs to know which element name relates to which hierarchy, and the proposed method is limited. The method uses id attributes to indicate that one element is the same as another, i.e. that the ‘original’ element is a concatenation of all elements with the same id. Because only one id attribute is allowed on an element, this would not allow an element fragment to be, for example, part of the same element type but in two different versions.
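The use of a shared id to reassemble a fragmented element may be sketched as follows. The sample document, the element names and the function `join_fragments` are hypothetical and are given only to illustrate the general idea of concatenating fragments that carry the same id; they are not the method of the cited article.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a quotation overlaps two paragraphs, so it is
# split into two <q> fragments that share the id "q1".
DOC = """<doc>
  <p>He said <q id="q1">this is the start</q></p>
  <p><q id="q1">and this is the end</q> of it.</p>
</doc>"""


def join_fragments(xml_source, tag):
    """Concatenate the text of all fragments of `tag` that share an id."""
    pieces = {}
    for fragment in ET.fromstring(xml_source).iter(tag):
        text = "".join(fragment.itertext())
        pieces.setdefault(fragment.get("id"), []).append(text)
    return {key: " ".join(parts) for key, parts in pieces.items()}


quotations = join_fragments(DOC, "q")
```

Since each fragment carries a single id, it can be assigned to only one reassembled element, which is the limitation noted above.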
Stand-Off Markup
Stand-off markup, which is also known as remote markup or stand-off annotation, is the kind of markup that resides in a location different from the location of the data being described by it. It is thus the opposite of inline markup where data and annotations are intermingled within a single location [see http://wiki.tei-c.org/index.php/Stand-off markup]. Documenting changes to content as well as overlapping hierarchies is easier to do using stand-off markup because content fragments can be defined and then referenced from different hierarchies.
However, stand-off markup is not easy to process with XSLT because the hierarchies are separated and not related except by content of the leaf or text nodes. Therefore it is powerful in terms of the structures it can represent but weak in terms of how it represents the relationship between the structures.
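The following sketch illustrates the trade-off, under the assumption that stand-off annotations are held as character ranges over a single shared text. The text, the span names and the helper functions are hypothetical. Overlap between hierarchies is easy to represent, but the relationship between the hierarchies is not recorded anywhere and has to be computed by comparing offsets.

```python
from dataclasses import dataclass

# The content is stored once; the markup lives elsewhere and refers to it.
TEXT = "alpha beta gamma delta"


@dataclass(frozen=True)
class Span:
    name: str
    start: int  # inclusive character offset into TEXT
    end: int    # exclusive character offset into TEXT


# Two hypothetical hierarchies over the same text.
LAYOUT = [Span("para1", 0, 10), Span("para2", 11, 22)]
AMENDMENTS = [Span("amend1", 6, 16)]


def text_of(span):
    """Return the content that a span refers to."""
    return TEXT[span.start:span.end]


def contains(outer, inner):
    """True if `inner` lies wholly within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def crosses(a, b):
    """True if the spans overlap without either containing the other."""
    overlap = a.start < b.end and b.start < a.end
    return overlap and not contains(a, b) and not contains(b, a)


crossed = [p.name for p in LAYOUT if crosses(AMENDMENTS[0], p)]
```

In this sketch the amendment crosses both paragraphs, a fact that can be discovered only by offset arithmetic rather than by navigating a single tree.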
Multiple Document
This is similar, in some ways, to the delta format described in EP 2174238 in that it cannot show the relationship between the content of different element types; these need to be in separate document fragments. Therefore it is not good for overlapping hierarchies.