The use of XML to represent data is well established, particularly to transport data between databases. The use of XML to display data to end-users is growing rapidly, mainly due to the development of XSL (eXtensible Stylesheet Language) Transforms, which allow the XML data to be transformed into a format that is easy for a human to read. The target of an XSLT (XSL Transform) is usually HTML (HyperText Markup Language) or plain text. The XSLT can run as a server-side transform before the data are transmitted to the client but, more often, the XML data are transformed inside a client browser, which (a) reduces the amount of data transmitted over the network link and (b) off-loads processing work from the server.
The meaning of a particular piece of XML data is defined by the open and close tags which enclose it. The sequence <TAG>Data</TAG> can therefore be considered as a data block: <TAG> is the open tag and marks the start of a particular piece of XML data, “Data” is the data itself, and </TAG> is the close tag and marks the end of that piece of XML data. The data section of a data block may contain one or more further data blocks, meaning that data blocks can be nested to form a hierarchy. The physical order of the data blocks in the XML file is not usually meaningful; it is the hierarchical position of the data block (i.e. the list of all the tags which enclose it) which entirely defines the meaning of a particular piece of data. Data blocks in an XML file can therefore be reordered without changing the overall meaning of the file. This makes it very difficult for a human to compare two XML files and notice any differences, even if the XML has been transformed into something more user-friendly (such as HTML displayed in a browser).
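The idea that a data block's meaning is fixed by its enclosing tags, and not by its position in the file, can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `path_signature` and the sample documents are hypothetical, attributes and text that follows a closing tag are ignored, and repeated sibling blocks with the same tag are not kept apart. Each piece of data is keyed by the list of tags which enclose it, so two files whose blocks are merely reordered yield the same collection of keys.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_signature(xml_text):
    """Return a multiset of (enclosing-tag path, data) pairs for an XML document.

    Assumption for this sketch: only element text is considered; attributes
    and tail text are ignored.
    """
    root = ET.fromstring(xml_text)
    signature = Counter()

    def walk(element, path):
        # The hierarchical position is the list of all enclosing tags.
        path = path + (element.tag,)
        text = (element.text or "").strip()
        if text:
            signature[(path, text)] += 1
        for child in element:
            walk(child, path)

    walk(root, ())
    return signature


# Hypothetical sample data: doc_b reorders the blocks of doc_a,
# doc_c changes one piece of data.
doc_a = "<order><id>17</id><customer><name>Ann</name><city>Leeds</city></customer></order>"
doc_b = "<order><customer><city>Leeds</city><name>Ann</name></customer><id>17</id></order>"
doc_c = "<order><id>17</id><customer><name>Ann</name><city>York</city></customer></order>"

same_meaning = path_signature(doc_a) == path_signature(doc_b)
changed = path_signature(doc_a) - path_signature(doc_c)
```

In this sketch `same_meaning` is true even though the two texts differ character by character, while `changed` holds only the entry whose data differs between `doc_a` and `doc_c`.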
There are existing tools (such as DIFF) which compare two sets of flat text data and highlight any differences. DIFF identifies changes on a line-by-line basis, but lines are not usually significant in XML: an entire XML file might contain only a single line. Even after it has been transformed, the data might not contain multiple lines (if the target of the transform is HTML), or the transformed data might display multiple data elements per line (for example, a spreadsheet). Merely highlighting a changed line is therefore of little value. Furthermore, these existing tools highlight reordering as a change, so they produce false highlights when XML blocks are reordered but the actual data meaning is unchanged, which makes them of limited use on XML data (or transformed XML data). DIFF is also a computationally intensive operation, because it attempts to match up reordered lines and this requires intensive repeated comparison between the two versions of the data.
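Both shortcomings of a line-based comparison can be reproduced with Python's standard `difflib` module, used here as a stand-in for DIFF and not as the specific tool the text refers to. The helper `changed_lines` and the sample data are hypothetical. The first comparison reorders two blocks without changing any data; the second changes one value in a file that consists of a single line.

```python
import difflib


def changed_lines(old, new):
    """Return the lines that a line-based diff marks as removed or added."""
    return [
        line
        for line in difflib.ndiff(old, new)
        if line.startswith(("- ", "+ "))
    ]


# Case 1: the same blocks in a different order (meaning unchanged).
version_1 = ["<order>", "<id>17</id>", "<name>Ann</name>", "</order>"]
version_2 = ["<order>", "<name>Ann</name>", "<id>17</id>", "</order>"]
reorder_report = changed_lines(version_1, version_2)

# Case 2: the whole file is one line and a single value changes.
single_old = ["<order><id>17</id><name>Ann</name></order>"]
single_new = ["<order><id>18</id><name>Ann</name></order>"]
single_line_report = changed_lines(single_old, single_new)
```

In the first case the report is not empty although every flagged line is present in both versions, which is the false highlight described above. In the second case the whole line is reported as removed and re-added, so the report gives no indication of which data block actually changed.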
It would be desirable to have a simple method and apparatus for identifying an update between two versions of a data file having a plurality of blocks of data, the meaning of the data file being insensitive to the ordering of the blocks of data within the data file.