1. Field of the Invention
The present invention relates to an efficient mechanism for the differentiation and update of data structured in tree format. The invention has particular application to version management of tree structured data but the tree differentiation process according to the invention can differentiate any two trees, regardless of whether they are successive versions or not.
2. Background Description
The information used by computer programs can be represented in multiple formats. There is, however, a growing trend towards moving much of this data into a standardized tree-structured format. The XML (eXtensible Markup Language) and DOM (Document Object Model) standard proposals are two leading efforts in this trend. The prospect of having a preferred data representation format raises the question of the adequacy of the existing mechanisms of data management.
Consider the case of information (source code, textual data, etc.) that varies over time. Version management is then a critical issue. At the core of the version management issue lies the following problem: given two successive states of the information, it is necessary to be able to describe the informational difference between these two versions, and to bring older versions up to date with newer ones.
In the current approach, version management is achieved by managing external representations of the data, rather than the data itself. The information is first converted into some conventional representation, typically a sequence of text lines or bytes, which is then processed. The internal structure of the data is irrelevant in this model; rather, all information is treated as a more or less unstructured sequence of tokens.
While sufficient for many computational purposes, this model has two major drawbacks. The first one arises from the intrinsic ambiguity of the conversion to an external format: the same data can have multiple external representations in a given format. As a result, irrelevant difference reports ("false negatives") are often generated.
Second, and more importantly, the way differences are reported (e.g., in terms of byte or line mismatches) bears no relation to the intrinsic structure of the data, and requires an additional "interpretative" step to infer the actual informational difference.
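The first drawback can be illustrated with a minimal sketch. It assumes XML as the external format and uses two hypothetical serializations, `DOC_A` and `DOC_B`, that differ only in attribute order and indentation. The helper names (`line_diff`, `canonical`, `same_tree`) are illustrative and not part of the invention, and the sketch assumes that leading and trailing whitespace in element text is insignificant.

```python
import difflib
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same tree:
# only attribute order and indentation differ.
DOC_A = '<item id="1" lang="en"><name>bolt</name></item>'
DOC_B = '<item lang="en" id="1">\n  <name>bolt</name>\n</item>'


def line_diff(a, b):
    """Return the added/removed lines reported by a line-oriented comparison."""
    delta = difflib.ndiff(a.splitlines(), b.splitlines())
    return [d for d in delta if d.startswith(("- ", "+ "))]


def canonical(elem):
    """Reduce an element to a nested tuple that ignores attribute order
    and surrounding whitespace (assumed insignificant in this sketch)."""
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    children = tuple(canonical(child) for child in elem)
    return (elem.tag, attrs, text, children)


def same_tree(a, b):
    """Compare two serializations by their parsed tree structure."""
    return canonical(ET.fromstring(a)) == canonical(ET.fromstring(b))


if __name__ == "__main__":
    print("line-level changes reported:", len(line_diff(DOC_A, DOC_B)))
    print("trees are identical:", same_tree(DOC_A, DOC_B))
```

The line-oriented comparison reports changes even though the two documents encode the same tree, whereas the comparison on the parsed structure finds no difference.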
It would seem, then, that there are substantial advantages in performing version management directly on the internal representation of the data. When tree structured data is considered, however, there is one major obstacle: the high computational cost of the tree differentiation algorithms. As an indication of this, consider the cost of the optimal tree differencing algorithm: for labeled ordered trees, the cost is no less than

$$\text{Nodes}(\text{tree 1}) \times \text{leaves}(\text{tree 1}) \times \text{Nodes}(\text{tree 2}) \times \text{leaves}(\text{tree 2}),$$

that is, the cost increases at least quadratically when the size of the tree increases. For unordered labeled trees the situation is still worse. Optimal differentiation has exponential cost in the worst case,

$$\text{Nodes}(\text{tree 1}) \times \text{Nodes}(\text{tree 2}) + [\text{leaves}(\text{tree 1})]! \times 3^{\text{leaves}(\text{tree 1})} \times \text{Nodes}(\text{tree 2}) \times \ldots$$
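To give a sense of scale, the following sketch evaluates the cost expressions above for sample tree sizes. The function names and the sample sizes are assumptions chosen for illustration. Because the expression for unordered trees is truncated in the text, only its visible factorial and exponential factor in the leaves of tree 1 is evaluated, not the full expression.

```python
from math import factorial


def ordered_cost(nodes1, leaves1, nodes2, leaves2):
    """Product Nodes(tree 1) x leaves(tree 1) x Nodes(tree 2) x leaves(tree 2)."""
    return nodes1 * leaves1 * nodes2 * leaves2


def unordered_leaf_factor(leaves1):
    """Only the visible factor [leaves(tree 1)]! x 3**leaves(tree 1);
    the remaining factors are elided in the text and are not modeled here."""
    return factorial(leaves1) * 3 ** leaves1


if __name__ == "__main__":
    # Hypothetical sizes: (nodes, leaves) used for both trees.
    for nodes, leaves in [(100, 50), (1000, 500)]:
        print(nodes, leaves, ordered_cost(nodes, leaves, nodes, leaves))
    for leaves in (5, 10, 20):
        print(leaves, unordered_leaf_factor(leaves))
```

Scaling every node and leaf count by a factor of 10 multiplies the ordered-tree product by 10,000, while the factorial factor for unordered trees grows far faster even for a few dozen leaves.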
The consequence is that tree differencing algorithms become impractical for most realistic problems as soon as the size of the trees involved starts to grow, unless large amounts of time and computing power are available.