The present invention relates to computer software, and in particular, to an apparatus and method for comparing computer documents using tree structures.
Computer software documents may often be organized hierarchically. For example, FIG. 1 illustrates an XML document with subsections organized as hierarchical units. A typical XML file may include a main section 101 separated by start and end indicators 101A and 101B, which in XML are referred to as start-tags and end-tags. As used herein, the term “subsection” in the context of a program means portions of a computer program having start and end indicators between the main start and end indicators of the main section (i.e., the root) of the program. The main section 101 and subsections may include content (values) and other subsections. In XML, for example, all of the information from the start-tag to the end-tag is referred to as an XML element. In this example, section 101 (“<a>” to “</a>”) includes a subsection 102 (“<b>” to “</b>”), including start 102A and end 102B, and subsection 105 (“<c>” to “</c>”), which includes start 105A and end 105B. Subsection 102 may, in turn, include content or values and more subsections 103 (“<d/>”) and 104 (“<e/>”). Similarly, subsection 105 may include content or values and more subsections, such as subsection 106 (“<o/>”). From the above example it is clear that programming structures, such as XML documents, may be structured hierarchically in an almost unlimited number of configurations.
FIG. 2 illustrates a tree representation of the example program in FIG. 1. Here, the hierarchy of the XML document has been represented as a tree, with each node of the tree corresponding to a subsection of the program and the branches between the nodes representing parent-child (ancestor-descendant) relationships. In this example, node A 201 corresponds to the main section 101 (i.e., the root) of the XML document. Node A may include a value corresponding to the content between start 101A and end 101B. The value of a node is limited to that node and does not include anything from its sub-nodes or parent nodes. Similarly, node B 202 corresponds to subsection 102 of the XML document. Node B may include a value corresponding to the content between start 102A and end 102B. Likewise, node C 205 corresponds to subsection 105 of the XML document. Node C may include a value corresponding to the content between start 105A and end 105B. Finally, nodes D, E, and O (203, 204, and 206) correspond to subsections 103, 104, and 106, respectively, and may each include content.
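By way of illustration only, the mapping from the XML document of FIG. 1 to the tree of FIG. 2 might be sketched as follows. The sketch assumes Python and its standard xml.etree.ElementTree parser; the names Node, build_tree, outline, and FIG1_XML are hypothetical and are not part of the described apparatus. Each node keeps only its own value, excluding the content of its sub-nodes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: one per XML element (subsection)."""
    name: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def build_tree(elem: ET.Element) -> Node:
    # A node's own value is the text directly inside the element:
    # the text before its first child plus the text following each child.
    # Text inside the children themselves is excluded.
    parts = [elem.text or ""] + [child.tail or "" for child in elem]
    value = "".join(parts).strip()
    return Node(elem.tag, value, [build_tree(child) for child in elem])


def outline(node: Node, depth: int = 0) -> List[str]:
    # Render the hierarchy as indented lines, one per node, in document order.
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Assumed text of the document in FIG. 1 (element names taken from the description).
FIG1_XML = "<a><b><d/><e/></b><c><o/></c></a>"

root = build_tree(ET.fromstring(FIG1_XML))
print("\n".join(outline(root)))
```

In this sketch the root node corresponds to node A 201, its children to nodes B 202 and C 205, and the leaves to nodes D, E, and O.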
For numerous reasons, it is sometimes desirable to modify computer documents. For example, in many cases, computer documents may be modified to match an existing document. To modify a document to match the structure of an existing document, it is typically necessary to compare the structures of the two documents, delete certain subsections from the document being modified, and add certain subsections to the document being modified. However, analyzing and comparing document structures can be complex and time-consuming.
FIG. 3 illustrates comparison and modification of tree-like structures for the example document in FIG. 1. In this example, two computer document tree structures are compared. Tree 300A may be the old structure, and tree 300B may be the desired new structure. In this simple example, it can be seen that the transformation of tree 300A into tree 300B requires the deletion of nodes X, Y, Z at 301, node D at 302, and node O at 303. Further, node F at 304, nodes D, I, E at 305, node B at 306, node H at 307, node D at 308, and node P at 309 must be inserted. Finally, child nodes of a common parent may have specific positions. Referring to FIG. 1, a document, such as a program, may include subsections in a specific order. For example, in FIG. 1 subsection D 103 precedes subsection E 104 in the document. In FIG. 3, node D at 302 in tree 300A has a specific order relative to node E. This order is different from the order of node D at 308 and a corresponding node E in tree 300B. From the example in FIG. 3, it can be seen that an analysis of the tree leads to three subsection (or subtree) deletions and six subsection insertions. Ordering of the document tree structures must also be accounted for. As computer documents grow larger and more complex, performing such a comparison analysis and modification automatically becomes correspondingly more difficult.
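To make the kind of analysis discussed above concrete, the following is a naive baseline sketch, offered for illustration only and not as the technique of the present invention. It assumes Python, represents each tree as a (name, children) pair, and aligns the ordered child names of corresponding nodes with the standard difflib.SequenceMatcher class. The names tree and diff_trees, and the small example trees, are hypothetical and do not reproduce trees 300A and 300B.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Tree = Tuple[str, list]  # (node name, ordered list of child trees)


def tree(name: str, *children: Tree) -> Tree:
    # Hypothetical helper for writing small trees inline.
    return (name, list(children))


def diff_trees(old: Tree, new: Tree, path: str = "") -> List[Tuple[str, str]]:
    """List ('delete' | 'insert', path) operations that turn old into new.

    Assumes the two roots correspond to each other. Children are matched
    by name and in order, so a reordered child is reported as one
    deletion plus one insertion.
    """
    ops = []
    here = f"{path}/{old[0]}"
    old_kids, new_kids = old[1], new[1]
    matcher = SequenceMatcher(
        a=[kid[0] for kid in old_kids],
        b=[kid[0] for kid in new_kids],
        autojunk=False,
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same names in the same order: compare the subtrees recursively.
            for o, n in zip(old_kids[i1:i2], new_kids[j1:j2]):
                ops.extend(diff_trees(o, n, here))
        else:
            # 'delete', 'insert', or 'replace' (treated as delete then insert).
            for o in old_kids[i1:i2]:
                ops.append(("delete", f"{here}/{o[0]}"))
            for n in new_kids[j1:j2]:
                ops.append(("insert", f"{here}/{n[0]}"))
    return ops


# Hypothetical example: D and E swap order under B, and O is replaced by P under C.
old_tree = tree("a", tree("b", tree("d"), tree("e")), tree("c", tree("o")))
new_tree = tree("a", tree("b", tree("e"), tree("d")), tree("c", tree("p")))

for op, where in diff_trees(old_tree, new_tree):
    print(op, where)
```

Even in this small case, a change in sibling order surfaces as a deletion paired with an insertion. A baseline of this kind also deletes and reinserts whole subtrees whenever names fail to align, which suggests why larger documents call for more careful comparison techniques.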
Thus, there is a need for improved comparison and modification techniques. The present invention solves these and other problems by providing an apparatus and method for comparing hierarchical computer artifacts that can be represented as tree-like structures.