The need to identify the parent of a sub-fragment arises frequently in a variety of fields ranging from engineering and medicine to electronics, computer science, physics, chemistry and biology. In most cases the problem is computationally intractable, since the number of variables introduced by the degrees of uncertainty renders the calculations impossible. This situation is aggravated because the integrity of the sub-fragment is itself often compromised in some way.
Typically, when the pattern to be recognized is inherently a “two-dimensional” structure, it cannot be adequately represented using a one-dimensional (string or circular string) approximation. By representing the pattern as a tree and by utilizing tree comparison algorithms one can, generally speaking, achieve excellent recognition strategies. Indeed, such schemes have been utilized in Pattern Recognition (PR) in areas such as clustering by Lu (IEEE Trans. Pattern Anal. and Mach. Intell., PAMI 1, pp. 219-224 (1979)) and waveform correlation by Cheng and Lu (IEEE Trans. PAMI, PAMI 7, pp. 299-305 (1985)). However, when the pattern to be recognized is occluded and only noisy information about a fragment of the pattern is available, the problem encountered can be mapped onto that of recognizing a tree by processing the information in one of its noisy subtrees or subsequence trees.
Trees are a fundamental data structure in computer science. A tree is, in general, a structure which stores data, and it consists of atomic components called nodes and branches. The nodes have values which relate to data from the real world, and the branches (also called edges) connect the nodes so as to denote the relationship between the pieces of data resident in the nodes. By definition, no set of branches of a tree constitutes a closed path or cycle. Every tree has a unique node called the “root”. The branch from a node toward the root points to the “parent” of the said node. Similarly, a branch from the node away from the root points to a “child” of the said node. The tree is said to be ordered if there is a left-to-right ordering for the children of every node.
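The terminology above can be illustrated with a minimal sketch of an ordered, rooted tree. The names `Node`, `add_child`, `preorder` and `root_of` are hypothetical and chosen only for this illustration; they do not come from any of the cited works.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One node of an ordered, rooted tree (illustrative only)."""
    value: str
    # Children are kept in a list, so their left-to-right order is preserved.
    children: List["Node"] = field(default_factory=list)
    # The parent link is excluded from repr/eq to avoid infinite recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach `child` as the rightmost child of this node."""
        child.parent = self
        self.children.append(child)
        return child


def preorder(node: Node) -> Iterator[str]:
    """Visit a node first, then its children from left to right."""
    yield node.value
    for child in node.children:
        yield from preorder(child)


def root_of(node: Node) -> Node:
    """Follow parent branches until the unique root is reached."""
    while node.parent is not None:
        node = node.parent
    return node
```

Because each node has at most one parent and the parent links always lead toward the root, a structure built only through `add_child` on fresh nodes contains no cycle.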
Trees have numerous applications in various fields of computer science including artificial intelligence, data modeling, pattern recognition, and expert systems. In all of these fields, the tree structures are processed by using operations such as deleting their nodes, inserting nodes, substituting node values, pruning sub-trees from the trees, and traversing the nodes in the trees. When more than one tree is involved, the operations generally utilized involve the merging of trees and the splitting of trees into multiple subtrees. In many of the applications which deal with multiple trees, the fundamental problem is that of comparing them.
Trees, graphs, and webs are typically considered to be multidimensional generalizations of strings. Among these different structures, trees are considered to be the most important “nonlinear” structures in computer science, and the tree-editing problem has been studied since 1976. Similar to the string-editing problem (see: D. Sankoff and J. B. Kruskal, Time warps, string edits, and macromolecules: Theory and practice of sequence comparison, Addison-Wesley (1983); R. A. Wagner and M. J. Fischer, J. Assoc. Comput. Mach., 21:168-173, (1974); B. J. Oommen and R. L. Kashyap, Pattern Recognition, 31, pp. 1159-1177 (1998); P. A. V. Hall and G. R. Dowling, Comput. Surv., 12: pp 381-402 (1980); R. L. Kashyap and B. J. Oommen, Intern. J. Computer Math., 13: pp 17-40 (1983); R. Lowrance and R. A. Wagner, J. ACM, 22: pp 177-183 (1975)), the tree-editing problem concerns the determination of the distance between two trees as measured by the minimum cost sequence of edit operations. Typically, the edit sequence considered includes the substitution, insertion, and deletion of nodes needed to transform one tree into the other.
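For the string case, the notion of a minimum cost edit sequence can be made concrete with the classical dynamic-programming recurrence. The following is a minimal sketch, assuming constant costs for substitution, insertion and deletion (the default of 1 for each is an assumption of this illustration); the function name `string_edit_distance` is hypothetical.

```python
def string_edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum total cost of substitutions, insertions and deletions
    needed to transform string x into string y."""
    m, n = len(x), len(y)
    # d[i][j] holds the distance between x[:i] and y[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost      # delete every symbol of x[:i]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost      # insert every symbol of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = x[i - 1] == y[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                          # delete x[i-1]
                d[i][j - 1] + ins_cost,                          # insert y[j-1]
                d[i - 1][j - 1] + (0 if same else sub_cost),     # match or substitute
            )
    return d[m][n]
```

With unit costs, `string_edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion). Tree-editing algorithms generalize this idea by letting the edit operations act on nodes of a tree rather than on symbols of a string.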
Unlike the string-editing problem, only a few results have been published concerning the tree-editing problem. In 1977, Selkow (Inform. Process. Letters, 6(6):184-186, (1977)) (see also D. Sankoff and J. B. Kruskal, Time warps, string edits, and macromolecules: Theory and practice of sequence comparison, Addison-Wesley (1983)) presented a tree-editing algorithm in which insertions and deletions were restricted to the leaves. Tai (J. Assoc. Comput. Mach., 26:422-433 (1979)) in 1979 presented another algorithm in which insertions and deletions could take place at any node within the tree except the root. The algorithm of Lu (IEEE Trans. Pattern Anal. and Mach. Intell., PAMI 1(2):219-224 (1979)), on the other hand, did not solve this problem for trees of more than two levels. The best known algorithm for solving the general tree-editing problem is the one due to Zhang and Shasha (SIAM J. Comput., 18(6):1245-1262 (1989)). Also, in all the papers published until the mid-1990s, the literature primarily contains only one numeric inter-tree dissimilarity measure: the pairwise “distance” measured by the minimum cost edit sequence. The literature on the comparison of trees is otherwise scanty: Shapiro and Zhang (Comput. Appl. Biosci. vol. 6, no. 4, 309-318, (1990)) have suggested how tree comparison can be done for ordered and unordered labeled trees using tree alignment as opposed to the edit distance utilized elsewhere (Zhang and Shasha (1989) supra). The question of comparing trees with variable length don't-care edit operations was also solved by Zhang, Shasha and Wang (Proceedings of the 1992 Symposium on Combinatorial Pattern Matching, CPM92:148-161, (1992)). Otherwise, the results concerning unordered trees are primarily complexity results: Zhang, et al., (Information Processing Letters, 42:133-139, (1992)) showed that editing unordered trees with bounded degrees is NP-hard, and Zhang and T. Jiang (Information Processing Letters, 49:249-254 (1994)) showed that it is even MAX SNP-hard.
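To illustrate what restricting insertions and deletions to the leaves implies, the following is a minimal sketch in the spirit of the leaf-restricted model described above. It is not reproduced from the cited paper. It assumes unit costs, ordered trees, and a hypothetical representation of a tree as a `(label, children)` pair; the names `size` and `leaf_restricted_distance` are likewise hypothetical. Under these assumptions the two roots are always mapped to each other, and removing an interior node requires removing its whole subtree leaf by leaf, so deleting or inserting a subtree costs its number of nodes.

```python
def size(tree):
    """Number of nodes in a tree given as (label, [child, ...])."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def leaf_restricted_distance(t1, t2):
    """Unit-cost distance between two ordered trees when nodes may be
    relabelled anywhere but inserted or deleted only as leaves."""
    (label1, kids1), (label2, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    # d[i][j]: cost of turning the first i child subtrees of t1
    # into the first j child subtrees of t2.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])    # delete whole subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])    # insert whole subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),
                d[i][j - 1] + size(kids2[j - 1]),
                # Map the two child subtrees onto each other recursively.
                d[i - 1][j - 1]
                + leaf_restricted_distance(kids1[i - 1], kids2[j - 1]),
            )
    relabel = 0 if label1 == label2 else 1
    return relabel + d[m][n]
```

For example, with `t1 = ("a", [("b", []), ("c", [("d", [])])])` and `t2 = ("a", [("b", []), ("c", [])])`, the function returns 1, corresponding to the deletion of the single leaf `d`. Algorithms that allow insertions and deletions at interior nodes can yield smaller distances than this restricted model.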
The most recent results concerning tree comparisons are probably the ones due to Oommen, Zhang and Lee (IEEE Transactions on Computers, TC-45:1426-1434, (1996)). In this publication, the authors defined and formulated an abstract measure of comparison, Ω(T1, T2), between two trees T1 and T2, presented in terms of a set of elementary inter-symbol measures ω(.,.) and two abstract operators. By appropriately choosing the concrete values for these two operators and for ω(.,.), the measure Ω was used to define various numeric quantities between T1 and T2 including (i) the edit distance between the two trees, (ii) the size of their largest common sub-tree, (iii) Prob(T2|T1), the probability of receiving T2 given that T1 was transmitted across a channel causing independent substitution and deletion errors, and (iv) the a posteriori probability of T1 being the transmitted tree given that T2 is the received tree containing independent substitution, insertion and deletion errors.
Unlike the generalized tree-editing problem, the problem of comparing a tree with one of its possible subtrees or Subsequence Trees (SuTs) has hardly been studied in the literature. The only reported results for comparing trees in this setting have involved constrained tree distances and are due to Oommen and Lee (Information Sciences, Vol. 77 No. 3,4:253-273 (1994)) and Zhang (Proceedings of the IASTED International Symposium, New York, pp. 92-95 (1990)).