1. Field of the Invention
The present invention is directed generally to a method of structuring and compressing labeled trees of arbitrary degree and shape for optimal succinctness. The method includes a transform for compressing and indexing tree-shaped data. More specifically, the present invention includes a transform that uses path sorting and grouping to linearize labeled tree-shaped data into two coordinated arrays, one capturing the structure of the tree and the other capturing the labels of the tree. The present invention may also include performing navigational operations on the labeled tree-shaped data after the data has been transformed and possibly compressed.
2. Background of the Technology
Labeled trees are used for representing data or computation in computer applications including applications of tries, dictionaries, parse trees, suffix trees, and pixel trees. Trees are also used in compiler intermediate representations, execution traces, and mathematical proofs. XML also uses a tree representation of data where each node has string labels.
Consider a rooted, ordered, static tree data structure T on t nodes, where each node u has a label drawn from the alphabet Σ. The children of node u are ranked, that is, they have a left-to-right order. Tree T may be of arbitrary degree and of arbitrary shape. Basic navigational operations are important on tree data structures, such as finding the parent of u (denoted parent(u)), the ith child of u (denoted child(u, i)), and any child of u with label α (denoted child(u, α)).
An early solution for navigational operations was to represent the tree using a mixture of pointers and arrays, using a total of O(t) RAM words each of size O(log t). This representation trivially supports such navigational operations in O(1) time but takes a total of O(t log t) bits. Such pointer-based tree representations are therefore wasteful in space.
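By way of illustration only, the pointer-based representation and the three navigational operations described above may be sketched as follows. The names Node, add_child and child_by_label are hypothetical and are not taken from any cited reference. The labeled lookup shown here is a simple linear scan over the children; a per-node lookup table would be one way to approach the O(1) behavior described above, at a further cost in space.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    """One node of a labeled, ordered tree stored with explicit pointers."""
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def add_child(u: Node, label: str) -> Node:
    """Append a new rightmost child of u (hypothetical helper for building trees)."""
    v = Node(label, parent=u)
    u.children.append(v)
    return v


def parent(u: Node) -> Optional[Node]:
    """parent(u): the parent of u, or None if u is the root."""
    return u.parent


def child(u: Node, i: int) -> Optional[Node]:
    """child(u, i): the ith child of u in left-to-right order (1-based), or None."""
    if 1 <= i <= len(u.children):
        return u.children[i - 1]
    return None


def child_by_label(u: Node, a: str) -> Optional[Node]:
    """child(u, a): some child of u labeled a, or None (linear scan in this sketch)."""
    for v in u.children:
        if v.label == a:
            return v
    return None
```

Each node in this sketch carries a parent pointer and a list of child pointers, each of which occupies at least one machine word, which is the source of the O(t log t) bit total noted above.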
Jacobson introduced the notion of succinct data structures, that is, data structures that use space close to their information-theoretic lower bound and yet support various operations efficiently. Succinct data structures are distinct from simply compressing the input to be uncompressed later. See G. Jacobson, Space-efficient Static Trees and Graphs, FOCS 1989, 549-554, the contents of which are incorporated herein by reference.
Jacobson initiated this area of research with the special case of unlabeled trees, considering the structure of the trees but not the labels. Jacobson presented a storage scheme that uses 2t+o(t) bits while supporting the navigation operations in O(1) time. This method is also asymptotically optimal (up to lower order terms) in storage space, given the lower bound of 2t−Θ(log t) bits for the storage complexity of binary (unlabeled) trees [see Jacobson 1989].
Munro and Raman extended the method of Jacobson with more efficient operations as well as a richer set of operations, including sub-tree size queries. See I. Munro and V. Raman, Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs, IEEE FOCS 1997, 118-126, herein incorporated by reference.
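To indicate where a figure of 2t bits for the structure of an unlabeled tree may come from, the following is a minimal sketch of a balanced-parentheses encoding, in which each node contributes one opening and one closing parenthesis. It is an illustration under stated assumptions and not the method of any cited reference: the nested-list input format and the function names encode_bp, count_nodes and enclosing_open are hypothetical, and the parent search shown is a naive linear scan rather than the O(1) operation that succinct structures obtain with o(t) bits of auxiliary index.

```python
def encode_bp(children):
    """Encode a tree as balanced parentheses.

    A tree is given as the list of its root's child subtrees, each of which
    is itself such a list; a leaf is the empty list.
    """
    return "(" + "".join(encode_bp(c) for c in children) + ")"


def count_nodes(children):
    """Number of nodes t in a tree given in the nested-list form."""
    return 1 + sum(count_nodes(c) for c in children)


def enclosing_open(bp, i):
    """Naive parent query: position of the opening parenthesis that encloses
    the node opened at position i, or -1 if that node is the root."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if bp[j] == ")":
            depth += 1          # entering a sibling subtree to the left
        elif depth == 0:
            return j            # unmatched "(" is the parent
        else:
            depth -= 1          # leaving that sibling subtree
    return -1


# Root with two children; the first child has two leaf children (t = 5).
tree = [[[], []], []]
bp = encode_bp(tree)            # "((()())())", i.e. 2t = 10 symbols
```

Reading each parenthesis as one bit gives exactly 2t bits for the structure, with no space for labels.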
Other known practices have further generalized these teachings to trees with higher degrees and ever richer sets of operations, such as level-ancestor queries. Succinct representations have been invented for other data structures including arrays, dictionaries, strings, graphs, and multisets.
However, each of these practices deals with unlabeled trees. The fundamental problem of structuring labeled trees succinctly has remained unsolved, even though labeled trees arise frequently in practice. Classical applications of trees in Computer Science, whether for representing data or computation, typically generate navigational problems on labeled trees.
The information-theoretic lower bound for storing labeled trees is 2t+t log |Σ| bits, where the first term follows from the cost of storing the structure of the tree [see Jacobson, 1989] and the second term is the lower bound, known to anyone with ordinary skill in the art, for the storage complexity of the node labels. For example, see R. F. Geary, R. Raman and V. Raman, Succinct Ordinal Trees with Level-Ancestor Queries, in Proc. 15th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1-10, 2004.
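As a worked illustration of this bound, the following sketch evaluates 2t+t log |Σ| for a sample tree and compares it with a rough estimate for a pointer-based representation. The logarithm is assumed to be base 2, and the figure of three words per node (for example, parent, first-child and next-sibling pointers) is an assumption made only for this comparison; the function names are hypothetical.

```python
import math


def labeled_tree_lower_bound_bits(t: int, sigma_size: int) -> float:
    """2t bits for the structure plus t * log2(|Sigma|) bits for the labels."""
    return 2 * t + t * math.log2(sigma_size)


def pointer_representation_bits(t: int, words_per_node: int = 3) -> int:
    """Rough pointer-based cost: words_per_node words of ceil(log2 t) bits
    per node. The default of three words per node is an assumed figure."""
    word_bits = math.ceil(math.log2(t))
    return t * words_per_node * word_bits


t, sigma_size = 1_000_000, 256
lower = labeled_tree_lower_bound_bits(t, sigma_size)
pointers = pointer_representation_bits(t)
print(lower, pointers)
```

For one million nodes and 256 possible labels, the bound evaluates to 10 million bits, while the pointer estimate under the three-word assumption is 60 million bits, before any label storage is counted.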
Therefore, as labeled trees arise frequently in practice, there is a need in the art for a method allowing the succinct representation and efficient navigation of labeled trees.