Data retrieval refers to obtaining data from a database by using a database management system (DBMS). Different forms of data retrieval technologies are suitable for different types of applications and some forms of data storage are particularly suitable for implementing certain specific tasks.
For example, a tree data structure is a commonly-used form of non-linear data retrieval technique. A tree is represented as a set of linked nodes to describe a hierarchical data set.
The following terms are generally used in the tree structure.
Root: The top node in a tree.
Child: A node directly connected to another node when moving away from the Root.
Parent: The converse notion of a child.
Descendant: A node reachable by repeated proceeding from parent to child.
Ancestor: A node reachable by repeated proceeding from child to parent.
Leaf/External Node: A node with no children.
Internal node: A node with at least one child.
Edge: Connection from one node to another. Under certain circumstances, an edge corresponds to a condition (e.g. an input key), which triggers a state transition from one node to another node.
Subtree: For a node in a tree, a subtree is formed by the node and all of its descendants.
On the basis of tree data retrieval, recent developments in the area of data compression are described as follows.
Trie
A trie, also called a prefix tree, is one type of tree data structure. In a trie, the string associated with a node is a common prefix of the strings associated with all of its descendants. The root is associated with an empty string. Furthermore, no node in a trie stores the key associated with that node; instead, the position of a node in the trie defines the key with which the node is associated. In addition, each node other than the root corresponds to the edge linked from its parent node.
Typically, a regular trie has the following basic characteristics: 1) each node, except the root, corresponds to an edge; 2) a character associated with a node is labeled on the corresponding edge of the node, and the sequence of characters labeled on the path from the root to a node forms a string (i.e. key) associated with the node; and 3) all children of a node are associated with different characters.
FIG. 1 depicts an example of a trie. In this example, the trie stores seven data strings: “abc”, “abcd”, “abd”, “b”, “bcd”, “efg”, and “hii”. In FIG. 1, a double circle node shown in the trie denotes that the node is a final state, which means the string associated with the node is a valid member of the stored set of strings, and a single circle node shown in the trie denotes that the node is not a final state.
The following basic operations are typically supported in a trie.
Forward Search: this operation determines whether a known string exists in a trie. Forward search starts from the root and looks for an edge associated with the first character of the known string; if such an edge exists, the search follows it to the respective subtree. The second character of the string is then searched for in that subtree, and the search continues by selecting the edge associated with the second character and going to another subtree. This step is performed iteratively until all characters of the string have been matched at a certain node, at which point information on this node may be retrieved to complete the forward search; if at any step no matching edge is found, the forward search ends without finding the known string.
Reverse Search: this operation retrieves a string of which the location in the trie is known. Reverse search starts from any node in a trie and the known path from the starting node to the root is traversed to obtain a corresponding string. The obtained string is reversed so that the string matching the known path is retrieved.
Insertion: first, a location in a trie to insert a node is determined by performing forward search based on the string to be inserted to the trie. If the complete path corresponding to the string to be inserted does not exist in the trie, one or more new nodes corresponding to the remaining characters are created at the determined location in the trie; if the complete path already exists in the trie, the corresponding node is set as being a final state, i.e. the string associated with the node exists in the trie.
Deletion: first, a location in a trie to delete a node is determined based on the string to be deleted from the trie. If the determined node is a leaf, the leaf is removed, and the removal is repeated on its parent for as long as the parent is left with no children and is not a final state, until all nodes used only by the to-be-deleted string are removed; if the determined node is an internal node, this node is set as being a non-final state, i.e. the string associated with the node does not exist in the trie anymore.
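The forward search, insertion, and deletion operations described above can be sketched in a few lines of Python. This is a minimal pointer-based illustration only; the names TrieNode, Trie, insert, search, and delete are hypothetical and not taken from the source, and the strings are those of FIG. 1.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge character -> child node
        self.final = False   # True if the path from the root spells a stored string


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root is associated with the empty string

    def insert(self, key):
        node = self.root
        for ch in key:
            # Follow the existing edge, or create a node for a remaining character.
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.final = True  # mark the node as a final state

    def search(self, key):
        """Forward search: True if key is stored in the trie."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.final

    def delete(self, key):
        """Remove key; return False if it was not stored."""
        path = [(None, self.root)]
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
            path.append((ch, node))
        if not node.final:
            return False
        node.final = False
        # Prune nodes, bottom-up, that no longer lead to any stored string.
        for i in range(len(path) - 1, 0, -1):
            ch, current = path[i]
            if current.children or current.final:
                break
            del path[i - 1][1].children[ch]
        return True


trie = Trie()
for word in ["abc", "abcd", "abd", "b", "bcd", "efg", "hii"]:
    trie.insert(word)
```

Note that "ab" is a path in this trie but not a final state, so a forward search for it fails even though every character is matched.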
It should be noted that, in traditional tries represented with pointers, only pointers pointing from parent nodes to child nodes are stored, so that only forward search is supported in such a trie. If reverse search is to be supported in this type of pointer-based trie, an additional set of pointers pointing from child nodes to parent nodes needs to be stored and, thus, extra storage space has to be used.
A trie may be used to replace other forms of data storage such as a hash table. A common application of a trie is storing texts or a dictionary, such as that found on a mobile telephone. A trie uses common prefixes among strings to reduce data redundancy and to avoid unnecessary comparisons in various operations. However, a trie becomes inefficient when it is used to store a large number of strings that share few or no common prefixes.
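The cost of supporting reverse search in a pointer-based trie can be made concrete with a small sketch. Here each node carries one extra parent reference and the character of its incoming edge; the names Node, insert, and reverse_search are hypothetical and chosen for illustration only.

```python
class Node:
    def __init__(self, parent=None, char=""):
        self.parent = parent   # extra pointer, needed only for reverse search
        self.char = char       # character on the edge from the parent
        self.children = {}     # edge character -> child node
        self.final = False


def insert(root, key):
    """Insert key and return the node at which it ends."""
    node = root
    for ch in key:
        if ch not in node.children:
            node.children[ch] = Node(parent=node, char=ch)
        node = node.children[ch]
    node.final = True
    return node


def reverse_search(node):
    """Walk from node up to the root, then reverse the collected characters."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    chars.reverse()
    return "".join(chars)


root = Node()
bcd_node = insert(root, "bcd")
```

Calling reverse_search(bcd_node) collects "d", "c", "b" on the way up and returns "bcd" after the reversal.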
Compressed Trie
A trie may be further compressed. A feasible way to implement trie compression is to compress edges in a trie, that is, to merge nodes in the trie satisfying a predetermined condition. A trie compressed through node merging may be referred to as a PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) trie.
A PATRICIA trie reduces the space taken by nodes. Specifically, in a PATRICIA trie, two nodes may be merged into one node if one of the two nodes is the only child of the other node. As such, some edges in a PATRICIA trie may be labeled with strings having more than one character.
FIG. 2 depicts an example of a PATRICIA trie. In this example, the trie also consists of seven strings: “romane”, “romanus”, “romulus”, “rubens”, “ruber”, “rubicon”, and “rubicundus”. As shown in FIG. 2, nodes satisfying the above-mentioned condition are merged and corresponding edges are compressed. For example, the compressed edge between a leaf (i.e. Node 5) and its parent (i.e. Node 2) is labeled with the string “ulus”. It should be noted that, in some implementations, additional requirements may be applied to the merge condition. For example, it may be required that node merge is made only when the number of characters corresponding to the nodes to be merged reaches a predetermined number, that is, the edge is longer than a predetermined length (e.g. 2).
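As an illustration of the merge condition, the following sketch builds a regular trie from the seven strings of this example and then merges every only-child chain into a single edge. It assumes, beyond what is stated above, that a node in a final state is never merged away (otherwise its string would be lost) and that the root is kept; the names Node, build_trie, compress, and keys are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # edge label (one or more characters) -> child node
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.final = True
    return root


def compress(node):
    """Merge each non-final node that has exactly one child into that child."""
    for label in list(node.children):
        child = node.children.pop(label)
        # Assumption: final nodes are kept so that no stored string is lost.
        while len(child.children) == 1 and not child.final:
            (next_label, next_child), = child.children.items()
            label += next_label
            child = next_child
        node.children[label] = child
        compress(child)
    return node


def keys(node, prefix=""):
    """List all stored strings by concatenating edge labels."""
    out = [prefix] if node.final else []
    for label, child in sorted(node.children.items()):
        out.extend(keys(child, prefix + label))
    return out


WORDS = ["romane", "romanus", "romulus", "rubens", "ruber", "rubicon", "rubicundus"]
root = compress(build_trie(WORDS))
```

With these inputs the root keeps a single edge "r", whose child has the edges "om" and "ub"; the edge "ulus" hangs below "om", which agrees with the labels described for FIG. 2.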
A PATRICIA trie is suitable for a string set whose strings share long common prefixes. For example, a PATRICIA trie is particularly suitable for IP address management in the area of IP routing.
Succinct Data Compression
A succinct data structure is a type of data structure that may be used to implement lossless data compression algorithms. Through data compression, a succinct structure uses an amount of space that is close to the information-theoretic lower bound, but still allows for efficient query operations. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use the data “in-place” without decompressing the data first.
A succinct data structure may be defined from the perspective of space complexity. Suppose that Z is the information-theoretically optimal number of bits needed to store some data. A representation of this data is called “succinct” if it takes Z+o(Z) bits of space to store this data, where o(Z) denotes a term that grows asymptotically more slowly than Z. Accordingly, a data structure that uses Z+√Z bits of storage is succinct and another data structure that uses Z+lg Z bits is also succinct.
Succinct indexable dictionaries, also called rank/select dictionaries, form the basis of a number of succinct representation techniques. In this type of dictionary, a subset S of a universe U=[0 . . . n)={0, 1, . . . , n−1} is stored. The subset S is usually represented as a bit array B[0 . . . n), where B[i]=1 if and only if i∈S.
In addition to usual methods on dictionaries such as queries and insertions/deletions, an indexable dictionary also supports two special operations, i.e. rank and select:
rank_q(x) = |{k ∈ [0 . . . x] : B[k] = q}|; and
select_q(x) = min{k ∈ [0 . . . n) : rank_q(k) = x}
for q ∈ {0, 1}.
In other words, rank_q(x) returns the number of elements equal to q up to position x while select_q(x) returns the position of the x-th occurrence of q.
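The two definitions can be checked against a direct, linear-time sketch. The function names rank and select and the sample bit array B are illustrative assumptions; a practical succinct dictionary would answer both queries much faster with the help of o(n) bits of auxiliary tables, which this sketch does not attempt.

```python
def rank(bits, q, x):
    """Number of positions k in [0, x] (inclusive) with bits[k] == q."""
    return sum(1 for k in range(x + 1) if bits[k] == q)


def select(bits, q, x):
    """Position of the x-th occurrence of q, counting occurrences from 1."""
    count = 0
    for k, bit in enumerate(bits):
        if bit == q:
            count += 1
            if count == x:
                return k
    raise ValueError("bits contains fewer than x occurrences of q")


# Hypothetical bit array for the subset S = {0, 2, 3, 6} of U = [0 . . . 7).
B = [1, 0, 1, 1, 0, 0, 1]
```

For this array, rank(B, 1, 3) is 3 because positions 0, 2, and 3 hold a 1, and select(B, 0, 2) is 4 because the second 0 sits at position 4. The two operations are inverse in the sense that rank(B, q, select(B, q, x)) equals x.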
As mentioned above, succinct indexable dictionaries form the basis of a number of succinct representation techniques. Accordingly, for example, data structures such as binary trees, k-ary trees, multisets, and suffix trees/arrays may be represented by a succinct data structure. As a specific example, an arbitrary binary tree of n nodes can be represented by 2n+o (n) bits in the succinct representation.
LOUDS Trie
A LOUDS (Level-Order Unary Degree Sequence) trie is a trie implemented by the succinct representation. In a LOUDS trie, no pointers are used to store locations of nodes. Instead, the trie is encoded in a succinct bit string to achieve an efficient representation of the trie structure.
A LOUDS bit string may be created as follows. Starting from the root, a trie is traversed in breadth-first order, i.e. all nodes at the same level are traversed before going to the next level. When it is found that a node has d children (d>0), d “1”s and one “0” are used to represent this node. Accordingly, a leaf may be represented by a “0”. In addition, a prefix “10” is added to the LOUDS bit string and the prefix represents an imaginary super root pointing to the root. It thus can be seen that an n-node LOUDS trie consumes 2n+1 bits to represent the trie structure. FIG. 3 depicts an example of a LOUDS trie created in accordance with the rules described above.
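The construction rules above can be sketched as a breadth-first traversal. The trie below is a small hypothetical one built from the strings "ab", "ac", and "b", not the trie of FIG. 3; build_trie and louds_bits are illustrative names, and final-state flags are left out because only the structure is being encoded.

```python
from collections import deque


def build_trie(words):
    """Nested-dict trie: each node maps an edge character to its child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def louds_bits(root):
    """Return the LOUDS bit list and the edge characters in level order."""
    bits = [1, 0]            # prefix "10": imaginary super root pointing to the root
    labels = []              # character on the incoming edge of each non-root node
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for ch in sorted(node):   # one "1" per child
            bits.append(1)
            labels.append(ch)
            queue.append(node[ch])
        bits.append(0)            # a single "0" closes the node
    return bits, labels


bits, labels = louds_bits(build_trie(["ab", "ac", "b"]))
bit_string = "".join(str(b) for b in bits)
```

This trie has five nodes (the root, "a", "b", "ab", and "ac"), and bit_string comes out as "10110110000", which is 11 = 2n+1 bits for n = 5.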
In addition, a LOUDS trie may also use an n-bit array (e.g. the bit array B described above with respect to the succinct data structure) to store whether a node in the trie is a final state. It thus can be seen that the LOUDS representation may be used to implement the structure of any trie including the PATRICIA trie, and a LOUDS-implemented trie satisfies the storage requirement as specified by the succinct structure, i.e. Z+o(Z). Further, an n-byte array may be used to store the transition characters on the edges of an uncompressed n-node trie.
Through the rank and select operations of the succinct data structure, a LOUDS trie supports finding the parent (get parent) and any child (get nth child) for a node in the trie. Accordingly, in a LOUDS trie, both forward search and reverse search may be performed more efficiently than the traditional pointer-based approach.
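One common way to derive the parent and n-th child operations from rank and select is sketched below. The numbering convention (nodes numbered 1 to n in level order, with the root as node 1), the function names, and the hard-coded bit string of a hypothetical five-node trie holding "ab", "ac", and "b" are all assumptions made for illustration. The rank and select helpers are linear-time stand-ins, so the sketch shows the navigation logic only and not the performance of a real succinct implementation; final-state flags are also omitted.

```python
def rank(bits, q, x):
    """Number of positions k in [0, x] with bits[k] == q."""
    return sum(1 for bit in bits[: x + 1] if bit == q)


def select(bits, q, x):
    """Position of the x-th occurrence of q (x >= 1)."""
    count = 0
    for k, bit in enumerate(bits):
        if bit == q:
            count += 1
            if count == x:
                return k
    raise ValueError("fewer than x occurrences of q")


BITS = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # "10" prefix, then one group per node
LABELS = ["a", "b", "b", "c"]              # incoming edge character of nodes 2..5


def degree(i):
    """Number of children of node i: the 1s between the i-th and (i+1)-th 0."""
    return select(BITS, 0, i + 1) - select(BITS, 0, i) - 1


def nth_child(i, k):
    """Node number of the k-th child of node i (k >= 1), or None."""
    if not 1 <= k <= degree(i):
        return None
    return rank(BITS, 1, select(BITS, 0, i) + k)


def parent(i):
    """Node number of the parent of node i; 0 denotes the super root."""
    return rank(BITS, 0, select(BITS, 1, i))


def forward_search(key):
    """Return the node reached by key, or None if the path does not exist."""
    node = 1
    for ch in key:
        for k in range(1, degree(node) + 1):
            child = nth_child(node, k)
            if LABELS[child - 2] == ch:
                node = child
                break
        else:
            return None
    return node


def reverse_search(node):
    """Return the string spelled by the path from the root to node."""
    chars = []
    while node != 1:
        chars.append(LABELS[node - 2])
        node = parent(node)
    return "".join(reversed(chars))
```

In this example forward_search("ac") reaches node 5, and reverse_search(5) climbs through node 2 to the root and returns "ac", without any stored parent pointer.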
Based on the above description, those skilled in the art would appreciate that a trie may be compressed by merging nodes in the trie and the LOUDS representation may be used to represent the structure of a trie to support efficient forward search and reverse search.
However, data redundancy may still exist in a node-merged trie. As in the example of FIG. 2, it can be found that a common prefix “u” exists among the string “ub” on the compressed edge from Node 1 to Node 3, the string “ulus” on the compressed edge from Node 2 to Node 5, the string “us” on the compressed edge from Node 4 to Node 9, and the string “undus” from Node 7 to Node 13. Similarly, a common prefix “o” exists between the strings “om” and “on” in the trie of FIG. 2. Thus, the storage space efficiency of a trie may be improved if redundancy in a compressed trie (e.g. a PATRICIA trie) can be further reduced.