1. Field of Invention
The present invention relates generally to the field of XML node identifiers. More specifically, the present invention is related to self-adaptive prefix encoding of stable node identifiers.
2. Discussion of Prior Art
In XML databases with an object storage model, such as the XPath 2.0 and XQuery 1.0 data model or DOM (Document Object Model), node identifiers are fundamental to operations including maintaining document order, searching, updating, and concurrency control. Node identifiers are assigned when an XML document is converted into an object model and stored in a database, or when new nodes are inserted in a logical XML tree. Existing solutions for encoding node identifiers can be classified as physical or logical solutions.
In a storage system that organizes storage space using pages and records, a physical node identifier is typically either a record identifier (RID) or an extension of an RID, depending on whether a node is a record or a structure within a record. An RID is a page number followed by the index of a record within that page. The entry with that particular index within an array indicates the offset of a record within a page. If a node is a structure within a record, then a node identifier usually consists of an RID and another index to locate the node within the record. An RID can be used to locate a record; to locate a node inside a record, it is necessary to have an RID with either an offset within a record or a slot index if the record has a layout similar to a structured document page. If such objects are stored in memory, physical IDs are usually memory addresses.
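The record lookup described above can be sketched as follows. This is a minimal illustration only: the names `RID`, `Page`, and `fetch` are hypothetical, and storing a record length next to each offset in the slot array is an assumption made to keep the example short.

```python
from typing import NamedTuple


class RID(NamedTuple):
    page_no: int  # page number
    slot: int     # index into the slot array of that page


class Page:
    """A page holding raw record bytes plus a slot array of record offsets."""

    def __init__(self):
        self.data = bytearray()
        self.slots = []  # slots[i] = (offset, length) of record i (length is an assumption)

    def add_record(self, payload):
        """Append a record and return its slot index."""
        self.slots.append((len(self.data), len(payload)))
        self.data += payload
        return len(self.slots) - 1

    def read(self, slot):
        """Follow the slot entry to the record's offset within the page."""
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])


def fetch(pages, rid):
    """Locate a record from its RID: page number first, then slot index."""
    return pages[rid.page_no].read(rid.slot)
```

A node stored as a structure inside a record would need one more component, such as an offset or index within the record, in addition to the RID.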
RIDs are treated as physical storage addresses for records. One advantage of using RIDs as node identifiers is the ability to quickly position a node in physical storage. However, RIDs cannot provide for document order of nodes. Because there is no parent-child relationship information within a physical node id, it alone is unsuitable for direct use in sub-document concurrency control, which typically requires the use of ancestor-descendant relationship information. A separate structure keeping track of parent-child relationships is necessary to provide for concurrency control. Another disadvantage is that physical node ids are used as reference pointers in parent-child relationships, making re-organization of XML objects across pages difficult. This is because moving a node to a different page requires a new RID, and all references to the moved node would need to be updated to accommodate the new RID. Otherwise, a forward record, which is a record that contains an address rather than an actual record, would be needed. Because a forward record contains the address of an actual record that is physically removed from the current record, it is more costly to access in terms of input and output (I/O) operations.
An interval encoded node identifier, which is an example of a logical node identifier, uses a pair of integers specifying starting and ending positions and, optionally, a level number (startpos:endpos, levelno). The start and end positions of a node are either the logical offsets of the start and end of the node in the text of an XML document or sequence numbers corresponding to node entry and exit in a pre-order traversal of the XML object tree. For two nodes, n1 and n2, with node ids (s1:e1, l1) and (s2:e2, l2), respectively, if the start position of the first node is less than the start position of the second node and the end position of the first node is greater than the end position of the second node, then the first node is an ancestor of the second node. In addition, if the level number of the second node is one greater than the level number of the first node, then the first node is the parent of the second node.
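The ancestor and parent tests for interval encoded identifiers can be sketched as follows. This is an illustration under stated assumptions: the names `IntervalId`, `assign_ids`, `is_ancestor`, and `is_parent` are hypothetical, the tree is given as nested `(name, children)` tuples, positions are entry and exit sequence numbers starting at 1, and the root is placed at level 0.

```python
from typing import NamedTuple


class IntervalId(NamedTuple):
    start: int  # sequence number on node entry
    end: int    # sequence number on node exit
    level: int  # depth in the tree (root = 0, an assumption)


def assign_ids(tree, level=0, counter=None, out=None):
    """Assign (start:end, level) ids during a depth-first traversal."""
    if counter is None:
        counter, out = [0], {}
    name, children = tree
    counter[0] += 1
    start = counter[0]
    for child in children:
        assign_ids(child, level + 1, counter, out)
    counter[0] += 1
    out[name] = IntervalId(start, counter[0], level)
    return out


def is_ancestor(a, d):
    """a is an ancestor of d when a's interval strictly contains d's."""
    return a.start < d.start and a.end > d.end


def is_parent(p, c):
    """p is the parent of c when it is an ancestor exactly one level up."""
    return is_ancestor(p, c) and p.level + 1 == c.level


ids = assign_ids(("a", [("b", [("c", [])]), ("d", [])]))
print(ids["a"], ids["c"])               # a = (1, 8, 0), c = (3, 4, 2)
print(is_ancestor(ids["a"], ids["c"]))  # True
print(is_parent(ids["a"], ids["c"]))    # False: a is the grandparent of c
```

Sorting the identifiers by start position yields document order.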
This type of logical encoding of node ids is commonly used in relational representations of an XML object model without requiring a relationship between an interval encoding representation and a physical storage address. In addition, such an encoding method is better suited for read-only documents. In order to deal with insertion and update operations, gaps in the sequence number space are typically left between consecutive node identifiers for insertions. When the reserved sequence numbers are completely exhausted, existing node identifiers must be modified, which makes insertion an expensive operation.
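The exhaustion problem can be illustrated with a simplified sketch. It tracks a single sequence number per node rather than a start and end pair, and the gap size `GAP`, the midpoint policy, and the function names are all assumptions chosen for illustration.

```python
GAP = 8  # hypothetical spacing reserved between consecutive sequence numbers


def midpoint(lo, hi):
    """Return an unused integer strictly between lo and hi, or None if none is left."""
    return (lo + hi) // 2 if hi - lo > 1 else None


def insert_after(positions, i):
    """Insert a new position after positions[i].

    Returns True when the reserved gap was exhausted and every
    existing position had to be renumbered.
    """
    new = midpoint(positions[i], positions[i + 1])
    if new is not None:
        positions.insert(i + 1, new)
        return False
    # Gap exhausted: renumber all nodes (including the new one) with fresh gaps.
    count = len(positions) + 1
    positions[:] = [GAP * (k + 1) for k in range(count)]
    return True


positions = [8, 16, 24]
results = [insert_after(positions, 0) for _ in range(4)]
print(results)  # [False, False, False, True]
```

With a gap of 8, three insertions at the same spot succeed and the fourth forces a renumbering of every identifier, which is the expensive case described above.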
Another method of logically encoding node identifiers is prefix encoding. This method uses a concatenation of numbers (local identifiers) for nodes along the path from the root node to a particular node to generate a node identifier. Local identifiers are assigned to the children of a parent based on their sequence, optionally with reserved spaces between local identifiers for future insertions. Prefix-encoded identifiers are used for document ordering, ancestor-descendant and parent-child relationships, and also for sub-document concurrency control. This method of encoding produces node identifiers that can be clustered. If a standard clustering index on a set of node identifiers is created, natural clustering will be in document order, that is, the order corresponding to a depth-first traversal. If a level number indicating the logical level on which a node is situated is prefixed to each node identifier and an index is created on the result, clustering order will correspond to a breadth-first traversal. However, existing encoding methods create maintenance concerns when insertion or update operations are performed.
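A minimal sketch of prefix-encoded identifiers follows. It models an identifier as a tuple of local identifiers, which is an assumption made for illustration and not the variable-length binary encoding of the present invention; the function names and the choice of even local identifiers (leaving odd values in reserve) are likewise hypothetical.

```python
def child_id(parent, local):
    """Concatenate a local identifier onto the parent's identifier."""
    return parent + (local,)


def is_ancestor(a, d):
    """a is an ancestor of d when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p, c):
    """p is the parent of c when c is p plus exactly one local identifier."""
    return len(c) == len(p) + 1 and c[:-1] == p


root = (1,)
b = child_id(root, 2)   # first child; local ids 2, 4, ... leave gaps
c = child_id(b, 2)      # child of b
d = child_id(root, 4)   # second child of root
nodes = [d, c, root, b]

# Plain lexicographic order is document (depth-first) order.
print(sorted(nodes))
# [(1,), (1, 2), (1, 2, 2), (1, 4)]

# Prefixing the level (here, the identifier length) gives breadth-first order.
print(sorted(nodes, key=lambda n: (len(n), n)))
# [(1,), (1, 2), (1, 4), (1, 2, 2)]
```

A node inserted between `b` and `d` could take the reserved local identifier 3, giving `(1, 3)`. Once the reserved values between two siblings run out, this fixed numbering scheme has no identifier left to assign without renumbering.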
Existing encoding schemes are not well equipped to handle identifier maintenance issues in the face of arbitrary insertions. Current research efforts are directed toward using statistics for optimal encoding and number space allocation for node insertion. However, statistics are not available for new documents during the initial phase of database population. In addition, the use of statistics to generate node identifiers cannot guarantee that an existing encoding method will be sufficient in the face of arbitrary insertions; it can only increase the probability that assigned node identifiers will not need to be changed.
U.S. Pat. No. 6,563,441 B1 discloses a program for decoding variable length codes and generating a binary tree that represents the coding scheme and, from the binary tree, a lookup table that can be used to decode variable-length codes having length less than or equal to a threshold length. The method comprises obtaining data that defines connections between a root node and a plurality of other nodes such that each of the other nodes is a child node that connects to one respective parent node, each parent node connects to at most two child nodes, and the connections between a parent node and its respective child nodes are associated with either of two binary values. A node that does not connect to any child node is a leaf node, and the system obtains data that defines a respective value for each leaf node. The method further comprises generating a binary tree data structure representing the root node and the other nodes, with branches having binary values and connecting the nodes according to the data that defines connections. No valid code is the prefix of any other valid code; therefore, a stream of encoded information can be parsed unambiguously into codes without requiring any special symbols or controls to mark the boundary between codes.
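The prefix-free property mentioned above can be illustrated with a generic sketch. This is not the method of the cited patent: the code table, the dictionary-based tree, and the function names are assumptions, and the threshold-length lookup table is not modeled.

```python
def build_tree(codes):
    """Build a binary tree from a symbol-to-bitstring table.

    Each branch is labeled "0" or "1"; each leaf stores its symbol.
    """
    root = {}
    for symbol, bits in codes.items():
        node = root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["leaf"] = symbol
    return root


def decode(root, bitstream):
    """Walk the tree bit by bit; reaching a leaf ends the current code."""
    out, node = [], root
    for bit in bitstream:
        node = node[bit]
        if "leaf" in node:
            out.append(node["leaf"])
            node = root  # no boundary marker is needed between codes
    return out


# Hypothetical prefix-free code: no code is a prefix of another.
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
tree = build_tree(codes)
print(decode(tree, "0101100111"))  # ['a', 'b', 'c', 'a', 'd']
```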
U.S. Pat. No. 6,587,057 B2 discloses a high-performance, memory-efficient variable-length coding decoder, with code words grouped by prefix and recorded to reduce the number of bits that must be matched, thus reducing the memory requirements.
U.S. Pat. No. 6,539,369 B2 discloses a method for storing sparse and dense sub-trees in a longest prefix match lookup table. The sparse sub-tree descriptor stores at least one node descriptor. The node descriptor describes a set of leaves in the sparse sub-tree having a common value; and the common value is encoded in the node descriptor using run length encoding.
U.S. Pat. No. 6,313,766 discloses a method for accelerating software decode of variable length encoded information, with a logic device that outputs a fixed length value corresponding to a variable length code received as part of the bit stream of the variable length encoded information.
U.S. Pat. No. 5,883,589 discloses a variable length code construction apparatus, with a prefix processing unit for producing a codeword including at least one “1” bit, the prefix of the codeword having continuous “0” bits.
U.S. 2002/0145545 A1 discloses entropy coding using adaptable prefix codes.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.
Therefore, there is a need in the art for a self-adaptive and efficient prefix-encoding method for stable node identifiers. The method of the present invention is self-adaptive in that shorter encodings are used for a smaller number of nodes and longer encodings are used for a larger number of nodes. It is not required to have knowledge of the number of nodes before node identifiers are assigned. The encoding method of the present invention allows for arbitrary insertion: existing node identifiers do not have to be modified when a node is inserted in order to keep node identifiers in document order. It also follows a basic prefix encoding method, thus having all the properties of a prefix encoding. However, the method of the present invention is unique in that encodings of existing nodes are stable, meaning that they do not need to be changed, regardless of the number and placement of inserted nodes. This property holds true because a node identifier is not modeled as a fixed string of decimal numbers, but rather as a variable-length binary string.