1. Field of the Invention
The present invention relates generally to an XML compression technique capable of efficiently storing and managing data expressed in Extensible Markup Language (XML) (reference: T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler, Extensible Markup Language (XML) 1.0 (Second Edition), W3C), which is a standard for data representation and exchange on the Internet, and, more particularly, to a compression technique that performs compression using reverse arithmetic encoding and a type inference engine, thus allowing XML queries on compressed XML data to be processed directly and efficiently.
2. Description of the Related Art
Extensible Markup Language (XML) data is a collection of hierarchically nested elements, each of which is delimited by a start tag and an end tag. To search such XML data, XML query languages, such as XPath (reference: J. Clark and S. DeRose, XML Path Language (XPath) Version 1.0, W3C) and XQuery (reference: S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu, XQuery 1.0: An XML Query Language, W3C), have been proposed. Because XML data is irregular in structure, these query languages are based on path expressions, which consist of sequences of the tags of the XML data. Accordingly, it is important to support path expressions on XML data.
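As an illustration of a path expression, the following minimal sketch evaluates a simple tag path against a small document. It uses the limited XPath subset of the Python standard library module `xml.etree.ElementTree`; the sample document and the function name `titles_of_books` are hypothetical and do not appear in the cited references.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<library>
  <book><title>A</title><price>10</price></book>
  <book><title>B</title><price>25</price></book>
  <journal><title>C</title></journal>
</library>"""


def titles_of_books(xml_text):
    """Return the text of every title element reached by the path book/title."""
    root = ET.fromstring(xml_text)
    # "./book/title" is a path expression: a sequence of tags that is
    # followed downward from the root element of the document.
    return [elem.text for elem in root.findall("./book/title")]


if __name__ == "__main__":
    print(titles_of_books(SAMPLE))
```

The title of the journal element is not selected, because its path from the root (library, journal, title) does not match the tag sequence of the expression.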
Data compression techniques are divided into lossy compression and lossless compression according to whether the original data can be exactly restored by decompression. Conventional XML data compression techniques include XMill (reference: H. Liefke and D. Suciu, XMill: An Efficient Compressor for XML Data, ACM SIGMOD 2000) and XGrind (reference: P. M. Tolani and J. R. Haritsa, XGRIND: A Query-friendly XML Compressor, IEEE ICDE 2002).
XMill is a compression technique that aims to minimize the size of compressed XML data, and it does not support querying compressed XML data directly. XMill manages the tags and attribute names of XML data physically separately from the data values thereof. Accordingly, the structure of the compressed XML data differs from that of the original XML data. The data values are classified according to the tags of the corresponding elements and stored in data structures called containers. In this case, a user can classify data values in finer detail using path expressions. The tags and attribute names of the XML data are compressed using a dictionary encoding technique, that is, a technique of assigning a unique integer value to each distinct word of the input data and replacing the words with those integer values. If a user-defined encoding technique exists for a container, the data values stored in that container are compressed using the user-defined encoding technique. Finally, the data is compressed once more using zlib, a well-known data compression library. Because the data values have been classified according to tags, the values within a container are similar in terms of syntax or semantics, and a superior compression ratio is therefore obtained. However, there is a disadvantage in that the data must be decompressed before queries can be performed.
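The separation of structure from data values described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual XMill implementation or file format: attributes and user-defined encoders are ignored, the integer 0 is assumed as an end-tag marker, and the names `xmill_like_compress` and `SAMPLE` are hypothetical. Only `xml.etree.ElementTree` and `zlib` from the Python standard library are used.

```python
import xml.etree.ElementTree as ET
import zlib

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<library>"
    "<book><title>A</title><price>10</price></book>"
    "<book><title>B</title><price>25</price></book>"
    "</library>"
)


def xmill_like_compress(xml_text):
    """Split a document into a tag dictionary, a structure stream and containers."""
    root = ET.fromstring(xml_text)
    tag_ids = {}     # dictionary encoding: tag name -> unique integer
    structure = []   # tag integers and end markers, stored apart from the values
    containers = {}  # tag name -> data values of the elements having that tag

    def visit(elem):
        # Assign the next integer (starting at 1) the first time a tag is seen.
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        structure.append(tag_id)
        text = (elem.text or "").strip()
        if text:
            containers.setdefault(elem.tag, []).append(text)
        for child in elem:
            visit(child)
        structure.append(0)  # assumed end-tag marker

    visit(root)
    # Each container holds similar values, so it is compressed on its own.
    compressed = {
        tag: zlib.compress("\n".join(values).encode("utf-8"))
        for tag, values in containers.items()
    }
    return tag_ids, structure, compressed


if __name__ == "__main__":
    ids, stream, packed = xmill_like_compress(SAMPLE)
    print(ids, stream, sorted(packed))
```

In this sketch the price values can be recovered only by decompressing the whole price container, which mirrors the disadvantage noted above: a query cannot be evaluated on the compressed form.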
XGrind is an XML compressor that supports direct queries on compressed XML data. Unlike XMill, it is a homomorphic compression technique, in which the compressed XML data maintains the structure of the original XML data. In XGrind, data values are compressed using Huffman encoding (reference: D. A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the Institute of Radio Engineers, 1952) or dictionary encoding, while tags and attribute names are compressed using dictionary encoding. XGrind determines whether Huffman encoding or dictionary encoding is applied to the data value of an element having a certain tag by using a Document Type Definition (DTD), which describes the structure of the XML data.
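The notion of homomorphic compression can be sketched as follows: every tag and data value is replaced in place by a short code, so that the compressed stream retains the nesting and document order of the original. The token layout used here ("T" plus an integer for a start tag, "V" plus an integer for a data value, "/" for an end tag) is an assumption made for illustration and is not taken from the XGrind reference. For brevity, dictionary encoding is used for the data values as well, and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def homomorphic_encode(xml_text):
    """Encode tags and values in place, keeping the original document order."""
    root = ET.fromstring(xml_text)
    tag_ids = {}    # dictionary encoding of tag names
    value_ids = {}  # dictionary encoding of data values (simplification)
    tokens = []     # compressed stream with the same nesting as the input

    def visit(elem):
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids))
        tokens.append(f"T{tag_id}")          # start tag
        text = (elem.text or "").strip()
        if text:
            value_id = value_ids.setdefault(text, len(value_ids))
            tokens.append(f"V{value_id}")    # data value, encoded in place
        for child in elem:
            visit(child)
        tokens.append("/")                   # end tag

    visit(root)
    return tag_ids, value_ids, tokens


if __name__ == "__main__":
    print(homomorphic_encode("<a><b>x</b><b>y</b></a>"))
```

Because the start and end tokens appear in the same positions as the original tags, a query processor can walk the compressed stream in the same way it would walk the original document.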
To process path expressions in XGrind, the query processor bears the burden of determining, whenever it visits an element, the path from the root element to that element, that is, a sequence of tags, and of examining whether the path satisfies the path expression. Furthermore, to perform a range query, which searches for elements whose data values fall within a certain range, partial decompression of the data values is required. The reason for this is that Huffman encoding and dictionary encoding do not preserve order, so that the result of comparing two encoded values may differ from the result of comparing the corresponding original data values.
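The order problem can be illustrated with a small Huffman coder. The following is a minimal sketch, not the coder used by XGrind; the function name `huffman_codes` and the sample string are hypothetical, and the tie-breaking rule is an arbitrary choice of this sketch.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code table (symbol -> bit string) for the symbols of text."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, {symbol: partial code}); the
    # unique tie_breaker keeps the dictionaries from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    codes = huffman_codes("9999115")  # '9' occurs 4 times, '1' twice, '5' once
    print(codes)
    # Compare the order of two original values with the order of their codes.
    print("1" < "5", codes["1"] < codes["5"])
```

With this input the frequent symbol "9" receives the shortest code, and the code of "1" sorts after the code of "5" although "1" is smaller than "5". A range predicate therefore cannot be evaluated by comparing the encoded values alone, which is why the data values must be partially decompressed.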