Recently, attention has been drawn to XML (Extensible Markup Language) as a means of expressing data for use on the Internet. XML is an extensible meta language, and a user can define its grammar independently. In addition, since XML can provide logical meaning for each element, it is much easier to use than HTML (Hypertext Markup Language) for data processing. It is therefore anticipated that XML will become the standard expression method and will be used for the structural languages employed, for example, for the exchange of e-commerce documents. Note that the specification for XML is contained in “W3C.Extensible Markup Language (XML) 1.0, 1998, http://www.w3.org/TR/REC-xml”.
Since characters are used to write XML data, its readability is high, as is its redundancy. Specifically, the meaning of an element is written mainly between start and end tags, and can be easily understood merely by referring to the contents, which are themselves written using characters. However, because characters are used to write all the contents, the total amount of data (the character count) required for an XML document is increased. As a result, a large memory capacity is required for the storage of the data, and the time and cost of transferring the data via a network increase along with the amount of data. Therefore, it would be convenient were XML data encoded (or compressed) to shorten the length of the code employed.
A variety of well-known data compression methods are presently available, including run-length coding, Huffman coding, arithmetic coding and LZ77. Detailed descriptions of these compression methods may be found, for example, in: “A method for the construction of minimum-redundancy codes”, Huffman, D. A., Proc. of the IRE, September, 1952; “The Data Compression Book”, Mark Nelson and Jean Loup Gailly, Second Edition, M&T Books 1996; and “A universal algorithm for sequential data compression”, Jacob Ziv and Abraham Lempel, IEEE Transactions on Information Theory, May, 1977.
However, these compression methods were not specially prepared for XML, and when they are used for XML data, compression efficiency is not always high. Examples of specialized compression methods for XML data are: XMill, described in “XMill: an Efficient Compressor for XML data, 1999, http://www.research.att.com/sw/tools/xmill/”, D. Suciu and H. Liefke; XMLZip, described in “XMLZip, 1999, http://www.xmls.com/products/xmlzip/xmlsip.html”, XML Solutions Corp.; and XComp, described in “Study for an XML document compression algorithm using DTD”, Kousaku Ikawa, graduation thesis prepared for the Information engineering course given by the Technology department of the Tokyo Institute of Technology, February, 2000.
According to the XMill reference, the content (text) portion of each element is extracted from XML data, and this extracted portion is referred to as a container. Then the structural portion is encoded using numerals, and subsequently, the text portion for each container is compressed using a compression method such as LZ77. Basically, data compression can be performed by an application without additional information, such as parameters, being required. As needed, a compression method for each container can be designated by setting a parameter, a process that increases compression efficiency. Further, since C is used to implement XMill, the compression speed is high.
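As an illustration of the container idea described for XMill, the following is a minimal sketch, not the XMill implementation itself. It assumes one container per tag name, a hypothetical numeric structure stream in which 0 marks the end of an element, and zlib (DEFLATE, which is LZ77-based) as the per-container compressor. The names `separate` and `compress_containers` are hypothetical, and text that follows a child element (mixed content) is ignored for brevity.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def separate(xml_text):
    """Split an XML string into a numeric structure stream and text containers."""
    root = ET.fromstring(xml_text)
    tag_ids = {}                    # tag name -> small integer code (assumed scheme)
    structure = []                  # numeric encoding of the element tree
    containers = defaultdict(list)  # tag name -> text values in document order

    def walk(elem):
        code = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        structure.append(code)      # start of an element
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
        for child in elem:
            walk(child)
        structure.append(0)         # 0 marks the end of an element

    walk(root)
    return tag_ids, structure, dict(containers)


def compress_containers(containers):
    """Compress each container independently with zlib."""
    return {
        tag: zlib.compress("\n".join(values).encode("utf-8"))
        for tag, values in containers.items()
    }


doc = "<book><title>XML</title><author>A</author><author>B</author></book>"
tag_ids, structure, containers = separate(doc)
packed = compress_containers(containers)
print(structure)    # numeric structure stream
print(containers)   # text grouped by tag name
```

Grouping text by tag name places similar values next to each other, which is the property the container approach relies on when a general-purpose compressor is applied to each container.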
According to the XMLZip reference, a depth from the root element is designated, and the designated portion is separated from the document element, following which ZIP is used to compress the remaining portion. Therefore, since the root element is not encoded, operations on it can be performed directly. Further, since only the portion that is not so used is compressed, rapid document access is possible. It should be noted, however, that the compression efficiency provided by XMLZip is lower than that provided by XMill.
According to the XComp reference, of the structural portions that constitute XML data, a portion that is uniquely determined by employing a DTD (Document Type Definition) is not encoded, and only the portion that cannot be uniquely determined is compressed. Compression of the text portion is performed in the same manner as it is for XMill. That is, for data compression, the following procedures are performed: (1) the XML data is divided into structure and content; (2) the DTD is used to generate a push-down automaton (PDA); (3) the PDA is used to generate an encoding transducer for encoding the structural portion; (4) while the structure is encoded, the automaton makes successive state transitions, and the numbers allocated to the individual nodes of the encoding transducer are output; and (5) a method such as LZ77 is used to compress the code obtained for the structure and the contents of the elements, following which the compressed XML document is output.
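The principle that a DTD-determined portion of the structure need not be encoded can be sketched as follows. This is a simplified illustration and not the XComp implementation: it uses a hand-written finite automaton for a single, hypothetical declaration `<!ELEMENT book (title, author*)>` instead of a PDA generated from a DTD, and the names `AUTOMATON` and `encode_children` are assumptions. A state with only one outgoing transition emits nothing, because the DTD already determines the next step; a state with several transitions emits the index of the chosen one.

```python
# Hypothetical automaton for <!ELEMENT book (title, author*)>.
# Each state maps to a list of (label, next_state) transitions.
AUTOMATON = {
    "start": [("title", "after_title")],
    "after_title": [("author", "after_title"), ("END", "done")],
}


def encode_children(children, automaton, state="start"):
    """Return the indexes emitted while walking the child sequence."""
    codes = []
    for label in list(children) + ["END"]:
        transitions = automaton[state]
        labels = [name for name, _ in transitions]
        index = labels.index(label)   # raises ValueError for an invalid document
        if len(transitions) > 1:      # only a genuine choice costs an index
            codes.append(index)
        state = transitions[index][1]
    return codes


print(encode_children(["title", "author", "author"], AUTOMATON))
```

In this sketch the mandatory `title` element produces no output at all, while each `author` element and the end of the list each produce one index.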
Of the specialized compression methods employed for XML documents, a comparatively higher compression efficiency can be obtained with XComp, which does not encode part of the structural portion.
Thus, for the compression of XML documents, XComp is the superior method. However, according to a study performed by the present inventor, when XComp is used to process XML data having a specific structure, compression efficiency is reduced. That is, when the “?” operator or the “*” operator (including the “+” operator) is employed for elements, compression efficiency is reduced.
The “?” operator is attached to a child element of a specific element in an element type declaration when the child element either does not appear or appears only once. According to XComp, when the “?” operator appears, it is represented by multiple choices, each of which is a transition from one specific state to another, and an index is provided for each choice. Thereafter, when XComp is executed, the index provided for the selected choice is output. The number of available choices varies in accordance with how many “?” operators follow an initial “?” operator; when n “?” operators appear in succession, for example, n+1 choices are present for the first “?” operator. Therefore, n+1 indexes are required, and O(log n) bits are needed to represent one index. Thus, when all the elements for which the “?” operators are employed are present, multiple indexes are enumerated, and since O(log n) bits are required to represent one index, O(n log n) bits are required to represent all the indexes.
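The O(n log n) growth for a run of “?” operators can be checked with a small calculation. The following sketch is one reading of the scheme described above and not code from XComp: it assumes that, from each state, the choices are "any remaining optional element" or "end of the run", and that each index is written as a fixed-length binary number. The function names are hypothetical.

```python
def index_bits(num_choices):
    """Bits for a fixed-length index over num_choices alternatives (assumed coding)."""
    return (num_choices - 1).bit_length()


def optional_run_bits(n, present):
    """Bits needed to encode which of n consecutive '?' elements appear.

    present: sorted 0-based positions of the optional elements that appear.
    """
    bits = 0
    position = 0                          # first element that may still appear
    for p in present:
        choices = (n - position) + 1      # a remaining element, or end of run
        bits += index_bits(choices)
        position = p + 1
    bits += index_bits((n - position) + 1)  # final choice: end of the run
    return bits


for n in (3, 8, 64):
    print(n, optional_run_bits(n, list(range(n))))
```

When every optional element is present, one index is emitted per element and the number of choices shrinks from n+1 down to 1, so the total is the sum of about log2(k) over k = 2, ..., n+1, which grows as n log n.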
The “*” operator is attached to a child element of a specific element in an element type declaration when the child element either does not appear or appears one or more times. The “+” operator is attached to a child element of a specific element in an element type declaration when the child element appears one or more times. According to XComp, when the “*” operator (or the “+” operator) appears, it is represented by two choices, one of which remains in the same state and the other of which shifts to a different state, and an index is provided for each choice. Upon the execution of XComp, the index provided for the selected choice is output, and when multiple elements for which the “*” operator is employed are present, multiple like indexes are enumerated. Thus, since the number of indexes is proportional to the number of elements that are present, when n elements are present, O(n) bits are required to represent all the indexes.
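The linear cost for the “*” operator can be made concrete with a short sketch. It assumes, as one reading of the description above, that each two-way choice is written as a single bit, with 1 meaning "remain in the same state" (another repetition) and 0 meaning "shift to a different state". The function names are hypothetical and this is not code from XComp.

```python
def star_indexes(n):
    """Index sequence for n repetitions under '*': 1 = repeat, 0 = leave the loop."""
    return [1] * n + [0]


def star_bits(n):
    """One bit per two-way choice, so n repetitions cost n + 1 bits."""
    return len(star_indexes(n))


print(star_indexes(3))
print(star_bits(1000))
```

The output consists almost entirely of identical indexes, which is the enumeration of like indexes that the text refers to.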
That is, with XComp, the number of code bits needed to represent a specific portion of the structural portions that are not uniquely determined by the DTD is increased, and a satisfactory compression efficiency cannot always be obtained.