1. Field of the Invention
The present invention relates to an efficient compression algorithm for Extensible Markup Language (XML) documents, and more particularly to a computer system, method, and computer-readable code for schema-driven compression of XML documents.
2. Description of the Related Art
The Binary XML Content Format Specification (e.g., WBXML, an acronym for Wireless Application Protocol Binary XML) defines a compact binary representation of the Extensible Markup Language (XML). (“XML” is a trademark of Massachusetts Institute of Technology.) The binary XML content format is designed to reduce the transmission size of XML documents with no loss of functionality or semantic information.
For example, it preserves the element structure of XML, allowing a browser to skip unknown elements or attributes. More specifically, it encodes tag names, attribute names, and attribute values as tokens (e.g., a token may be a single byte). Tokens (e.g., application tokens) are split into a set of overlapping code spaces, so that a particular token's meaning depends on the context in which it is used. Tokens fall into two classes: global tokens and application tokens.
Global tokens are assigned a fixed set of codes in all contexts and are unambiguous in all situations. Global codes are used to encode inline data (e.g., strings, entities, opaque data, etc.) and to encode a variety of miscellaneous control functions.
Application tokens have a context-dependent meaning and are split into two overlapping code spaces. These two code spaces are the tag code space and the attribute code space. A given token value (e.g., 0x99, a hexadecimal value corresponding to the decimal value 153) will have a different meaning depending on whether it represents a token in the tag or attribute code space. The tag code space represents specific tag names. Each tag token is a single-byte code and represents a specific tag name (e.g., CARD).
The attribute code space is split into two numeric ranges representing attribute prefixes and attribute values respectively.
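The context-dependent lookup described above can be illustrated with a minimal Python sketch. The token tables, the names in them, and the boundary of 0x80 between the two attribute ranges are assumptions made for illustration only; actual assignments are document-type-specific.

```python
# Hypothetical token tables; real assignments are document-type-specific.
TAG_TOKENS = {0x99: "CARD"}
ATTR_PREFIX_TOKENS = {0x19: 'STYLE="'}  # assumed lower range: attribute prefixes
ATTR_VALUE_TOKENS = {0x99: "LIST"}      # assumed upper range: attribute values

ATTR_VALUE_BASE = 0x80  # assumed split point between the two attribute ranges


def decode_token(token: int, code_space: str) -> str:
    """Resolve a single-byte application token in the given code space."""
    if code_space == "tag":
        return "tag:" + TAG_TOKENS[token]
    if code_space == "attribute":
        if token < ATTR_VALUE_BASE:
            return "attribute-prefix:" + ATTR_PREFIX_TOKENS[token]
        return "attribute-value:" + ATTR_VALUE_TOKENS[token]
    raise ValueError(f"unknown code space: {code_space}")


# The same byte value resolves differently in each code space.
print(decode_token(0x99, "tag"))
print(decode_token(0x99, "attribute"))
```

In this sketch the decoder must track which code space it is in, since the byte value alone does not identify the token.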
Each code space (i.e., both the tag code space and the attribute code space) is further split into a series of 256 code pages. Code pages allow for future expansion of the well-known codes. A single token (e.g., SWITCH_PAGE) switches between the code pages. The definition of tag and attribute codes is document-type-specific. Global codes are divided between a generic set of codes common to all document types and a set reserved for document-type-specific extensions.
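Code-page switching can be sketched as follows. The value chosen for SWITCH_PAGE, the convention that the next byte carries the page number, and the per-page tag tables are all assumptions made for illustration.

```python
SWITCH_PAGE = 0x00  # assumed value of the page-switch token

# Hypothetical tag tables, one per code page.
TAG_PAGES = {
    0: {0x05: "CARD", 0x06: "DO"},
    1: {0x05: "INVOICE"},
}


def decode_tags(stream: bytes) -> list:
    """Decode tag tokens, honoring page switches embedded in the stream."""
    page = 0
    names = []
    i = 0
    while i < len(stream):
        token = stream[i]
        if token == SWITCH_PAGE:
            # Assumed layout: the byte after SWITCH_PAGE is the new page number.
            page = stream[i + 1]
            i += 2
            continue
        names.append(TAG_PAGES[page][token])
        i += 1
    return names


# Token 0x05 means CARD on page 0 but INVOICE after switching to page 1.
print(decode_tags(bytes([0x05, 0x06, SWITCH_PAGE, 0x01, 0x05])))
```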
Huffman and Lempel-Ziv (LZ77 and LZ78) algorithms (ZLIB (Zip LIBrary) and GZIP (GNU (GNU's Not Unix) ZIP) are two implementations of these algorithms) are known for compressing text data. However, when XML documents are compressed with these algorithms, the structural information is not necessarily maintained in the compressed form, so that the documents cannot be easily reconstructed. Moreover, in applying these algorithms, some (if not all) structural information cannot be retrieved without prior decompression because the compressed stream is a flat byte stream.
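The need for prior decompression can be illustrated with Python's standard zlib module. The sample document and its element names are hypothetical.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical sample document.
xml_doc = b"<deck><card id='a'>Hello</card><card id='b'>World</card></deck>"
compressed = zlib.compress(xml_doc)

# The compressed form is an opaque byte string: an XML parser cannot read it.
try:
    ET.fromstring(compressed)
    parsed_without_decompression = True
except ET.ParseError:
    parsed_without_decompression = False

# The structure becomes accessible only after the whole stream is decompressed.
root = ET.fromstring(zlib.decompress(compressed))
card_ids = [card.get("id") for card in root.iter("card")]
print(parsed_without_decompression, card_ids)
```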
Further, prior to the present invention, separating the markup (e.g., structure such as element names and attribute names and values) from the non-markup (data), and compressing the non-markup component using ZLIB and the markup component using binary coding, had not been performed.
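The general idea of such a separation can be sketched as follows. This is only an illustration under stated assumptions, not the inventive encoding itself: the tag token table, the END token, and the use of a NUL byte to delimit text items are hypothetical, and attributes and trailing text are ignored for brevity.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical single-byte tag tokens; a real table would come from the DTD.
TAG_TOKENS = {"order": 0x05, "item": 0x06}
END = 0x01  # assumed token marking the end of an element


def split_and_compress(xml_text: str) -> tuple:
    """Return (binary-coded markup stream, ZLIB-compressed data stream)."""
    structure = bytearray()
    data = []

    def walk(elem):
        structure.append(TAG_TOKENS[elem.tag])  # markup goes to the token stream
        data.append(elem.text or "")            # character data goes to the data stream
        for child in elem:
            walk(child)
        structure.append(END)

    walk(ET.fromstring(xml_text))
    data_stream = zlib.compress("\x00".join(data).encode("utf-8"))
    return bytes(structure), data_stream


structure, data_stream = split_and_compress(
    "<order><item>pen</item><item>ink</item></order>"
)
# The element structure can be read from the token stream directly,
# while the character data is recovered only by decompressing.
print(structure.hex(), zlib.decompress(data_stream).split(b"\x00"))
```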
Further, a binary-encoded markup component that retains the structure occupies approximately twice as much space as the ZLIB equivalent, which loses the structure. This forces a tradeoff between compressing the structure component at a higher compression rate and retaining the structure.
That is, the exemplary data compression algorithms (Huffman, LZ77, LZ78, and Millau (the inventive algorithm)) are all lossless, but the traditional algorithms (Huffman, LZ) require prior decompression (a time-costly operation) to retrieve the structure, whereas the inventive format does not need decompression to retrieve the structure encoded in binary format. Thus, prior to the invention, a tradeoff was required between compression rate and decompression time.
Further, the conventional methods perform poorly on relatively small documents (such as eBusiness transactions) because they are designed to take advantage of redundancy in the information, which is not significant in small documents. They were not designed to take advantage of the structure. In contrast, as described below, the present invention is designed to take advantage of the structure described in the Document Type Definition (DTD), so it performs well on small documents as well as large documents.
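The effect of document size on a conventional compressor can be observed with a short Python sketch using the standard zlib module. The sample documents are hypothetical, and the large document is deliberately built to be highly redundant.

```python
import zlib

# Hypothetical small transaction and a highly redundant large document.
small = b"<order id='1'><item>pen</item></order>"
large = small * 200


def ratio(doc: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(doc)) / len(doc)


print(f"small: {len(small)} bytes, ratio {ratio(small):.2f}")
print(f"large: {len(large)} bytes, ratio {ratio(large):.2f}")
```

The small document offers little redundancy to exploit, and the fixed overhead of the compressed format weighs more heavily on it, so its ratio is far worse than that of the large document.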