Extensible markup language (XML) is a formalism for marking up text-based data with additional information, such as graphical layout information or semantic tags. XML allows elements to be specified, and allows such elements to be nested, through the use of so-called document type definitions (DTDs). It is thus possible to describe a large variety of data using XML syntax. Partly as a consequence, XML and XML-based DTDs have attracted widespread interest in the technical community, in particular for activities within the World Wide Web Consortium (W3C) and the Wireless Application Protocol (WAP) Forum, where XML is used to specify data exchanged between at least two computing entities.
In addition to the DTD mechanism, the W3C has more recently been promoting XML Schema, a formalism for describing schemas for XML data. As well as describing the element and attribute structure of the data, as DTDs do, XML Schema allows the textual data elements themselves to be constrained; with DTDs, these can only be unconstrained strings.
One of the problems of XML-based data, whether described by DTDs or XML Schemas, is its verbosity. All data is textual in nature and thus has a low information density. For example, tag and attribute names are drawn from a small, finite vocabulary for which a plain text representation is inefficient. Moreover, the processing of textual XML data consumes both space and time.
Processing time may be attributed to two activities: 1) recognizing and verifying the structure of the data (i.e. parsing); and 2) interpreting the individual data values. For instance, to encode a colour set of (red, blue, green), a computing entity would have to handle the strings “red”, “blue”, or “green” rather than manipulating a 2-bit binary value. Additionally, string operations such as comparison and searching for patterns within a string are of linear complexity in the length of the string.
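The colour example can be made concrete with a minimal sketch. The class `Colour`, the numeric codes assigned to each colour, and the helper names `pack_colours` and `unpack_colours` are assumptions introduced purely for illustration; they are not part of any specification cited here.

```python
from enum import IntEnum


class Colour(IntEnum):
    """Hypothetical 2-bit codes for the colour set; the numbering is an assumption."""
    RED = 0
    BLUE = 1
    GREEN = 2


def pack_colours(colours):
    """Pack a sequence of Colour values into bytes, four 2-bit codes per byte."""
    out = bytearray((len(colours) + 3) // 4)
    for i, colour in enumerate(colours):
        out[i // 4] |= int(colour) << (2 * (i % 4))
    return bytes(out)


def unpack_colours(data, count):
    """Recover `count` Colour values from bytes produced by pack_colours."""
    return [Colour((data[i // 4] >> (2 * (i % 4))) & 0b11) for i in range(count)]


colours = [Colour.RED, Colour.BLUE, Colour.GREEN, Colour.GREEN, Colour.BLUE]

# Textual form, as the values might appear in character data.
text_form = ",".join(c.name.lower() for c in colours).encode("ascii")

# Binary form: two bits per value.
packed = pack_colours(colours)
```

Comparing two `Colour` values is a single integer comparison, whereas comparing the strings requires examining their characters one by one.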
From U.S. Pat. No. 6,311,223, granted 30 Oct. 2001 to Boden et al., there is known a system involving compression of a markup language, particularly, hypertext markup language (HTML) by tokenization of tags and removal of comments.
Compression techniques for XML-based data have been proposed and may be based on well-known string compression algorithms. An XML-specific proposal, known as WAP binary XML (WBXML), Wireless Application Protocol Binary XML Content Format Specification, Version 1.1, 16 Jun. 1999, WAP Forum, is based on tokenizing well-known strings present in the XML data and detecting strings that occur multiple times within the data, which are then grouped in a string lookup table. Whilst the mechanism of WBXML is efficient, not all protocols based on XML DTDs provide the WBXML encoding table needed for WBXML compression.
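The two ideas described above, tokenizing well-known strings and collecting repeated strings in a lookup table, can be sketched as follows. This is a simplified illustration and not the WBXML wire format: the tag names, the token values in `TAG_TOKENS`, the marker bytes `END` and `TABLE_REF`, and the function `encode` are all assumptions, and the sketch handles only elements containing plain text, with at most 256 distinct strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding table that both parties must hold in advance.
TAG_TOKENS = {"order": 0x05, "item": 0x06}

# Hypothetical marker bytes: end of element, and reference into the string table.
END, TABLE_REF = 0x01, 0x83


def encode(xml_text):
    """Return (string_table, body) for a tiny tag-and-text subset of XML."""
    string_table = []   # each distinct text value is stored once
    body = bytearray()

    def emit(element):
        body.append(TAG_TOKENS[element.tag])   # one byte instead of "<tag>"
        text = (element.text or "").strip()
        if text:
            if text not in string_table:
                string_table.append(text)
            # Two bytes referencing the table instead of the string itself.
            body.extend([TABLE_REF, string_table.index(text)])
        for child in element:
            emit(child)
        body.append(END)                       # one byte instead of "</tag>"

    emit(ET.fromstring(xml_text))
    return string_table, bytes(body)


doc = "<order><item>widget</item><item>widget</item><item>gadget</item></order>"
table, body = encode(doc)
```

A tag missing from `TAG_TOKENS` raises `KeyError` in this sketch, which mirrors the limitation noted above: without an agreed encoding table for the document type, the data cannot be tokenized.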
Other string compression algorithms are known, particularly Lempel-Ziv encoding, J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression”, IEEE Transactions on Information Theory, Vol. 23, pp. 337-342, 1977, and Huffman encoding, D. Huffman, “A Method for the Construction of Minimum Redundancy Codes”, Proc. IRE, Vol. 40, pp. 1098-1101, September 1952. Of these two string compression algorithms, Huffman encoding provides the densest achievable compaction. However, Huffman encoding requires a statistical analysis of the input data before compression takes place. Furthermore, to exchange Huffman-encoded data, both entities exchanging data items must agree in advance on the chosen, data-dependent encoding.
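The dependence of Huffman encoding on a prior statistical analysis can be seen in a minimal sketch. The function names `huffman_code`, `encode` and `decode` and the sample data are assumptions chosen for illustration; the input is assumed to be non-empty, and the code table is represented as a plain dictionary of bit strings rather than in any compact form.

```python
import heapq
from collections import Counter


def huffman_code(data):
    """Build a prefix code {symbol: bitstring} from the symbol counts in `data`."""
    counts = Counter(data)              # the statistical analysis pass
    if len(counts) == 1:                # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(data, code):
    """Concatenate the code words for each symbol."""
    return "".join(code[s] for s in data)


def decode(bits, code):
    """Decode a bit string; the receiver needs the same `code` table."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


tokens = ["red"] * 5 + ["blue"] * 2 + ["green"]
code = huffman_code(tokens)
bits = encode(tokens, code)
```

Note that `decode` cannot work without `code`, which was derived from the data itself. This is the advance agreement on a data-dependent encoding referred to above.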
Whilst much XML data represents free-form data containing arbitrary values, some protocols use XML to encode data which may take only a limited set of values. For example, the WAP Forum-defined “push access protocol” (PAP), Wireless Application Protocol, Push Access Protocol Specification, November 1999, WAP Forum, uses an XML DTD to specify the protocol headers. More particularly, it is possible to associate a quality-of-service element with a push message, where the quality-of-service fields are represented as name-value pairs with both names and values represented as strings. Given the limited number of options for most values, and given that the set of attribute names is known a priori, such an encoding is sub-optimal from an information-theoretic point of view.
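The gap between the textual encoding and the information actually conveyed can be estimated with a short calculation. The field names and value sets in `QOS_FIELDS` are illustrative assumptions and are not taken from the PAP specification; the helper names are likewise hypothetical. The fixed-width figure assumes that both parties know the set and order of fields in advance, so that names need not be transmitted, and that all values of a field are equally likely.

```python
import math

# Hypothetical quality-of-service fields and value sets, for illustration only.
QOS_FIELDS = {
    "priority": ["low", "medium", "high"],
    "delivery-method": ["confirmed", "unconfirmed", "notspecified"],
}


def minimum_bits(value_count):
    """Whole bits needed to distinguish `value_count` equally likely values."""
    return math.ceil(math.log2(value_count)) if value_count > 1 else 0


def textual_bits(name, value):
    """Bits used by name="value" with 8-bit characters (name, =, two quotes, value)."""
    return 8 * (len(name) + len(value) + 3)


def compare(fields, chosen):
    """Return (textual_bits, fixed_width_bits) for one set of chosen values."""
    text = sum(textual_bits(n, v) for n, v in chosen.items())
    fixed = sum(minimum_bits(len(fields[n])) for n in chosen)
    return text, fixed


chosen = {"priority": "high", "delivery-method": "confirmed"}
text_bits, fixed_bits = compare(QOS_FIELDS, chosen)
```

Under these assumptions the two name-value pairs occupy 336 bits as text, against 4 bits for a fixed-width encoding of the same choices.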
However, the approaches described above have disadvantages. For example, WBXML compression is not suitable for protocols based on XML DTDs that do not provide a WBXML encoding table, since the encoding tables must be disseminated before communicating entities can exchange compressed XML. The dense compaction achievable with Huffman encoding depends on a statistical analysis of the input data, which is required before compression takes place. A plain, textual XML representation of PAP protocol entities is sub-optimal from an information-theoretic point of view, since PAP requires textual strings to encode values drawn from an often small and predefined value set.
A need therefore exists for a scheme for processing markup language information wherein the above-mentioned disadvantages may be alleviated.