When data is transmitted between a sender and a recipient (e.g. a server and a client) over a network, both the sender and recipient must know the format of the data being transmitted before the transmission takes place. For example, if the sender sends data in a form for a specific database, the recipient must know which database format is being used, and the details of that format, in order to use the data. If the recipient does not know which format is being used or the details of that format, data sent properly on the sender's end will be unrecognizable on the recipient's end.
As an example, a database format may comprise a series of records, where each record contains a record number of a certain size, followed by a last name field of a certain size, a first name field of a certain size, and a date field of a certain size. A header might precede these records. However, even if the sender sends data adhering perfectly to the format, unless the recipient knows the format, there is no way for the recipient to understand the data correctly.
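The dependence on a shared layout can be sketched in Python. This is only an illustration: the field sizes (a 4-byte record number, 10-byte name fields, an 8-byte date) and the names RECORD, pack_record, and unpack_record are assumptions, not part of any particular database format.

```python
import struct

# Hypothetical fixed-size layout (all sizes are assumptions):
# 4-byte record number, 10-byte last name, 10-byte first name, 8-byte date.
RECORD = struct.Struct(">I10s10s8s")

def pack_record(number, last, first, date):
    """Sender side: serialize one record; short strings are null-padded."""
    return RECORD.pack(number, last.encode("ascii"),
                       first.encode("ascii"), date.encode("ascii"))

def unpack_record(raw):
    """Recipient side: works only if the recipient knows the same layout."""
    number, last, first, date = RECORD.unpack(raw)
    clean = lambda b: b.rstrip(b"\x00").decode("ascii")
    return number, clean(last), clean(first), clean(date)

raw = pack_record(1, "Smith", "John", "20240101")
record = unpack_record(raw)

# A recipient that assumes different field sizes reads the same bytes
# without error, but the field boundaries are wrong.
misread = struct.Struct(">I8s12s8s").unpack(raw)
```

With the matching layout the record is recovered intact; with the mismatched layout the first name field begins with padding bytes that belong to the last name field, even though the sender's bytes were well formed.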
To ensure that both the sender and the recipient have the necessary information about format, they often need to be running not only the same application, but the same version of the application. For example, if a sender sent data from a newer version of a database application to a recipient running an older version, the recipient's version may not recognize the format and, as discussed above, the data may be lost or useless.
To help solve these problems and increase flexibility in transmissions, extensible markup language (XML), a markup language based on Standard Generalized Markup Language (SGML), was developed. A markup language is a language that allows content to be provided along with meta-content such as style, syntax, and semantic information in a structured way. XML is termed extensible because it is not a fixed format markup language. By contrast, HTML (hypertext markup language) is a fixed format markup language, defining one format. XML is actually a metaformat, a language which allows the user to describe other formats. This allows a user to design a markup language and then to express it in XML. Thus XML provides a standardized data storage format that allows flexibility in format and can therefore facilitate interaction between sender and recipient even in the absence of pre-agreement on a strict format. To accomplish this, XML uses a text based tag system similar to that of HTML to describe and store data in a structured manner. For example, a database entry for an employee record might be represented in XML format as follows:
<employee>
    <firstname>John</firstname>
    <lastname>Smith</lastname>
</employee>
This XML data includes two kinds of elements: tag elements, which begin and end with angled brackets (e.g. start tags such as “<firstname>” and end tags such as “</firstname>”), and data elements (e.g. “John”). As shown, in an XML document, start and end tags can be nested within other start and end tags. Every element nested within a particular element has both its start tag and its end tag before the end tag of that particular element. This defines a tree-like structure.
The example XML above includes data elements “John” and “Smith” but also includes information (in the tag elements) indicating that data element “John” is a firstname, and that it is also part, along with lastname “Smith” of an employee record. If a sender transmits this XML file, any applications that recognize XML would be able to read this employee record, retrieve the data and understand its components.
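As a minimal sketch of how a generic XML-aware application could read this record without prior knowledge of an employee format, the following uses Python's standard xml.etree.ElementTree module. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The employee record from the example above, written without indentation.
doc = ("<employee><firstname>John</firstname>"
       "<lastname>Smith</lastname></employee>")

root = ET.fromstring(doc)  # parse the text into an element tree

# The tags themselves say what each data element means, so the reader
# can build a field-name-to-value mapping without a predefined layout.
record = {child.tag: child.text for child in root}
```

Here root.tag identifies the record as an employee, and record maps each nested tag name to its data element.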
While XML does not require a recipient to know in advance which file format is being used and the details of the file format, it does have drawbacks. In particular, the file being sent is bulky due to the large number of tag elements used to describe the data. In fact, XML files can average 2 to 10 times the size of a normal data file. These larger file sizes slow the transmission of data and also require longer processing times. Therefore, transmitting and consuming XML can be very expensive.
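The overhead can be illustrated with the employee record itself. The sketch below compares the length of the XML text with the length of the bare data elements; treating the concatenated data elements as the "normal data file" is an assumption made only for this comparison.

```python
# The employee record as XML text, written without indentation.
xml_text = ("<employee><firstname>John</firstname>"
            "<lastname>Smith</lastname></employee>")

# The data elements alone, with all tags removed (an assumed baseline).
payload = "John" + "Smith"

ratio = len(xml_text) / len(payload)  # how many times larger the XML is
```

For this one small record the XML text is 74 characters against 9 characters of data, a ratio of a little over 8. Real files will differ depending on how long the tags are relative to the data they enclose.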
To balance the competing interests of flexibility with faster transmission and small file size, techniques referred to as binary XML can be used. Although binary XML techniques vary, two features are common to each binary XML format.
First, binary XML formats stream binary values rather than character-based values. Second, binary XML formats “tokenize” the XML tags by replacing the tag with a shorter token. For example, a binary XML format could assign the following binary representations for the tags shown above:
1: <employee>
2: </employee>
3: <firstname>
4: </firstname>
5: <lastname>
6: </lastname>
The record shown above could then be rendered as:
1 3 John 4 5 Smith 6 2
(The numbers shown above would be rendered in binary form; indentation is not meaningful and is used merely to enhance comprehension when the markup-language document is displayed.) Substituting such token representations for the text based tags can yield a compressed file which may be one-quarter or one-third of the size of the original XML file. The tokenization of tags occurs either according to pre-defined token/tag substitutions known to both sender and recipient (a “static dictionary”) or according to definitions which are sent as part of the file transmitted (a “dynamic dictionary”).
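A static-dictionary tokenizer of this kind can be sketched as follows. This is not any specific binary XML format: the one-byte tokens follow the example assignments above, while the use of byte 0 as a marker for length-prefixed text, the limit of 255 bytes per text run, and the names STATIC_DICT, encode, and decode are assumptions made for illustration.

```python
import re

# Static dictionary from the example above: tag text -> one-byte token.
STATIC_DICT = {
    "<employee>": 1, "</employee>": 2,
    "<firstname>": 3, "</firstname>": 4,
    "<lastname>": 5, "</lastname>": 6,
}
REVERSE = {token: tag for tag, token in STATIC_DICT.items()}
TEXT = 0  # assumed marker: a length byte and that many text bytes follow

def encode(xml_text):
    out = bytearray()
    # Split the input into tags and the text runs between them.
    for part in re.findall(r"<[^>]+>|[^<]+", xml_text):
        if part.startswith("<"):
            out.append(STATIC_DICT[part])  # KeyError for an unknown tag
        elif part.strip():                 # skip whitespace-only runs
            data = part.encode("utf-8")
            if len(data) > 255:
                raise ValueError("text run too long for this sketch")
            out += bytes([TEXT, len(data)]) + data
    return bytes(out)

def decode(blob):
    parts, i = [], 0
    while i < len(blob):
        if blob[i] == TEXT:
            n = blob[i + 1]
            parts.append(blob[i + 2:i + 2 + n].decode("utf-8"))
            i += 2 + n
        else:
            parts.append(REVERSE[blob[i]])
            i += 1
    return "".join(parts)

xml_text = ("<employee><firstname>John</firstname>"
            "<lastname>Smith</lastname></employee>")
encoded = encode(xml_text)
```

In this toy case the 74-character record encodes to 19 bytes, and decode(encoded) reproduces the original text. Both sides must hold the same STATIC_DICT; a dynamic dictionary would instead transmit the tag/token pairs ahead of the tokenized data.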
Although the file size is smaller, there are still drawbacks to binary XML techniques. First, there may be redundant substitutions which make the technique inefficient. For example, if a number is used as a tag in an uncompressed XML file, a binary XML technique may encode it as a different number, which must then be decoded; this saves no space but incurs encoding and decoding costs. In addition, the data, even when using a binary XML technique, is not fully compressed to the smallest file size because many tags are repeated. This can be illustrated by the case in which many data records which use the same tags are contained in a single XML file. In such a case, even though a text based tag like <lastname> may be replaced by a numeric value when encoded, there will still be multiple instances of the same tag being repeated.
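The repetition can be made concrete by counting tags in a file holding many records. In the sketch below the figure of 1000 identical records is a hypothetical chosen for illustration.

```python
import re
from collections import Counter

record = ("<employee><firstname>John</firstname>"
          "<lastname>Smith</lastname></employee>")
document = record * 1000  # hypothetical file containing 1000 records

# Count how often each distinct tag appears in the file.
tag_counts = Counter(re.findall(r"<[^>]+>", document))

# Tokenizing shortens each tag but does not reduce how many there are:
# one token is still emitted for every tag occurrence.
tokens_emitted = sum(tag_counts.values())
```

Only six distinct tags occur, yet each appears 1000 times, so a tokenized file still carries 6000 tokens that describe the same repeated structure.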
Thus, there is a need for a technique to encode data more efficiently and into smaller file sizes.