XML (acronym for “Extensible Markup Language”) is a syntax for defining computer languages. XML thus makes it possible to create languages that are adapted to different uses but that may be processed by the same tools. An XML document is composed of elements, each element starting with an opening tag comprising the name of the element (for example “<tag>”) and ending with a closing tag which also comprises the name of the element (for example “</tag>”). Each element can contain other elements or text data. Furthermore, an element may be specified by attributes, each attribute being defined by a name and having a value. The attributes are placed in the opening tag of the element they specify (for example ‘<tag attribute="value">’). XML syntax also makes it possible to define comments (for example “<!--Comment-->”) and processing instructions, which may specify to a computer application what processing operations to apply to the XML document (for example “<?myprocessing?>”).
Elements, attributes, text data, comments and processing instructions are grouped together under the generic name of “node”.
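As an illustration of the node kinds listed above, the following minimal sketch walks a small document with Python's standard `xml.dom.minidom` parser and reports each node it meets. The sample document and the helper name `list_nodes` are assumptions made for this example only.

```python
from xml.dom import minidom, Node

# Human-readable labels for the node kinds named in the text.
KINDS = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

# Hypothetical sample reusing the inline examples from the text.
SAMPLE = '<?myprocessing?><tag attribute="value">text<!--Comment--><child/></tag>'


def list_nodes(xml_text):
    """Return (kind, label) pairs for every node, in document order."""
    found = []

    def visit(node):
        kind = KINDS.get(node.nodeType)
        if kind in ("element", "processing instruction"):
            found.append((kind, node.nodeName))
        elif kind in ("text", "comment"):
            found.append((kind, node.data))
        # Attributes live in the opening tag of the element they specify.
        if node.nodeType == Node.ELEMENT_NODE:
            for name, value in node.attributes.items():
                found.append(("attribute", f"{name}={value}"))
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return found


if __name__ == "__main__":
    for kind, label in list_nodes(SAMPLE):
        print(kind, label)
```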
Several different XML languages may contain elements of the same name. To be able to mix several different XML languages, an addition has been made to XML syntax making it possible to define “Namespaces”. Two elements are identical only if they have the same name and are situated in the same namespace. A namespace is defined by a URI (acronym for “Uniform Resource Identifier”), for example “http://canon.crf.fr/xml/mylanguage”. A namespace is used in an XML document by defining a prefix, which is a shortcut to the URI of that namespace. This prefix is defined using a specific attribute (for example ‘xmlns:ml="http://canon.crf.fr/xml/mylanguage"’ associates the prefix “ml” with the URI “http://canon.crf.fr/xml/mylanguage”). Next, the namespace of an element or of an attribute is specified by having its name preceded by the prefix associated with the namespace followed by ‘:’ (for example ‘<ml:balise ml:attribut="valeur">’).
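The identity rule for namespaced elements can be checked with a short sketch based on Python's standard `xml.etree.ElementTree`, which expands each prefixed name to the form `{URI}name`. The helper `qualified_name` and the second prefix `x` are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

URI = "http://canon.crf.fr/xml/mylanguage"


def qualified_name(xml_text):
    """Return the root element's name with its prefix expanded to the URI."""
    return ET.fromstring(xml_text).tag


# Same URI bound to two different prefixes: the elements are identical.
with_ml = f'<ml:balise xmlns:ml="{URI}" ml:attribut="valeur"/>'
with_x = f'<x:balise xmlns:x="{URI}" x:attribut="valeur"/>'
# Same local name but no namespace: a different element.
no_namespace = '<balise attribut="valeur"/>'

if __name__ == "__main__":
    print(qualified_name(with_ml))
    print(qualified_name(with_ml) == qualified_name(with_x))
    print(qualified_name(with_ml) == qualified_name(no_namespace))
```

The prefix itself carries no meaning; only the URI it stands for takes part in the comparison.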
XML has numerous advantages and has become a standard for storing data in a file or for exchanging data. In particular, numerous tools are available for processing XML files. Furthermore, an XML document may be manually edited with a simple text editor. Moreover, as an XML document contains its structure integrated with the data, such a document is very readable even without knowing the specification.
The main drawback of the XML syntax is that it is very verbose. The size of an XML document may thus be several times greater than the inherent size of the data. This large size leads to a long processing time when XML documents are generated and especially when they are read.
To mitigate these drawbacks, other methods for coding an XML document have been sought. The object of these methods is to code the content of the XML document in a more efficient form, while enabling the XML document to be easily reconstructed. However, most of these methods do not maintain all the advantages of the XML format. Among these methods, the simplest consists of coding the structural data in a binary format instead of using a text format. Furthermore, the redundancy of the structural information in the XML format may be eliminated or at least reduced (for example, it is not necessarily useful to specify the name of the element in both the opening tag and the closing tag).
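The idea of a binary structural coding that drops the element name from the closing tag can be sketched as follows. This is only an illustrative format invented for this example, not an existing standard: the one-byte opcodes `START`, `TEXT` and `END`, the two-byte length prefix, and the function names are all assumptions, and attributes and text following a child element are deliberately ignored to keep the sketch short.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical one-byte opcodes for this sketch.
START, TEXT, END = 1, 2, 3


def _put_string(out, text):
    """Append a UTF-8 string preceded by its length on two bytes."""
    data = text.encode("utf-8")
    out.extend(struct.pack(">H", len(data)))
    out.extend(data)


def _write(elem, out):
    out.append(START)
    _put_string(out, elem.tag)
    if elem.text:
        out.append(TEXT)
        _put_string(out, elem.text)
    for child in elem:
        _write(child, out)
    out.append(END)  # the closing marker does not repeat the element name


def encode(root):
    """Code an element tree as bytes (attributes and tails are ignored)."""
    out = bytearray()
    _write(root, out)
    return bytes(out)


def decode(data):
    """Rebuild the element tree from the bytes produced by encode()."""
    pos, stack, root = 0, [], None
    while pos < len(data):
        op = data[pos]
        pos += 1
        if op == END:
            root = stack.pop()  # the last element closed is the root
            continue
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        text = data[pos:pos + length].decode("utf-8")
        pos += length
        if op == START:
            elem = ET.SubElement(stack[-1], text) if stack else ET.Element(text)
            stack.append(elem)
        else:  # TEXT
            stack[-1].text = text
    return root


if __name__ == "__main__":
    xml_text = "<order><item>pen</item><item>ink</item></order>"
    blob = encode(ET.fromstring(xml_text))
    print(len(xml_text.encode("utf-8")), len(blob))
    print(ET.tostring(decode(blob), encoding="unicode"))
```

Because the decoder keeps a stack of open elements, a single `END` byte is enough to know which element is being closed.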
Another method is to use an index table, in particular for the names of elements and attributes which are generally repeated in an XML document. Thus, at the first occurrence of an element name, it is coded normally in the file and an index is associated with it. Next, for the following occurrences of this element name, the index is used instead of the complete string, reducing the size of the document generated, but also facilitating the reading (there is no longer need to read the complete string in the file, and furthermore, the determination of the read element may be carried out by a comparison of integers instead of a comparison of strings of characters).
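The index-table mechanism described above can be sketched on a plain sequence of element names. In this minimal example, a name is emitted as a string at its first occurrence and as an integer index afterwards; the function names and the mixed string/integer token list are assumptions chosen for readability, not a real file format.

```python
def encode_names(names):
    """Replace repeated names by the index assigned at their first occurrence."""
    table = {}   # name -> index
    tokens = []
    for name in names:
        if name in table:
            tokens.append(table[name])   # later occurrence: index only
        else:
            table[name] = len(table)     # first occurrence: assign next index
            tokens.append(name)          # and code the name in full
    return tokens


def decode_names(tokens):
    """Rebuild the original names; the table is reconstructed while reading."""
    table = []   # index -> name
    names = []
    for token in tokens:
        if isinstance(token, int):
            names.append(table[token])   # integer lookup, no string comparison
        else:
            table.append(token)
            names.append(token)
    return names


if __name__ == "__main__":
    names = ["item", "price", "item", "price", "item"]
    tokens = encode_names(names)
    print(tokens)
    print(decode_names(tokens) == names)
```

The decoder needs no table transmitted separately, since it assigns indexes in the same order as the encoder.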
A second set of methods relies on the use of patterns detected in the XML document to be coded. These patterns represent pieces of structural information and certain pieces of content information of the XML document. The object of these methods is to code each repeated pattern only once, to avoid coding the same information several times in the XML document.
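One possible reading of such a pattern-based approach is sketched below, under strong simplifying assumptions: a pattern is taken to be the purely structural shape of each child of the root (element names and nesting), each distinct shape is stored once, and every occurrence is then coded as a pattern index plus its text values. The helper names are hypothetical, and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def signature(elem):
    """Structural shape of a subtree: its name and the shapes of its children."""
    return (elem.tag, tuple(signature(child) for child in elem))


def values(elem):
    """Text values of a subtree, in document order ('' when an element has none)."""
    collected = [elem.text or ""]
    for child in elem:
        collected.extend(values(child))
    return collected


def encode_with_patterns(root):
    """Return (patterns, records): each distinct shape once, then index + values."""
    patterns = []
    index_of = {}
    records = []
    for child in root:
        shape = signature(child)
        if shape not in index_of:
            index_of[shape] = len(patterns)
            patterns.append(shape)
        records.append((index_of[shape], values(child)))
    return patterns, records


if __name__ == "__main__":
    doc = "<list><p><n>a</n><v>1</v></p><p><n>b</n><v>2</v></p></list>"
    patterns, records = encode_with_patterns(ET.fromstring(doc))
    print(patterns)
    print(records)
```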
These coding methods are efficient for XML documents containing numerous repetitions of structures that are identical or very similar. However, these methods of pattern creation are not optimal for the coding of an XML document, in terms of the size of the coded document.