The XML format is a syntax for defining computer languages. It makes it possible to create languages adapted to different uses that can nevertheless be processed by the same tools.
An XML document is composed of elements, each element beginning with an opening tag comprising the name of the element (for example: <tag>) and ending in a closing tag also comprising the name of the element (for example: </tag>). Each element can contain other elements or text data.
An element can also be specified by attributes, each attribute being defined by a name and having a value. The attributes are then placed in the opening tag of the element that they specify (for example: <tag attribute="value">).
XML syntax also makes it possible to define comments (for example: “<!--Comment-->”) and processing instructions, which may specify to a computer application which processing operations to apply to the XML document (for example: “<?myprocessing?>”).
In XML terminology, the terms “element”, “attribute”, “text data”, “comment”, “processing instruction” and “escape section” are grouped together under the generic term “item”. More generally, each such item (for example, the element delimited by an opening tag and a closing tag) may also be designated by the generic term “node”.
To process an XML document, it must first be read into memory. Two families of methods for reading an XML document exist.
The first family of methods consists of representing the entire XML document in memory, in the form of a tree. These methods afford easy and rapid access to any node or any part of the XML document, but require a large amount of memory space. One example of these methods is the DOM (“Document Object Model”) programming interface.
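By way of illustration only, the tree-based approach can be sketched with the DOM implementation of the Python standard library (xml.dom.minidom). The sample document DOC and the function name describe_tree are assumptions introduced for this example. The whole document is held in memory, so any node can be reached directly.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document built from the <tag> examples given above.
DOC = '<tag attribute="value"><child>text</child><!--Comment--></tag>'

def describe_tree(xml_text):
    """Load the entire document as a tree and list its nodes in document order."""
    dom = parseString(xml_text)
    found = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            found.append(("element", node.tagName))
        elif node.nodeType == node.TEXT_NODE:
            found.append(("text", node.data))
        elif node.nodeType == node.COMMENT_NODE:
            found.append(("comment", node.data))
        for child in node.childNodes:
            walk(child)

    walk(dom.documentElement)
    return found

NODES = describe_tree(DOC)
# Direct access to any part of the tree, without re-reading the document.
ATTRIBUTE_VALUE = parseString(DOC).documentElement.getAttribute("attribute")
```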
A second family of methods consists of representing each node of the XML document by one or more events. The entire XML document is then described by the succession of these events. These methods make it possible to process an XML document as it is read (streaming mode).
One advantage of these methods lies in the small amount of memory space that they require for processing a document. Nevertheless they require navigation in the document solely in the order of reading it. Examples of these methods are the SAX (“Simple API for XML”) and StAX (“Streaming API for XML”) programming interfaces.
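By way of illustration only, the event-based approach can be sketched with the SAX implementation of the Python standard library (xml.sax). The class name EventRecorder, the function name read_events and the sample document are assumptions introduced for this example. The handler receives the events strictly in reading order and never holds the whole document as a tree.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Record each node of the document as one or more events, in reading order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # A parser may deliver text in several chunks; whitespace-only chunks are skipped.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

def read_events(xml_text):
    """Parse the document in streaming mode and return the recorded events."""
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

# Hypothetical sample document built from the <tag> examples given above.
EVENTS = read_events('<tag attribute="value"><child>text</child></tag>')
```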
The XML format has many advantages and has become a standard for storing data in a file or for exchanging data. First of all, numerous tools are available for processing XML files. Also, an XML document can be edited manually with a simple text editor. In addition, since an XML document carries its structure along with its data, it remains highly legible even to a reader who does not know its specification.
Nevertheless, the main drawback of XML syntax is its verbosity. The size of an XML document may thus be several times greater than the intrinsic size of the data. This large size also gives rise to long processing times when XML documents are generated and, in particular, when they are read.
To mitigate these drawbacks, mechanisms have been put in place to code the content of an XML document in a more efficient form from which the XML document can easily be reconstructed. However, the majority of these mechanisms do not keep all the advantages of the XML format. There nevertheless exist new formats that make it possible to store the data contained in an XML document. These various formats are grouped together under the term “Binary XML”.
Among these formats, EXI (“Efficient XML Interchange”) is currently being standardized by the W3C (“World Wide Web Consortium”, an organization producing standards for the web) and makes it possible to code an XML document in a binary form.
This format uses dictionaries for coding the various parts of an XML document.
Some of these dictionaries are said to be “global” in that they concern the coding of the whole of the document, such as for example the vocabulary dictionary for coding URIs (“Uniform Resource Identifiers”) or the global dictionary of values.
Other dictionaries are said to be “local”: for example, a vocabulary dictionary for the element local names is associated with each URI. In a similar manner, a dictionary of values is associated with each attribute qualified name. A local dictionary is thus used solely when the URI, the attribute qualified name, etc., associated with the dictionary concerns the portion of the XML document to be coded. The local dictionary so used is, at the time of its use, the current dictionary of values.
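By way of illustration only, the interplay between a global dictionary of values and the local dictionaries of values can be sketched as follows. This is a simplified model and not the coding procedure of the EXI specification: the class name ValueDictionaries, the method name code and the returned labels are assumptions introduced for this example.

```python
from collections import defaultdict

class ValueDictionaries:
    """Simplified model: one global dictionary of values, one local dictionary per qualified name."""

    def __init__(self):
        self.global_values = {}                  # value -> index, shared by the whole document
        self.local_values = defaultdict(dict)    # qualified name -> {value -> index}

    def code(self, qname, value):
        """Return how the value would be represented for the given qualified name."""
        local = self.local_values[qname]         # the current dictionary of values
        if value in local:
            return ("local", local[value])
        if value in self.global_values:
            return ("global", self.global_values[value])
        # First occurrence: the value is coded literally and indexed in both dictionaries.
        self.global_values[value] = len(self.global_values)
        local[value] = len(local)
        return ("literal", value)

TABLES = ValueDictionaries()
RESULTS = [
    TABLES.code("color", "red"),    # first occurrence anywhere
    TABLES.code("color", "red"),    # found in the local dictionary of "color"
    TABLES.code("shade", "red"),    # unknown locally for "shade", found globally
]
```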
Finally, dictionaries of local structures are also used for coding the structure of the XML document. These dictionaries make it possible to code the type of each item of the XML document: attribute, element opening tag, etc. These dictionaries of structures depend on the parent element of the item to be coded and may depend on the items preceding the item to be coded within this parent element. These dictionaries of structures are generally called “grammars” in the EXI specification.
Still according to the EXI specification, a grammar is composed of a set of productions, each production comprising an XML event description, an associated coding value and an indication of the following grammar to be used (for coding the following event). Since this indication determines the passage from one grammar to the next, there is generally only one current grammar at any given moment in the coding or decoding processing operations.
A grammar according to EXI may evolve. In a certain number of cases, after the occurrence of an XML event already described by a production of the grammar (if it is not described by a production, it cannot be encoded by the grammar), the grammar is modified in order to include a new production corresponding to this XML event. This production can either contain a more precise description of the event, reducing the number of items of information to be coded in order to represent the event, or have a more compact coding value.
The coding values, or “numerical codes”, are expressed in the form of “priorities” generally having between 1 and 3 levels. Coding a coding value amounts to coding the value of each level of its priority. In the coding mode that is most advantageous in terms of compression, each level is coded in the minimum number of bits needed to code the largest value of this level associated with a production of the grammar. For example, for a level taking values from 0 to 6, 3 coding bits are used.
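By way of illustration only, the minimum number of bits for one level can be computed from the largest value that the level may take. The function name bits_for_level is an assumption introduced for this example.

```python
def bits_for_level(largest_value):
    """Minimum number of bits able to code every value from 0 to largest_value."""
    # int.bit_length() gives the number of bits of the binary form (0 for the value 0).
    return largest_value.bit_length()

# A level taking values from 0 to 6 needs 3 bits; a level whose only value is 0 needs none.
BITS_0_TO_6 = bits_for_level(6)
BITS_SINGLE_VALUE = bits_for_level(0)
```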
In the remainder of the description, the term “dictionary” is used to designate generically the various dictionaries used during the coding or decoding of a document: vocabulary dictionary, dictionary of values or dictionary of structures.
Although the remainder of the description concentrates on the EXI format, as above, since the invention is particularly well suited to this format, the invention is not limited to this coding format. The invention can also apply to other binary XML formats, or be used between several binary XML formats.
By way of example, the Fast Infoset format, a binary ITU-T and ISO format, uses in particular binary indicators for describing the various nodes contained in the XML document, as well as index tables (dictionaries according to the terminology adopted above) for the names of elements, the names of attributes, the values of attributes and the text values.
In order to be able to adapt to different scenarios, the coding formats, and in particular the EXI format, propose several coding options.
Thus, for example, the local structure dictionaries can be created either dynamically during the coding of the document (or during its decoding) or in advance from a document structural description file (referred to as XML Schema).
The same applies for the dictionaries of local names of elements which are either created dynamically during the coding of the document or filled in, in advance, from an XML Schema.
XML Schemas are also the subject of a specification of the W3C and are provided for describing the structure of a family of XML documents. This description is itself produced in XML language.
This specification is divided into two parts, a first that corresponds to the description of the structure of an XML document and a second that corresponds to the description of the types of data that can be used for the contents of an XML document.
Thus an XML Schema makes it possible to describe the structure of an XML element by defining its name, the list of its attributes and their respective types, and the content of this element. This content can be composed of text, other elements/items or a mixture of the two.
The other elements are organized in groups, a group being able to contain other groups nested therein and define constraints on the order of appearance of the elements that it contains.
In addition, for each element, its number of occurrences within its own group can be defined. A group may, for example, be a sequence, in which all the elements appear in the order indicated in the XML Schema, a choice in which only one of the listed elements appears, or a complete group (“all”) in which all the elements appear once and once only, in any order.
Depending on whether they are created dynamically or from an XML Schema, the local dictionaries are different. They may not include the same entries (productions in the case of a grammar), or contain them in different orders, and in the case of structure dictionaries the number of dictionaries (grammars) corresponding to a given XML element may be different.
The prior constitution of the local structure dictionaries at the encoder or decoder makes it possible to accelerate the subsequent processing operations of encoding or decoding documents substantially in accordance with the description supplied in the XML Schema, since some entries have already been created.
FIG. 1 shows an extract of an XML Schema 1. The latter describes an element “person” representing a person. According to the XML Schema, this element comprises several sub-elements 2 as follows: first of all an obligatory sub-element “name”, and then one or more “address” sub-elements and finally zero or several “phone” sub-elements. In summary, any “person” element in accordance with the Schema comprises a “name” element necessarily followed by at least one “address” element. This “address” element can be followed by other “address” elements. These “address” elements can be followed either by one or more “phone” elements or by no other element.
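By way of illustration only, the content model described above (one “name”, then one or more “address”, then zero or more “phone”) can be checked against a list of sub-element names with a regular expression. This is not an XML Schema validator; the names PERSON_CONTENT and is_valid_person are assumptions introduced for this example.

```python
import re

# Content model of "person": name, then address one or more times, then phone zero or more times.
PERSON_CONTENT = re.compile(r"name( address)+( phone)*")

def is_valid_person(children):
    """Tell whether a list of sub-element names follows the order required for "person"."""
    return PERSON_CONTENT.fullmatch(" ".join(children)) is not None

VALID = is_valid_person(["name", "address", "address", "phone"])
MISSING_ADDRESS = is_valid_person(["name", "phone"])
```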
FIG. 2 shows the four dictionaries of structures (grammars) 100 to 103, in accordance with the EXI specification, which are obtained from the XML Schema in FIG. 1.
For each dictionary 100 to 103 (“person_0”, “person_1” . . . ), the list of entries (productions) is specified as follows: the type of event corresponding to the entry (SE for “Start Element”, that is to say the start of a sub-element, the name of which is specified between parentheses, EE for “End Element”, that is to say for the end of the “person” element), the following dictionary to be used when the entry is used for coding (or decoding) and the index (coding value) for coding the use of the entry in the coded document.
Thus the first dictionary, “person_0”, contains a single entry, which corresponds to the start of a “name” element, the following dictionary to be used being the dictionary “person_1” and the index for coding the start of the “name” element being “0”.
In accordance with certain options of the EXI specification, the coding of this index, during an occurrence of the “name” element in an XML document to be coded, is carried out on a minimum number of bits. Thus in this case, as there is only one index in the dictionary of structures, this index is coded in 0 bits (since it is completely predictable). On the other hand, in the case of the third dictionary, “person_2”, as there are three indices, these indices are coded in 2 bits.
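By way of illustration only, the use of the four structure dictionaries during coding can be sketched as follows. The content of “person_0” and the number of entries of “person_2” come from the description above; the remaining entries, their order and their indices are an assumed reconstruction of FIG. 2 derived from the XML Schema of FIG. 1. Only the level of the “person” element is modeled, and the names GRAMMARS, bits_for and code_events are assumptions introduced for this example.

```python
# Assumed reconstruction of the four structure dictionaries of FIG. 2.
# Each entry is (event, following dictionary); its position in the list is its index.
GRAMMARS = {
    "person_0": [("SE(name)", "person_1")],
    "person_1": [("SE(address)", "person_2")],
    "person_2": [("SE(address)", "person_2"), ("SE(phone)", "person_3"), ("EE", None)],
    "person_3": [("SE(phone)", "person_3"), ("EE", None)],
}

def bits_for(entries):
    """Bits needed to code any index of a dictionary (0 bits when it has a single entry)."""
    return (len(entries) - 1).bit_length()

def code_events(events, start="person_0"):
    """Return the (index, number of bits) coded for each event, following the dictionaries."""
    current = start
    coded = []
    for event in events:
        entries = GRAMMARS[current]
        # Raises ValueError when the event is not described by the current dictionary.
        index = [name for name, _ in entries].index(event)
        coded.append((index, bits_for(entries)))
        current = entries[index][1]
    return coded

# A "person" with one name, two addresses and one phone.
CODED = code_events(["SE(name)", "SE(address)", "SE(address)", "SE(phone)", "EE"])
TOTAL_BITS = sum(bits for _, bits in CODED)
```

Under these assumptions, the first two events cost no bits because each of the dictionaries “person_0” and “person_1” has a single entry, whereas each event coded with “person_2” costs 2 bits.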
One drawback of this encoder or decoder initialization mechanism lies in the need for a large amount of memory space for storing all the structure dictionaries.
This is because, as can be seen in FIG. 2, the information (grammars) resulting from the conversion of the description data (XML Schema) at the decoder or encoder contains entries that appear in several dictionaries, sometimes with different indices. This is in particular the case for the entry “SE (address)”, which appears in the dictionaries “person_1” and “person_2”, and for the entry “SE (phone)”.
A direct representation of the dictionaries of structures from the XML Schema requires, because of this, duplicating this information.
The publications US 2007/239,393 and “Type-based compression of XML Data” (Christopher League et al., March 2007) are also known, which describe the structure information of an XML Schema, not directly in the form of grammars but by means of a finite state machine.
In this finite state machine, the nodes contained in the element described by the XML Schema are described by transitions, while the states of the machine correspond to the positions between these nodes. Thus each node contained in an element is described by at least one transition, the number of transitions describing this node depending on the number of cases in which this node may appear. In addition, for each state, the number of transitions starting from this state corresponds to the number of nodes able to occur in this state.
With respect to the preceding description of the EXI specification, each state, together with the transitions associated with it, represents a structure dictionary. This solution therefore achieves a direct implementation of the EXI specification seen above.
During coding (or decoding), the finite state machine is used for coding the structure information: from a given state, the following node type is coded according to the number of efferent transitions of this state.
Because of this correspondence between states with their transitions and structure dictionaries, the representation of the content of an element by a finite state machine, that is to say the information converted at the encoder or decoder using the XML Schema, corresponds to the representation of the structure dictionaries of the EXI format. Representation by means of a finite state machine therefore also has the drawback of requiring a large amount of memory space.
In general terms, the initial configuration of the encoder or decoder according to the known solutions of the prior art generates a large number of structure dictionaries or equivalents, in which a great deal of information is common and numerous structures are redundant. The result is inefficient use of memory.
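By way of illustration only, a finite state machine for the content of the “person” element can be sketched as a list of transitions. The states, their numbering and the "end" state are assumptions derived from the XML Schema of FIG. 1 and are not taken from the cited publications. Grouping the transitions by starting state yields one group per structure dictionary, and comparing the number of transitions with the number of distinct nodes shows the redundancy.

```python
from collections import defaultdict

# Assumed state machine: states are positions between nodes, transitions are nodes.
# Each transition is (starting state, node, arrival state).
TRANSITIONS = [
    (0, "SE(name)", 1),
    (1, "SE(address)", 2),
    (2, "SE(address)", 2),
    (2, "SE(phone)", 3),
    (2, "EE", "end"),
    (3, "SE(phone)", 3),
    (3, "EE", "end"),
]

def group_by_state(transitions):
    """Group the transitions by starting state; each group plays the role of a structure dictionary."""
    by_state = defaultdict(list)
    for source, node, target in transitions:
        by_state[source].append((node, target))
    return dict(by_state)

def code_node(by_state, state, node):
    """Code a node from a state: index among the efferent transitions, bits used, arrival state."""
    choices = by_state[state]
    index = [name for name, _ in choices].index(node)
    bits = (len(choices) - 1).bit_length()
    return index, bits, choices[index][1]

BY_STATE = group_by_state(TRANSITIONS)
DISTINCT_NODES = {node for _, node, _ in TRANSITIONS}
# 7 transitions are stored to describe only 4 distinct nodes.
REDUNDANT_TRANSITIONS = len(TRANSITIONS) - len(DISTINCT_NODES)
```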
The patent application US 2003/018,466 is also known, which describes an XML data encoding and decoding method starting from a DTD schema. However, in this method, several generators that require memory and processing resources are used to switch between the DTD schema data and an ASN abstract syntax type.
The invention aims to mitigate this problem by optimizing the memory space used during the configuration of the encoder or decoder.