1. Field
The present invention concerns a processing method and device for document coding. It applies, in particular, to the XML language (XML being the acronym for “Extensible Markup Language”). This language is a syntax for defining computer languages. Thus XML makes it possible to create languages that are adapted for different uses but which may be processed by the same tools.
2. Description of Related Art
An XML document is composed of elements, each element starting with an opening tag comprising the name of the element (for example: <tag>) and ending with a closing tag which also comprises the name of the element (for example </tag>). Each element may contain other elements, termed “child elements” (a filiation terminology, “parent”, “child”, being used to describe the relationships between the nested elements) or text data.
Furthermore, an element may be specified by attributes, each attribute being defined by a name and having a value. The attributes are placed in the opening tag of the element they specify (for example <tag attribute=“value”>).
XML syntax also makes it possible to define comments (for example <!--Comment-->) and processing instructions, which may specify to a computer application the processing operations to apply to the XML document (for example “<?myprocessing?>”), as well as escape sections which make it possible to avoid a section of text being interpreted as a tag when it has the form thereof (for example “<![CDATA[<text>Escape</text>]]>” in which <text> is recognized as a string and not as a tag).
In XML terminology, the terms “element”, “attribute”, “text data”, “comment”, “processing instruction” and “escape section” are grouped together under the generic name of “item”. In a more general context, all these terms (forming the element defined between an opening tag and a closing tag) may be grouped together under the generic name of “node”.
Several different languages based on XML may contain elements of the same name. To be able to mix several different languages, an addition has been made to XML syntax making it possible to define “Namespaces”. Two elements are identical only if they have the same name and are situated in the same namespace. A namespace is defined by a URI (acronym for “Uniform Resource Identifier”), for example “http://canon.crf.fr/xml/mylanguage”. A namespace is used in an XML document by defining a prefix, which is a shortcut to the URI of that namespace. This prefix is defined using a specific attribute (for example “xmlns:ml="http://canon.crf.fr/xml/mylanguage"” associates the prefix “ml” with the URI “http://canon.crf.fr/xml/mylanguage”). Next, the namespace of an element or of an attribute is specified by preceding its name with the prefix associated with the namespace followed by “:” (for example “<ml:tag ml:attribute="value">” indicates that the element “tag” arises from the namespace “ml” and that the same applies for the attribute “attribute”).
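By way of illustration only, the association between a prefix and a namespace URI may be observed with a standard XML parser. The following sketch assumes Python and its standard xml.etree.ElementTree module; the document string is a hypothetical example reusing the prefix and URI given above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document reusing the prefix "ml" and the URI from the text.
DOC = '<ml:tag xmlns:ml="http://canon.crf.fr/xml/mylanguage" ml:attribute="value"/>'

root = ET.fromstring(DOC)

# ElementTree replaces each prefix by its URI, giving "{URI}localname".
element_name = root.tag
attribute_names = list(root.attrib)

print(element_name)
print(attribute_names)
```

The parser reports the element and the attribute under names qualified by the URI, not by the prefix, which reflects the fact that the prefix is only a shortcut.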
The XML Schema standard defines a language making it possible to describe the structure of a set of XML documents. An XML Schema document is an XML document, and describes all the elements and attributes that may be present in an XML document in accordance with that XML Schema document, as well as the relationships between those elements and those attributes.
Other systems enable the structure of a set of XML documents to be described, such as DTDs (acronym for “Document Type Definition”) or such as the Relax NG language.
XML has numerous advantages and has become a language of reference for storing data in a file or for exchanging data. XML makes it possible in particular to have numerous tools for processing the files generated. Furthermore, an XML document may be manually edited with a simple text editor. Moreover, as an XML document contains its structure integrated with the data, such a document is very readable even without knowing the specification.
The main drawback of the XML syntax is that it is very verbose. Thus the size of an XML document may be several times greater than the inherent size of the data. This large size of the XML documents thus leads to a long processing time when XML documents are generated and read. It also leads to a long transmission time.
To mitigate these drawbacks, other methods for coding an XML document have been sought. The object of these methods is to code the content of the XML document in a more efficient form, while enabling the XML document to be easily reconstructed. However, most of these methods do not maintain all the advantages of the XML format.
Among these methods, the simplest consists of coding the structural data in a binary format instead of using a text format. Furthermore, the redundancy of the structural information in the XML format may be eliminated or at least reduced (for example, it is not necessarily useful to specify the name of the element in the opening tag and the closing tag).
Another method is to use an index table, in particular for the names of elements and attributes which are generally repeated in an XML document. Thus, at the first occurrence of an element name, it is coded normally in the file and an index is associated with it. Next, for the following occurrences of this element name, the index is used instead of the complete string, reducing the size of the document generated, but also facilitating the reading (there is no longer need to read the complete string in the file, and furthermore, the determination of the read element may be carried out by a comparison of integers instead of a comparison of strings of characters).
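The index table mechanism described above may be sketched as follows. This is a minimal illustration in Python and not the format of any particular standard; the function names code_names and decode_names and the ("literal", ...) and ("index", ...) entries are assumptions made for the example.

```python
def code_names(names):
    """Code each name literally at its first occurrence, by index afterwards."""
    table = {}    # name -> index, filled in as names are encountered
    coded = []
    for name in names:
        if name in table:
            coded.append(("index", table[name]))
        else:
            table[name] = len(table)
            coded.append(("literal", name))
    return coded


def decode_names(coded):
    """Rebuild the names by maintaining the same table as the coder."""
    table = []    # index -> name
    names = []
    for kind, value in coded:
        if kind == "literal":
            table.append(value)
            names.append(value)
        else:
            names.append(table[value])
    return names


coded = code_names(["person", "firstname", "person"])
print(coded)
```

On reading, a repeated name is identified by comparing integers, and the complete string appears only once in the coded data.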
Lastly, beyond these elementary methods, there are more highly developed methods consisting in particular of taking into account a higher number of pieces of structural information of the document in order to further compress the data.
Among others, the case of “Efficient XML” may be cited. This format is used as a basis for the standardization of a binary XML format by the EXI working group of the W3C (EXI being an acronym for “Efficient XML Interchange” and W3C being an acronym for “World Wide Web Consortium”, an organization producing standards for the Web). Efficient XML takes into account the order of appearance of the different items within a document to construct a grammar which makes it possible to code the most frequent items using a small number of bits.
The binary XML format “Fast Infoset” may also be mentioned, which is specified by the standard ITU-T Rec. X.891 | ISO/IEC 24824-1, and which provides a more compact representation of an XML document by using binary codes of items and index tables. In this format, the types of items are described as lists which use binary codes of variable length. Fast Infoset makes intensive use of indexing techniques by creating tables for specific sets of XML information. These tables make it possible to code a given piece of information (an item for example) in a literal manner (for example according to one of the character coding formats UTF8 or UTF16, where UTF is an acronym for “UCS Transformation Format”) the first time that piece of information is encountered during the coding of the document. This piece of information is then added to the indexing table and associated with an index.
Later, when that piece of information is detected again in the XML document, the corresponding index is retrieved from the indexing table and the value of that index is then coded instead of the piece of information. A notable compression of the data may thus be obtained.
A certain number of indexing tables may be noted, among which are:
two tables respectively indexing the prefixes and the URIs in order to define the namespaces;
two specific tables respectively indexing the attribute values and the text node values;
a table indexing the local names of attributes and elements;
two specific tables respectively indexing the qualified names (which group together for example a prefix, a URI and a local name) of elements, and the qualified names of attributes.
It may be noted that the Fast Infoset standard enables the coder to decide whether a particular attribute value or text node value is to be indexed, for example depending on the length of the value or of the string. This makes it possible in particular to limit the size of the memory used by the coder. The decision whether or not to index an attribute value or a text node is then coded in the Fast Infoset stream to enable the associated decoder to index or not index the values to decode.
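The optional indexing decision may be illustrated by the following Python sketch. It does not reproduce the Fast Infoset binary syntax: the length threshold MAX_INDEXED_LENGTH, the function names and the tuple layout are hypothetical, and the flag carried with each literal stands in for the decision coded in the stream.

```python
MAX_INDEXED_LENGTH = 8  # hypothetical threshold chosen by the coder


def code_values(values, max_len=MAX_INDEXED_LENGTH):
    """Code values, indexing only those short enough to be worth storing."""
    table = {}     # value -> index
    stream = []
    for value in values:
        if value in table:
            stream.append(("index", table[value]))
        else:
            add_to_table = len(value) <= max_len
            if add_to_table:
                table[value] = len(table)
            # The decision travels with the literal so the decoder can mirror it.
            stream.append(("literal", value, add_to_table))
    return stream


def decode_values(stream):
    """Decode values, indexing a literal only when the coder said it did."""
    table = []
    values = []
    for entry in stream:
        if entry[0] == "index":
            values.append(table[entry[1]])
        else:
            _, value, add_to_table = entry
            if add_to_table:
                table.append(value)
            values.append(value)
    return values


values = ["red", "a very long description", "red", "a very long description"]
stream = code_values(values)
print(stream)
```

In this sketch the long value is coded literally each time and never occupies a table entry, which bounds the memory used by both the coder and the decoder.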
Returning to “Efficient XML”, it is noted that this standard uses a set of grammars to code an XML document.
To be able to code the items comprised in an XML document, the Efficient XML specification divides each of the “nodes” into elementary parts called events, for example an opening tag. These events are similar to those generated by XML parsers working in streaming mode, that is to say representing an XML document as a data stream, such as the SAX parsers (SAX being the acronym for “Simple API for XML”). Thus, for example, in the Efficient XML specification, an XML node is represented by a start element event (opening tag), a set of events representing its content and an end element event.
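The division of a document into events may be observed with a SAX parser. The following sketch assumes Python and its standard xml.sax module; the class name EventRecorder, the labels “SE”, “CH” and “EE” attached to the callbacks, and the input fragment are choices made for this illustration and are not taken from the Efficient XML specification.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records start element (SE), text (CH) and end element (EE) events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("SE", name))

    def characters(self, content):
        # A parser may deliver one text node in several chunks; merge them.
        if self.events and self.events[-1][0] == "CH":
            self.events[-1] = ("CH", self.events[-1][1] + content)
        else:
            self.events.append(("CH", content))

    def endElement(self, name):
        self.events.append(("EE", name))


# Hypothetical input, written without whitespace between the tags.
FRAGMENT = b"<person><firstname>John</firstname><lastname>Smith</lastname></person>"

recorder = EventRecorder()
xml.sax.parseString(FRAGMENT, recorder)
events = recorder.events
for event in events:
    print(event)
```

Each element thus appears as a start element event, the events of its content, and an end element event.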
When an event is composed of a single item, the event may be treated as equivalent to the item. Thus, in the remainder of the description, the terms “event” and “item” are used interchangeably.
A grammar is composed of a set of productions, each production comprising an XML event (or item) description, an associated coding value and the statement of the following grammar to use. To code an XML event using a grammar, the production containing the most precise description of the XML event is used. The coding value contained in that production is used to represent the event, and the information contained in the event and not described in the production is coded.
Grammars and productions are thus viewed as coding structures of the events or items that they propose to code.
A grammar according to Efficient XML is upgradeable. In a certain number of cases, after the occurrence of an XML event already described by a production of the grammar (if it is not described by a production, it cannot be coded by the grammar), the grammar is modified to include a new more efficient production corresponding to that XML event. This production may either contain a more precise description of the event, reducing the number of pieces of information to code to represent the event, or have a more compact coding value.
The coding values, or “codes”, are expressed in the form of “priorities” having, generally, between 1 and 3 levels. Coding a coding value amounts to coding the values of its priority. Each level is coded over the minimum number of bits to be able to code the highest value of that level associated with a production of the grammar. For example, for a level taking values from 0 to 6, 3 coding bits are used.
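The number of bits used for one level of priority may be computed as in the following Python sketch, assuming that a level takes all integer values from 0 up to its highest value. The function name bits_for_level is hypothetical.

```python
def bits_for_level(highest_value):
    """Minimum number of bits able to represent every value from 0 to highest_value."""
    # int.bit_length() gives the number of bits of the binary representation.
    return highest_value.bit_length()


# A level taking values from 0 to 6 needs 3 bits, as stated in the text.
print(bits_for_level(6))
```

Under the same assumption, a level with the two values 0 and 1 needs one bit, and a level with the values 0 to 3 needs two bits.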
To code an XML document, a set of grammars is used. A few grammars serve for coding the actual structure of the XML document. Furthermore, for each type of XML element present in the document (a type of XML element being a set of elements having the same name), a set of grammars is used to code the XML elements of that type.
The rules of grammars used may either be generic rules, common to all the XML documents and constructed on the basis of the XML syntax, or be rules specific to a type of document, constructed on the basis of an XML Schema describing the structure of that type of document.
On decoding, the inverse process is used: the coding value is extracted and makes it possible to identify the coded XML event, as well as the complementary information to decode.
Furthermore, on decoding, the same grammar evolution rules are used, making it possible at any time to have a set of grammar rules identical to that which was used on coding.
By way of example, the following XML fragment is used to describe the coding of an XML document using the Efficient XML specification:
<person> <firstname>John</firstname> <lastname>Smith</lastname></person>
As the coder has not yet encountered the “person” element or event, a grammar “by default” is created for that element. This is a grammar only containing generic productions. During the coding of the “person” element, new productions are created and inserted to render the grammar linked to the “person” element more effective. The grammar by default that is used to code the content of the “person” element is the following (in simplified manner relative to the Efficient XML specification):
ElementContent:
EE 0
SE (*) ElementContent 1.0
CH ElementContent 1.1
“EE” corresponds to the end element event, “SE (*)” corresponds to some particular start element event (generic, the name is thus not specified), and “CH” corresponds to a text content event.
The grammar thus created is stored in a table, for example in volatile memory of the coder.
On coding, after having received the event corresponding to the start “person” element, “SE (person)” and having coded it, for example literally, the coder selects the coding grammar for the content of the “person” element, described above.
Next, the coder receives the event corresponding to the start “firstname” element, “SE (firstname)”. The production which corresponds to that event in the above grammar is the second:
SE (*) ElementContent 1.0
The coder will thus code the priority “1.0”. As the first level of priority comprises two separate values (“0” and “1”) from among the productions of the grammar, that level may be coded over one bit, with the value “1”. Similarly, the second level of priority comprises two separate values and may be coded over one bit, with the value “0”. The priority “1.0” is thus coded here with the two bits “10”.
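The coding of a priority over its successive levels may be sketched in Python as follows. The representation of priorities as tuples and the function name code_priority are assumptions made for this illustration; the width of each level is taken as the minimum number of bits for the highest value found at that level among the productions sharing the levels already coded.

```python
def code_priority(priority, grammar_priorities):
    """Return the bit string coding one priority of a grammar, level by level."""
    bits = ""
    for level, value in enumerate(priority):
        # Values found at this level among productions sharing the coded prefix.
        peers = [p[level] for p in grammar_priorities
                 if len(p) > level and p[:level] == priority[:level]]
        width = max(peers).bit_length()
        if width:
            bits += format(value, "0{}b".format(width))
    return bits


# Priorities of the default grammar: EE 0, SE (*) 1.0, CH 1.1.
DEFAULT_PRIORITIES = [(0,), (1, 0), (1, 1)]

print(code_priority((1, 0), DEFAULT_PRIORITIES))
```

With the default grammar, each of the two levels of the priority “1.0” occupies one bit, which gives the two bits “10” mentioned above.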
Next, as the production does not specify the name of the element, “firstname” is coded, for example literally, using the production:
CH ElementContent 1.1
The coding of the content of “firstname” is then continued. To that end, the rule associated with that element is searched for. As no “firstname” element has been encountered, a “firstname” grammar is created from the grammar by default. The “firstname” element contains a text node as its sole child. Once this text node has been coded, the grammar of “firstname” is updated by inserting a production for text content (CH).
“firstname” grammar
ElementContent:
Characters 0
EE 1
SE (*) ElementContent 2.0
CH ElementContent 2.1
Once the content of “firstname” has been coded, the coder modifies the grammar associated with the “person” element to adapt the grammar to the XML data encountered. For this, a new production is added to the grammar, this production corresponding to the start “firstname” element. The priority “0” is associated with this production, and the other priorities are offset to maintain the uniqueness of the priorities. It is noted here that as the decoder acts symmetrically, it will be capable of performing similar offsets of priorities (or indices) progressively with the advancement of the decoding of the data received. The grammar thus becomes:
“person” grammar
ElementContent:
SE (firstname) ElementContent 0
EE 1
SE (*) ElementContent 2.0
CH ElementContent 2.1
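The update of the “person” grammar may be sketched in Python as follows. The representation of a grammar as a list of (event, priority) pairs and the function name add_start_element are assumptions made for this illustration; the following grammar of each production is omitted for brevity.

```python
def add_start_element(grammar, name):
    """Return a grammar with SE (name) at priority 0 and the others offset by one."""
    updated = [(("SE", name), (0,))]
    for event, priority in grammar:
        # Only the first level is offset; lower levels are kept unchanged.
        updated.append((event, (priority[0] + 1,) + priority[1:]))
    return updated


# Default grammar: EE 0, SE (*) 1.0, CH 1.1.
DEFAULT_GRAMMAR = [
    (("EE", None), (0,)),
    (("SE", "*"), (1, 0)),
    (("CH", None), (1, 1)),
]

person_grammar = add_start_element(DEFAULT_GRAMMAR, "firstname")
for event, priority in person_grammar:
    print(event, priority)
```

A decoder applying the same function after decoding the same event obtains an identical grammar, so that the priorities remain synchronized on both sides.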
The following event of the XML fragment to code is the start of the “lastname” element. As for “firstname”, this element is coded using the production:
SE (*) ElementContent 2.0
since no production corresponding to the “lastname” element has been found.
As the first level of priority now has three possible values, it is coded over 2 bits, with the value “2”. The second level of priority is still coded over a single bit. The priority “2.0” is thus coded here with the three bits “100”.
The name of the element, “lastname”, is then coded, for example literally in binary. Next the content of “lastname” is coded with the aid of the grammar associated with the “lastname” element, to be created if necessary at the time of the first iteration, in similar manner to that described above for “firstname”.
Next, the “person” grammar is modified to add thereto a production corresponding to the start of the “lastname” element and it thus becomes:
“person” grammar
ElementContent:
SE (lastname) ElementContent 0
SE (firstname) ElementContent 1
EE 2
SE (*) ElementContent 3.0
CH ElementContent 3.1
The end element event, corresponding to the end of the “person” element, is then coded, using the production:
EE 2
It is to be noted that all the productions of the grammar, with the exception of this last production, comprise the description of an event, the associated code and the following grammar to use. This following grammar is that used to continue the coding after the coding of the event included in the production.
However, in the case of an event describing a start element, the grammars specific to that element are used to code the content of the element. The following grammar indicated in the production comprising the start element event is used for the coding after the end of that element.
Thus, the production comprising the end element event does not contain any following grammar: the grammar to use to code the following portion of the document is that which had been indicated by the grammar of the parent element in the production used to code the start event of that element.
If, further on in the XML document, the coder encounters another similar “person” element, that element will be coded on the basis of that grammar. Thus the first event corresponding to the content of the “person” element is the start event of the “firstname” element. This element is coded with the production:
SE (firstname) ElementContent 1
It is noted that the production
SE (*) ElementContent 3.0
also corresponds to that event, but is less precise (it does not specify the “firstname” name of the element). It is thus the first production which is used for an increased coding efficiency.
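The choice of the most precise production may be sketched in Python as follows. The grammar representation, the function name select_production and the returned flag indicating whether the element name remains to be coded are assumptions made for this illustration.

```python
# "person" grammar after both child elements have been encountered once.
PERSON_GRAMMAR = [
    (("SE", "lastname"), (0,)),
    (("SE", "firstname"), (1,)),
    (("EE", None), (2,)),
    (("SE", "*"), (3, 0)),
    (("CH", None), (3, 1)),
]


def select_production(grammar, kind, name=None):
    """Return (priority, name_to_code) for the most precise matching production."""
    generic = None
    for (p_kind, p_name), priority in grammar:
        if p_kind != kind:
            continue
        if kind != "SE" or p_name == name:
            return priority, False   # the production fully describes the name
        if p_name == "*":
            generic = priority       # remember the generic fallback
    if generic is None:
        raise LookupError("no production describes this event")
    return generic, True             # the name still has to be coded


print(select_production(PERSON_GRAMMAR, "SE", "firstname"))
print(select_production(PERSON_GRAMMAR, "SE", "address"))
```

In this sketch a start element whose name is known to the grammar is coded by its priority alone, whereas an unknown name (here the hypothetical “address”) falls back on the generic production and must be coded in addition.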
The coder thus codes the priority of this production, that is to say the value “1”, which is coded over two bits (as it takes the values from 0 to 3), i.e. “01”. There is no need to code the name of the element, since it is specified by the production and arises from the initial literal coding when the “firstname” element was encountered for the first time.
The coder next codes the content of the “firstname” element.
As a production specific to the start event of the “firstname” element already exists in the grammar, it is not necessary to add a new production to the grammar.
The coder next codes the start event of the “lastname” element similarly, by solely coding the priority “0” with the two bits “00”.
Thus, for the coding of the second “person” element similar to the first, the code generated is more compact, since it is no longer necessary to code the name of the elements contained in “person”, either literally (by coding the entirety of the string), or even using an index.
A point that is common to the Fast Infoset and Efficient XML methods is the use of coding tables, respectively indexing tables and grammar/production tables, which can be upgraded and kept up to date by the coder to describe each of the elements of the data to code. In the remainder of the present document, these tables will be referred to by the term coding tables without distinguishing between them. The coding tables are constituted by coding structures associating at least one coding value with an element.
Whether it be for one or the other of these two coding methods, the coding of an XML document requires several processing operations that are costly in time and machine resources, such as:
the literal coding of XML strings, for example prefixes, local names or values, in UTF8 or UTF16 format;
searching, in the coding tables, for the indexes corresponding to a processed piece of XML information (or item or event);
constructing and updating the coding tables, for example based on a single grammar by default.
It is also noted that these processing costs multiply when the number of documents to code is multiplied.