The present invention concerns a method, a device and a computer program for generating reference patterns able to represent hierarchized data. The invention also concerns a method, a device and a computer program for coding hierarchized data, in particular stored in a document written in a markup language on the basis of reference patterns. The invention also concerns a method, a device and a computer program for decoding coded hierarchized data.
Numerous applications manipulate hierarchically structured data, also termed ‘hierarchized data’. A document of hierarchized data incorporates two types of information: a first type of information describing the structure of the document and a second type of information describing the actual content of the data.
The information of the first type, referred to as ‘structural information’, is all the information that serves to hierarchize the data, as well as the information serving to describe the type of value or instance taken by the data of the document. The information of the second type, called ‘content information’, represents the values or instances taken by the data of the document.
The link between the structural information and the content information depends on the language used for hierarchizing the data.
There exist several ways of describing a hierarchized data structure. The most usual one uses the XML markup language, XML being the acronym for ‘Extensible Markup Language’. This language is standardized by the World Wide Web Consortium (W3C) (a description of the language can be found in the website at w3.org in the subdirectory “REC-xml” of subdirectory “TR”). XML is being used more and more for storing and transmitting digital data.
In practice, XML is a format for describing data, not a format for representing or displaying data.
The XML language defines a particular syntax for mixing the structural information and content information. The XML language defines several types of item for describing the structural information and content information. According to this syntax, a node, termed an ‘element’, is defined by an opening tag, a closing tag and an identifier. Each element can contain other elements or text data.
A leaf item, that is to say an item other than an element, usually represents content and can for example be text, a comment (for example: ‘<!-- comment -->’), a processing instruction (for example: ‘<?my_processing?>’) or an attribute. The attribute is an item located in the opening tag of an element and, apart from the actual content of the attribute, contains an identifier to define it (for example: ‘attribute tag=“value”’).
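By way of illustration only, the distinction between elements and leaf items can be sketched with the standard Python module xml.etree.ElementTree. The sample document and the function name ‘describe’ are assumptions introduced for this sketch and do not come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, chosen only for illustration.
DOC = '<book id="42"><title>XML</title></book>'


def describe(xml_text):
    """List (name, attributes, text) for every element, in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the root element and all of its descendants.
    return [(el.tag, dict(el.attrib), el.text) for el in root.iter()]


if __name__ == "__main__":
    for name, attributes, text in describe(DOC):
        print(name, attributes, text)
```

In this sketch, ‘book’ and ‘title’ are elements, ‘id’ is an attribute carried by the opening tag of ‘book’, and the character string ‘XML’ is text content of ‘title’.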
XML is a syntax making it possible to define new languages. It is thus possible to define a plurality of XML languages that can be processed using generic tools.
In addition, XML syntax makes it possible to structure data, which makes it possible to produce documents containing the structural descriptions of the data.
Finally, XML syntax is textual and can be read or written easily by a user.
Several different XML languages can contain elements with the same name. Thus, in order to be able to mix several different XML languages, XML syntax makes it possible to define namespaces. In this way, two elements are identical if they have the same name and are situated in the same namespace.
A namespace is defined by a uniform resource identifier (URI), for example: ‘http://canon.crf.fr/xml/monlangage’.
The use of a namespace in an XML document is achieved by defining a prefix that is a shortcut to the uniform resource identifier of this namespace.
This prefix is defined by means of a specific attribute. For example, the expression ‘xmlns:ml=“http://canon.crf.fr/xml/monlangage”’ associates the prefix ‘ml’ with the uniform resource identifier ‘http://canon.crf.fr/xml/monlangage’.
Next, the namespace of an element or attribute is specified by preceding the name with the prefix associated with the namespace followed by a ‘colon’ as illustrated in the following example: ‘<ml:balise ml:attribut=“valeur”>’.
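The prefix mechanism described above can be sketched as follows, again using the standard Python module xml.etree.ElementTree, which reports each name in the form ‘{URI}local-name’. The second document, which uses the prefix ‘x’, and the function name ‘qualified_names’ are assumptions introduced for this sketch.

```python
import xml.etree.ElementTree as ET

URI = "http://canon.crf.fr/xml/monlangage"

# The example from the text, with the prefix declaration made explicit.
DOC = f'<ml:balise xmlns:ml="{URI}" ml:attribut="valeur"/>'
# Hypothetical variant: same namespace, different prefix.
OTHER_PREFIX = f'<x:balise xmlns:x="{URI}" x:attribut="valeur"/>'


def qualified_names(xml_text):
    """Return the namespace-qualified element name and attribute names."""
    root = ET.fromstring(xml_text)
    # The parser replaces each prefix by '{URI}', so prefixes do not appear.
    return root.tag, sorted(root.attrib)


if __name__ == "__main__":
    print(qualified_names(DOC))
    print(qualified_names(DOC) == qualified_names(OTHER_PREFIX))
```

The two documents yield the same qualified names, which suggests that the prefix is only a local shortcut and that the identity of a name rests on the uniform resource identifier.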
One example of a format description language making it possible to define the structure of an XML document is the language called XML Schema.
An XML schema is itself a language using XML syntax making it possible to define XML languages. It thus makes it possible to define, for an XML language, the elements used by the language, the attributes that these elements contain, their arrangement, etc.
An XML schema therefore defines the syntax of an XML language or a part of a language. The schema defines the structure of the hierarchized data contained in documents written in XML language. In particular, for each element of the XML language, the XML schema defines the name, the namespace, the content of the element and the list of the attributes of the element, specifying in particular whether or not an attribute is obligatory, and whether other attributes can be added as well as the type of content of each attribute. The content of an element may for example be data, sub-elements or a combination of the two.
Thus an XML schema is a set of definitions, each definition corresponding to an XML item. These definitions are connected together either by being included in one another or by using references. Each definition specifies not only the content of an XML item but also its relationships with the other close XML items (for example the number of instances possible for this XML item, the possibility of co-occurrence of an instance of this definition with an instance of another definition, etc).
The schema can define the content of an element more precisely by specifying a type for the content. In the case of data, the type of the element corresponds to the type of the data, for example character string, integer, etc. In the case of sub-elements, the type of the element defines the sub-elements present, their number and their order.
An XML schema can not only be used to define the syntax of an XML language but also makes it possible to verify that a document written in XML language complies with the syntax of the XML language to which it belongs. This verification process is called validation. It makes it possible to prevent an application from processing an erroneous document.
Markup languages, in particular the XML language, are used to store data in a file or to exchange data. Their use makes it possible in particular to have available numerous tools for processing the files generated. In addition, a document written for example in XML can be edited manually with a simple text editor. Moreover, given that a document written in markup language, for example XML, contains its structure integrated in the data, this document is legible even without knowledge of its specification.
However, XML syntax is very verbose. Thus the size of an XML document can be several times greater than the intrinsic size of the data. This large size of XML documents therefore gives rise to a long processing time when such documents are generated and in particular when XML documents are read.
Various methods are known for compressing a document without losing data.
Thus the ‘zip’ or ‘gzip’ compression methods make it possible to code a document in a compressed form that uses less memory space than the original document. These compression methods are reversible and it is therefore possible to recover the original document. These methods are based on the algorithm called ‘DEFLATE’ defined by the document RFC 1951 and accessible at the following Internet address: http://www.ietf.org/rfc/rfc1951.txt.
The ‘DEFLATE’ algorithm is based on the detection of repetitions in order to reduce the size of the coded data. Thus, when the algorithm detects that a data sequence has already appeared in the document, the algorithm stores a reference to this data sequence instead of storing the data sequence. In this way, the coding of several repetitions of the same data sequence is effective.
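The effect of repetitions on the size of the coded data can be observed with the standard Python module zlib, whose ‘compress’ function produces a DEFLATE stream inside a zlib wrapper. The function name ‘deflate_sizes’ and the sample data are assumptions introduced for this sketch.

```python
import zlib


def deflate_sizes(data: bytes):
    """Return (original size, compressed size) after a lossless round trip."""
    compressed = zlib.compress(data)
    # The method is reversible: decompression restores the original bytes.
    assert zlib.decompress(compressed) == data
    return len(data), len(compressed)


if __name__ == "__main__":
    # Hypothetical sample: the same tagged sequence repeated many times.
    repetitive = b"<item>value</item>" * 200
    print(deflate_sizes(repetitive))
```

For the highly repetitive sample, the compressed form is much smaller than the original, since each repetition after the first can be coded as a reference to an earlier occurrence.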
The ‘zip’ or ‘gzip’ compression methods do however have a certain number of disadvantages in compressing XML files. This is because these methods do not have knowledge of the XML syntax and can in no way exploit the specificities of this language in order to effectively code an XML document. In addition, these methods make it possible to effectively code only the identical repetitions of the same data sequence.
To mitigate these drawbacks, other compression methods have been adapted to the XML syntax. Thus one solution is to code the structural information in a binary format instead of using a text format. Several methods exist for this, one example being the ISO standard FastInfoset defined by the specification ITU-T Rec. X.891.
In addition, the redundancy of the structural information in the XML format is eliminated or at least decreased, for example, by omitting the name of the element in the opening tag and closing tag.
According to another method, the XML schema associated with an XML document is used to code the document. Because the XML schema describes the structure of the data stored in the XML document, the use of the schema makes it possible not to code some of the structural information of the data of the XML document, the latter being able to be reconstructed by the decoder by means of the same XML schema.
It is known, in particular from the FastInfoset standard, how to use an XML schema in order to generate an index table for the names of the elements and attributes. In addition, the schema can make it possible to generate an index table for the predefined contents or those whose options are specified in the XML schema. Since these tables are constructed from the XML schema, they are not inserted when the XML document is coded.
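The principle of an index table shared by the coder and the decoder can be sketched as follows. This is a minimal illustration under stated assumptions and not the format defined by the FastInfoset standard: the function names, the one-byte-per-name coding (which limits the table to 256 names) and the sample names are all hypothetical.

```python
def build_index(schema_names):
    """Derive a name-to-index table from the names declared in a schema.

    The names are sorted so that coder and decoder, each starting from the
    same schema, obtain the same table without exchanging it.
    """
    return {name: i for i, name in enumerate(sorted(set(schema_names)))}


def encode_names(names, index):
    """Code each name as its index (assumption: fewer than 256 names)."""
    return bytes(index[name] for name in names)


def decode_names(data, index):
    """Recover the names from their indexes using the same table."""
    reverse = {i: name for name, i in index.items()}
    return [reverse[b] for b in data]


if __name__ == "__main__":
    # Hypothetical names that a schema might declare.
    index = build_index(["price", "item", "order"])
    coded = encode_names(["order", "item", "item", "price"], index)
    print(coded, decode_names(coded, index))
```

In this sketch only the indexes are transmitted; the table itself is rebuilt on the decoding side from the schema, which is why it need not be inserted in the coded document.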
In addition, it is known how to use an XML schema in order to take account of the type of a value when coding it. XML syntax does not directly support typed data and codes all the data in text form, whereas an XML schema makes it possible to specify the types of data. The Xebu format, described in the article entitled ‘Xebu: A Binary Format with Schema-Based Optimizations for XML Data’ by Jaakko Kangasharju, Sasu Tarkoma and Tancred Lindholm, published at the WISE 2005 conference, is one example of a format that takes account of the type of a value in order to code it.
With such a method, an integer or a real number is therefore no longer coded in a fairly inefficient text form but in an optimized form, in terms of both size and coding and decoding time.
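The gain obtained by coding a typed value in binary rather than text form can be illustrated with the standard Python module struct. The choice of a 32-bit big-endian signed integer and the function names are assumptions made for this sketch; they do not reproduce the coding actually used by Xebu.

```python
import struct


def text_size(value: int) -> int:
    """Size in bytes of the integer written as decimal text, as in XML."""
    return len(str(value).encode("utf-8"))


def binary_int32(value: int) -> bytes:
    """Code the integer as a 32-bit big-endian signed value (assumed type)."""
    return struct.pack(">i", value)


if __name__ == "__main__":
    value = 1234567890  # hypothetical sample value
    coded = binary_int32(value)
    print(text_size(value), len(coded), struct.unpack(">i", coded)[0])
```

For this sample value, the text form occupies ten bytes whereas the typed form occupies four, and the decoder recovers the integer directly without parsing a character string.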
These methods make it possible to code more effectively the data contained in an XML document, whilst allowing the reconstruction of the XML document.
Thus it is possible to use a method such as Xebu or FastInfoset in order to generate a compact representation of an XML document, using certain properties of XML syntax, and then to use a generic compression method such as ‘zip’ or ‘gzip’ in order to compress this compact representation.
Such a combination makes it possible to reduce the size of the document generated but is performed in two steps, both at coding and at decoding, which requires a large amount of computing power and makes it necessary to store intermediate data.
In addition, a generic compression method such as ‘zip’ or ‘gzip’ cannot take account of the properties of XML syntax or of the compact representation used to improve the compression ratio and/or the compression or decompression time.