1. Field of the Invention
The present invention relates to a method for decoding encoded structured data from a bit-stream and a corresponding decoder, using in particular a multi-core system.
A particular but non-exclusive application of the present invention is the decoding of an EXI (standing for “Efficient XML Interchange”) bit-stream obtained from the encoding of a structured document such as an XML (standing for “Extensible Markup Language”) file.
2. Description of the Related Art
The XML (standing for “Extensible Markup Language”) format is a syntax for defining computer languages, which makes it possible to create languages adapted to different uses which may however be processed by the same tools.
An XML document is composed of elements, each element being delimited by an opening tag (or start tag) comprising the name of the element (for example: <tag>) and a closing tag (or end tag) which also comprises the name of the element (for example </tag>). Each element may contain other elements in a hierarchical child-parent relationship and/or contain text data defining content. Given these structural tags, data in such a document is referred to as “hierarchized structured data”.
The definition of an element may also be refined by a set of attributes, each attribute being defined by a name and having a value. The attributes are then placed in the start tag of the element they are refining (for example <tag attribute=“value”>).
XML syntax also makes it possible to define comments (for example: “<!--Comment-->”) and processing instructions, which may specify to a computer application what processing operations to apply to the XML document (for example: “<?myprocessing?>”).
Several markup languages based on the XML language may contain elements with the same name. To allow several different languages to be mixed in the same document, an extension of the XML language has been specified to define “Namespaces” for XML. Using this extension, two elements are identical only if they have the same name (also named “local-name”) and are located in the same namespace.
A namespace is defined by a URI (standing for “Uniform Resource Identifier”), for example “http://canon.crf.fr/xml/mylanguage”. A namespace is used in an XML document through the definition of a prefix, which is a compact representation of the URI of the namespace. The prefix is defined using a specific attribute (for example, xmlns:ml=“http://canon.crf.fr/xml/mylanguage” binds the prefix “ml” to the URI “http://canon.crf.fr/xml/mylanguage”). Next, the namespace of an element or of an attribute is specified by preceding its name with the prefix associated with the namespace followed by “:” (for example, <ml:tag ml:attribute=“value”> indicates that the element “tag” belongs to the namespace bound to the prefix “ml” and that the same applies for the attribute “attribute”).
Taken together, the above features identifying a specific element or attribute are called a “qualified name” (or “qname” according to the EXI recommendation). Thus, a qualified name groups together the local name of the element or attribute (for example “tag”), possibly a URI (for example “http://canon.crf.fr/xml/mylanguage”) and possibly a prefix (for example “ml”). A qname uniquely identifies a type of structural XML data. Several occurrences of the same qname item may then occur in the XML data, usually with different values.
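By way of illustration only, the following minimal sketch shows how the URI and local name of a qualified name may be recovered with the xml.etree.ElementTree module of the Python standard library, which expands each prefix into a “{URI}local-name” form. The helper split_qname and the variable names are hypothetical and are not taken from the EXI recommendation.

```python
import xml.etree.ElementTree as ET

# Sample element reusing the namespace example given above.
DOC = ('<ml:tag xmlns:ml="http://canon.crf.fr/xml/mylanguage" '
       'ml:attribute="value"/>')

def split_qname(name):
    """Split ElementTree's '{uri}local' form into (uri, local_name).

    Hypothetical helper: returns (None, name) when no namespace applies.
    """
    if name.startswith("{"):
        uri, local_name = name[1:].split("}", 1)
        return uri, local_name
    return None, name

root = ET.fromstring(DOC)
element_qname = split_qname(root.tag)
# The xmlns:ml declaration itself is not reported as an attribute.
attribute_qnames = [split_qname(key) for key in root.attrib]
```

In this sketch the prefix “ml” no longer appears in the result: two names are compared through their URI and local name only, which is consistent with the identity rule stated above.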
In XML terminology, the terms “element”, “attribute”, “text data”, “comment”, “processing instruction” and “escape section” are grouped together under the generic name of “item”.
Hierarchized structured data incorporates two types of information: a first type of information defining the structure of the data and a second type of information defining the actual content of the data.
The pieces of information of the first type, termed “structural information”, are all pieces of information which serve for hierarchizing the data, for example defined by an item type (through a qname). The pieces of information of the second type, termed “content information”, represent the values or the instances taken by the data.
By way of illustration, for an element containing text data such as <element>example</element>, the start tag <element> of the element and the end tag </element> are structural information, whereas the string “example” is content information.
The link between the pieces of structural information and the pieces of content information depends on the language used for hierarchizing the data. However, generally, a document containing hierarchized structured data may be seen as a set of “items” organized as a “tree”. The content information corresponds to the leaves of the tree (element content and attribute values) while the structural information corresponds to the nodes and links between nodes in the tree (element and attribute names, types of the nodes and child/parent relationships of the nodes).
As shown above, the XML items may be described in terms of events, also called XML events. Thus, there are XML events relating uniquely to a piece of structural information such as the start tag and the end tag, and there are XML events comprising content, for example the attributes or the text data of the elements or comments.
For the needs of illustration below, the following notations will be used for the XML event structural information:
SE: start tag event;
EE: end tag event;
ATTR: attribute event;
CH: character or text event.
The content information consists for example of the ATTR and CH event values. To process an XML document, it must be read from memory.
A first family of reading approaches consists in reading the XML data as a sequence of events, processing one event at a time. The methods of this family, such as the SAX API (standing for “Simple API for XML”), allow the XML document to be read by streaming, which requires little memory space. However, these approaches do not provide easy access to any desired part of the data.
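As a non-limiting sketch of this event-based reading, the following example uses the xml.sax module of the Python standard library to turn a small document into a sequence of SE, ATTR, CH and EE events. The class name EventLogger and the tuple layout of the events are assumptions made for the illustration only.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler recording events with the SE/EE/ATTR/CH notation."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("SE", name))
        for attr_name in attrs.getNames():
            self.events.append(("ATTR", attr_name, attrs.getValue(attr_name)))

    def characters(self, content):
        # Whitespace-only text is ignored to keep the sketch short.
        if content.strip():
            self.events.append(("CH", content))

    def endElement(self, name):
        # The end event carries no name: it closes the innermost open element.
        self.events.append(("EE",))

handler = EventLogger()
xml.sax.parseString(b'<a id="1"><b>text</b></a>', handler)
```

Only the current event is held at any time by the parser; the list handler.events is kept here solely so that the resulting sequence can be inspected.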
A second family consists in representing the whole XML data in memory as a tree, such as the DOM API (standing for “Document Object Model”). The methods of this family generally enable easy and fast access to each part of the XML document. However, they require a large amount of memory to store all the data simultaneously.
Access to an XML document generally requires parsing the XML events in order to identify structural information and content information.
Although the present invention as set out below applies to any type of document in XML format, it is particularly advantageous with XML formats such as SVG, Open Office, Open Document Format, PDFXML or XPS which generally provide an organization of the documents by section or page and are generally very voluminous.
The SVG format is an XML-based format for data designed to describe sets of vector graphics. The Open Office format and the Open Document Format are XML-based formats designed for storing electronic documents such as word processor, presentation or spreadsheet documents. The PDFXML format is an XML-based format for describing pages which enables preservation of the formatting of the original document. The XPS format is an XML-based format used to describe documents to be printed.
XML documents in such formats tend to be large. The present invention is particularly directed to decoding such large documents that have been previously encoded.
XML has many advantages and has become a reference language for storing data in a file or for exchanging data. XML in particular provides access to numerous tools for processing the files generated. Moreover, an XML document can be edited manually with a simple text editor. In addition, as an XML document contains its structure integrated in the data, such a document is highly legible even without knowing its specification.
However, the main drawback of XML syntax is its verbosity. This means that the size of an XML document may be several times greater than the intrinsic size of the raw data. This large size of XML documents also gives rise to a long processing time when XML documents are generated and read. It also leads to a long transmission time.
To remedy these drawbacks, methods of compressing and encoding an XML document have been sought. The aim of these methods is to encode the content of an XML document in a more efficient form, while enabling the XML document to be reconstructed easily. This is in particular the case for the Binary XML formats, which produce a binary bit-stream.
The simplest binary way to encode XML data is to encode markup information using a binary format instead of a text format. This has been improved by eliminating or decreasing redundancy in markup information, for example by avoiding specifying the name of the element in both the start tag and the end tag. Such mechanisms are used in all Binary XML formats.
More advanced mechanisms, such as some involved in EXI and Fast Infoset formats, use one or more indexing tables for encoding and decoding XML data, for example string tables or “partitions” as defined in the EXI recommendation. These indexing tables are very efficient when encoding repeated strings such as item names.
In practice, the first occurrence of a repeated name is encoded literally as a string, and an index is associated with it in a corresponding entry of the table. The index is incremented for each new entry, and the indexes are later coded over the smallest number of bits able to represent all the N indexes of the table (i.e. ⌈log2(N)⌉ bits).
Next, on each further occurrence of that name, it is encoded using the index of the associated entry, instead of the full string. This allows the size of the encoded document to be reduced, but also allows the parsing speed of the document to be increased.
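The principle of such an indexing table may be sketched as follows. This is a simplified illustration under stated assumptions, not the exact EXI string table algorithm: the class name StringTable, the tuple results and the moment at which N is evaluated are all hypothetical.

```python
class StringTable:
    """Hypothetical indexing table: literal on first use, index afterwards."""

    def __init__(self):
        self.index_of = {}  # string -> index, in order of first occurrence

    def index_bits(self):
        # Smallest number of bits able to represent N indexes, i.e.
        # ceil(log2(N)), computed with integers to avoid rounding issues.
        n = len(self.index_of)
        return (n - 1).bit_length() if n > 0 else 0

    def encode(self, name):
        if name in self.index_of:
            # Further occurrence: emit the index and its width in bits.
            return ("index", self.index_of[name], self.index_bits())
        # First occurrence: emit the literal and create a new entry.
        self.index_of[name] = len(self.index_of)
        return ("literal", name)

table = StringTable()
first = [table.encode(name) for name in ("width", "height", "d")]
again = table.encode("width")
```

With three entries in the table, a repeated name is coded over two bits in this sketch instead of the full character string.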
In the EXI format, there is a global indexing table that comprises all the string values indexed and that is shared by the entirety of the XML data, and there is a plurality of local indexing tables. Every new index created in a local indexing table is simultaneously created in the global indexing table.
Each local indexing table comprises the string values of items having the same qname type, i.e. of items having the same specific name. A local indexing table associated with a qname is thus used for indexing the content of the qname item (for example an attribute value, an element content, etc.).
For the purpose of illustration, on the first occurrence of a child element in the content of a given element, a new entry is added to the local indexing table (associated with the given element) to describe that child element. Further occurrences of a similar child element are described using the associated index. Local indexing tables generally produce shorter codes than the global value tables. A string value can only be assigned to one local indexing table.
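The interplay between the global indexing table and the local indexing tables may be sketched as follows, assuming for the illustration that a value is first looked up in the local table of its qname, then in the global table, and is otherwise encoded literally and added to both. The class ValueTables and its result tuples are hypothetical names.

```python
class ValueTables:
    """Hypothetical pair of global and per-qname local indexing tables."""

    def __init__(self):
        self.global_table = {}   # value -> global index
        self.local_tables = {}   # qname -> {value -> local index}

    def encode(self, qname, value):
        local = self.local_tables.setdefault(qname, {})
        if value in local:
            return ("local", local[value])
        if value in self.global_table:
            return ("global", self.global_table[value])
        # New value: an entry is created in the local table and,
        # simultaneously, in the global table.
        local[value] = len(local)
        self.global_table[value] = len(self.global_table)
        return ("literal", value)

tables = ValueTables()
results = [
    tables.encode("width", "10"),    # first occurrence anywhere
    tables.encode("height", "20"),   # first occurrence anywhere
    tables.encode("width", "10"),    # known locally for "width"
    tables.encode("height", "10"),   # known only through the global table
]
```

In this sketch the value “10” stays assigned to the single local table of “width”; when it appears under “height” it is reached through the global table, whose indexes are generally longer.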
In some embodiments aiming at limiting the memory used by indexing tables, a bounded table feature is specified that limits the number of indexed values. In this case, once the limit is reached, the oldest entry is removed (from the global and local tables) and the index of the oldest entry is assigned to the new string value.
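A bounded table of this kind may be sketched as a fixed-capacity table whose oldest slot is reused once the limit is reached. The class BoundedStringTable is a hypothetical name, and the sketch assumes that add is only called for values not already present in the table.

```python
class BoundedStringTable:
    """Hypothetical bounded table: the oldest entry gives up its index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = []      # index -> value
        self.index_of = {}    # value -> index
        self.oldest = 0       # slot of the oldest entry once the table is full

    def add(self, value):
        if len(self.values) < self.capacity:
            # Limit not reached yet: append a new entry.
            self.values.append(value)
            index = len(self.values) - 1
        else:
            # Limit reached: remove the oldest entry and reuse its index.
            index = self.oldest
            del self.index_of[self.values[index]]
            self.values[index] = value
            self.oldest = (self.oldest + 1) % self.capacity
        self.index_of[value] = index
        return index

bounded = BoundedStringTable(capacity=2)
assigned = [bounded.add(value) for value in ("a", "b", "c", "d")]
```

Here the third value takes over index 0 from the first value, and the fourth takes over index 1 from the second, so that the table never holds more than two entries.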
EXI format also uses grammars (tables) and productions to provide priority codes for encoding the structural information of XML data, the shorter codes being assigned to the more probable structure information. A detailed overview of these grammars and productions may be obtained from the EXI recommendation.
Grammars and local and global indexing tables generally evolve progressively with the encoding or the decoding of the XML data. These tables evolve in the same way during encoding and decoding: at any given point in the XML data, the tables must be identical whether encoding or decoding is in progress. It may be noted that the grammars may also be built from the knowledge of an XML Schema.
Content information encoding parameters may also be selected according to the knowledge of the XML Schema that allows a type to be assigned to each XML value. For example, encoding types have been designed to represent integers, floats, strings, etc. On the other hand, without an XML Schema, the values are all given a string type.
XML values of string type can be represented, in a coded form, using indexes of indexing tables, such as EXI global or local indexing tables as introduced above, that are progressively built during the encoding and decoding processing.
XML Schema (of which a description may be found at the addresses: http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/) is a language which defines the types of data present in an XML document. A document written in XML Schema constitutes a kind of “directory” of the types of data authorized and a structural model for all the XML documents conforming to that schema. This may concern the integer, float, or string type of a value, but also a qname type of structural information. An XML Schema may be used by both the encoder and the decoder to create the tables before processing the data.
Generally, using an XML Schema leads to improved compression of the structure since no learning is actually needed and codes obtained from schema-informed grammars are generally shorter than codes from learned grammars. For instance, schema-informed grammars produce shorter codes for a sequence of four mandatory elements (0 bits for each element, since all are predictable) than the learned grammar counterpart (2 bits for each element, as learned grammars are modelled as a choice).
In addition, the schema knowledge makes it possible to give a type to the XML values and to use specific encoding types that are generally more compactly encoded than the default string encoding. For instance, Boolean values can be represented as 1 bit, whereas their default string representation takes at least 16 bits.
In order to further improve the compression, two modes have been set up for EXI coding that provide another organization for the coded information within the EXI bit-stream generated: a mode referred to as “pre-compression” and a mode referred to as “compression”, which are distinguished from each other by the sole additional implementation for the “compression” mode of a final lossless compression algorithm, for example of DEFLATE type.
Both modes use rearrangement of the XML data in the EXI bit-stream. Rearrangement reduces entropy of the data by grouping together similar information, and thus better compression of the XML data is obtained.
This involves:
grouping the XML event structure information, while keeping the original order of the original XML data, to obtain a structure channel,
grouping the XML event content information (values, names, etc.) having the same qname to obtain a plurality of value channels corresponding to a plurality of qname items. In each value channel, the values are kept in the original order of the values in the original XML data,
putting the value channels in order according to the structural information (i.e. the structure channel), that is to say in the order of the first occurrences in the original XML data of the associated qnames,
encoding the structure channel (using the priority codes for the structure channel) and then the value channels in the resulting order. The values are encoded one after the other in each value channel, and the value channels are encoded one after the other. For each value channel, a local indexing table is used to index string values.
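The rearrangement steps above may be sketched as follows on a short event sequence. This is an illustration under simplifying assumptions: the function to_channels and the tuple layout of the events are hypothetical, the text content of an element is attached to the qname of that element, and no priority code or indexing table is applied.

```python
def to_channels(events):
    """Split events into a structure channel and per-qname value channels."""
    structure = []       # structural information, in document order
    channels = {}        # qname -> values; dict order = first occurrence
    open_elements = []   # stack of the names of the open elements
    for event in events:
        kind = event[0]
        if kind == "SE":
            open_elements.append(event[1])
            structure.append(("SE", event[1]))
        elif kind == "EE":
            open_elements.pop()
            structure.append(("EE",))
        elif kind == "ATTR":
            structure.append(("ATTR", event[1]))
            channels.setdefault(event[1], []).append(event[2])
        elif kind == "CH":
            structure.append(("CH",))
            channels.setdefault(open_elements[-1], []).append(event[1])
    return structure, channels

events = [
    ("SE", "svg"), ("ATTR", "width", "100"), ("ATTR", "height", "50"),
    ("SE", "rect"), ("ATTR", "width", "10"), ("ATTR", "height", "5"), ("EE",),
    ("SE", "text"), ("CH", "hi"), ("EE",),
    ("EE",),
]
structure, channels = to_channels(events)
```

The values of “width” end up side by side in one channel even though they are separated by other data in the document, which is what groups similar information together before any final compression.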
In the “compression” mode, a compression of DEFLATE type is then carried out on each of the structure and value channels so obtained, with the optional setting up of strategies for grouping channels together (essentially for values) depending on their size (in number of values).
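The final lossless step may be sketched with the zlib module of the Python standard library, here producing a raw DEFLATE stream for each value channel. The separator used to join the values, the function names and the sample channels are assumptions for the illustration; the actual byte layout of the channels and the grouping strategies are those defined by the EXI recommendation and are not reproduced here.

```python
import zlib

SEPARATOR = "\x00"  # hypothetical value delimiter, not the EXI string encoding

def deflate_channel(values):
    """Compress the values of one (non-empty) channel as raw DEFLATE."""
    raw = SEPARATOR.join(values).encode("utf-8")
    # A negative wbits value selects a raw stream, without zlib header.
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(raw) + compressor.flush()

def inflate_channel(blob):
    """Inverse of deflate_channel for the same hypothetical layout."""
    raw = zlib.decompress(blob, -15)
    return raw.decode("utf-8").split(SEPARATOR)

channels = {"width": ["100", "10", "100"], "height": ["50", "5", "50"]}
compressed = {qname: deflate_channel(vals) for qname, vals in channels.items()}
restored = {qname: inflate_channel(blob) for qname, blob in compressed.items()}
```

Each channel is compressed on its own in this sketch; grouping several small channels into one compressed block, as mentioned above, would amount to concatenating their contents before the call to deflate_channel.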
As for the “pre-compression” mode, this only re-organizes the data by channels as explained above, resulting in interleaved data. This “pre-compression” mechanism is illustrated using the SVG document 10 represented in FIG. 1A.
FIG. 1B represents the structure channel 12 thus obtained, using the notation introduced earlier.
FIG. 1C presents the order of the different value channels 14 so obtained, i.e. successively the channels corresponding to the qualified names width, height, d, style, x, y, xlink:href and rx.
The order of the values in these channels is that shown in FIG. 1D, and corresponds to the order of the values in the XML document for each channel, these channels themselves being put in order as in FIG. 1C.
As represented in FIG. 1E, the EXI coding of the document in FIG. 1A generates an EXI bit-stream 20 comprising, first of all, the structure channel 12 including the priority codes defined by the grammars, then the value channels 14 in the order defined in FIG. 1D. The exact encoding of each value is defined by the EXI recommendation: by default, the values are encoded as strings of characters using global and local indexing tables.
It is to be noted that the global indexing table evolves during the encoding of each successive value channel, to receive at least each new string value that is added to a local indexing table.
Decoding an EXI bit-stream 20 uses the same mechanisms, i.e. first decoding the structure channel and then decoding each of the value channels in the same order. The global and local indexing tables are created in a similar way to that during the encoding.
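A sequential decoding of this kind may be sketched as follows, assuming that the structure channel and the value channels have already been read from the bit-stream into memory. The function from_channels and the data layout are hypothetical and only mirror the simplified event notation used in this description.

```python
def from_channels(structure, channels):
    """Rebuild the event sequence from a structure channel and value channels."""
    # One read cursor per value channel; values are consumed in channel order.
    cursors = {qname: iter(values) for qname, values in channels.items()}
    open_elements = []
    events = []
    for entry in structure:
        kind = entry[0]
        if kind == "SE":
            open_elements.append(entry[1])
            events.append(entry)
        elif kind == "EE":
            open_elements.pop()
            events.append(entry)
        elif kind == "ATTR":
            events.append(("ATTR", entry[1], next(cursors[entry[1]])))
        elif kind == "CH":
            events.append(("CH", next(cursors[open_elements[-1]])))
    return events

structure = [
    ("SE", "svg"), ("ATTR", "width"),
    ("SE", "rect"), ("ATTR", "width"), ("EE",),
    ("SE", "text"), ("CH",), ("EE",),
    ("EE",),
]
channels = {"width": ["100", "10"], "text": ["hi"]}
decoded = from_channels(structure, channels)
```

In this sketch each value can only be placed once the structure channel indicates which channel to read next, and each channel is consumed strictly in order, which illustrates why the decoding tasks are not independent by construction.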
With the recent advent of multi-core CPUs, parallel approaches to processing XML data, whatever their form, for example an EXI bit-stream or an XML file, are becoming attractive.
A parallel approach appears to be very efficient for processing large documents, in particular EXI bit-streams for which the decoding time is long. The approach is even more attractive for highly compressed EXI bit-streams resulting from the EXI “pre-compression” and “compression” modes, since their decoding is even more demanding in terms of resources and decoding time.
Sharing the decoding of a highly compressed EXI bit-stream amongst several cores of a multi-core decoder is not an easy matter.
One may note that it is all the more difficult since, given the above construction of highly compressed EXI bit-streams, the encoded channels comprise interleaved data.
In this respect, it is not easy to obtain units in the bit-stream that may be considered as independent enough for parallel processing which requires independent decoding tasks.
This raises several difficulties. A first difficulty lies in identifying the positions of sufficiently independent units within the bit-stream.
Another difficulty concerns the relationships between different parts of the EXI bit-stream that result from indexing data of those parts using the same global indexing table. Such relationships run counter to the independence of those parts or units. During mono-task decoding, the indexing tables are progressively created, with all the information required each time for decoding the next piece of encoded data. However, during parallel decoding, this is no longer the case.
A further difficulty is to use each CPU core appropriately, i.e. to produce an even workload or the like amongst the CPU cores.
Therefore it would be desirable to provide efficient parallel decoding of encoded structured data in a multi-core decoder.