1. Field of the Invention
The present invention concerns a method and a device for processing structured documents. It applies in particular to the XML language (the acronym for “Extensible Markup Language”). This language is a syntax for defining computer languages. XML thus makes it possible to create languages adapted to different uses but able to be processed by the same tools.
2. Description of the Related Art Including Information Disclosed Under 37 CFR 1.97 and 37 CFR 1.98
An XML document is composed of elements, each element commencing with an opening tag comprising the element name (for example: <tag>) and terminating in a closing tag comprising, also, the name of the element (for example: </tag>). Each element can contain other elements in a hierarchical fashion or text data.
In addition, an element can be specified by attributes, each attribute being defined by a name and having a value. The attributes are placed in the opening tag of the element that they specify (for example: <tag attribute="value">).
XML syntax also makes it possible to define comments (for example: “<!--Comment-->”) and processing instructions, which may specify to a computer application the processing operations to be applied to the XML document (for example: “<?myprocessing?>”), as well as escape sections, which prevent a section of text that has the form of a tag from being interpreted as one (for example: “<![CDATA[<text>Escape</text>]]>”, where <text> is recognized as a character string rather than a tag).
In XML terminology, all the terms “element”, “attribute”, “text data”, “comment”, “processing instruction” and “escape section” are grouped together under the generic name “item”. In a more general context, all these terms (forming for example the element defined between an opening tag and a closing tag) can be grouped together under the generic name “node”.
This XML data can be described in terms of events, with an event corresponding to each part of a document. For an element <tag></tag>, a “Start of element” event is distinguished first, characterized by the name “tag”, followed by an “End of element” event. According to the syntax of markup languages, the latter repeats the corresponding element name, here “/tag”, but is nevertheless independent of it, since an “End of element” event conventionally terminates the last-opened element. The other most frequent events are “Character” events for text data, “Comment” events for comments, and “Attribute” events for attributes.
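Purely by way of illustration, the event view described above can be sketched with the SAX interface of the Python standard library. The class name EventLogger and the function list_events are hypothetical names chosen for this sketch, and only the element, attribute and character events are reported.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records the events reported by the parser, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("Start of element", name))
        # Attributes are reported together with the opening tag.
        for attr_name, attr_value in attrs.items():
            self.events.append(("Attribute", f"{attr_name}={attr_value}"))

    def characters(self, content):
        # Ignore whitespace-only text between tags.
        if content.strip():
            self.events.append(("Character", content))

    def endElement(self, name):
        self.events.append(("End of element", name))


def list_events(xml_text):
    """Return the list of (event type, detail) pairs for a document."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in list_events('<tag attribute="value">text</tag>'):
        print(event)
```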
Several different languages based on XML may contain elements with the same name. In order to be able to mix several different languages, an addition has been made to XML syntax for defining namespaces. Two elements are identical only if they have the same name and are situated in the same namespace. A namespace is defined by a URI (the acronym for “Uniform Resource Identifier”), for example “http://canon.crf.fr/xml/mylanguage”. The use of a namespace in an XML document requires the definition of a prefix that is a shortcut to the URI of this namespace. This prefix is defined by means of a specific attribute (for example: xmlns:ml="http://canon.crf.fr/xml/mylanguage" associates the prefix “ml” with the URI “http://canon.crf.fr/xml/mylanguage”). Next, the namespace of an element or attribute is specified by preceding its name with the prefix associated with the namespace, followed by “:” (for example: <ml:tag ml:attribute="value"> indicates that the element tag belongs to the namespace associated with the prefix ml and that the same applies to the attribute attribute).
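As an illustration of how a prefix resolves to its URI, the following sketch parses the example element with the ElementTree module of the Python standard library, which reports qualified names in the form "{uri}local". The helper split_qualified_name is a hypothetical name introduced for this sketch.

```python
import xml.etree.ElementTree as ET

URI = "http://canon.crf.fr/xml/mylanguage"
DOC = f'<ml:tag xmlns:ml="{URI}" ml:attribute="value"/>'


def split_qualified_name(name):
    """Split ElementTree's "{uri}local" form into (uri, local)."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name  # name without a namespace


root = ET.fromstring(DOC)
# The prefix "ml" has been replaced by the URI it stands for.
element_name = split_qualified_name(root.tag)
attribute_names = [split_qualified_name(key) for key in root.attrib]

if __name__ == "__main__":
    print(element_name)
    print(attribute_names)
```

The prefix itself does not appear in the parsed names: two elements written with different prefixes bound to the same URI would yield the same (uri, local) pair.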
In certain cases, the above XML elements or data are processed by computer by means of an appropriate description of the language, called a grammar.
This description is based on various rules, also referred to as productions, and by combining these productions it is possible, as desired, to check that data is in conformity with this grammar, or to generate data in conformity with this grammar, for example in order to encode this data.
Solely for purposes of illustration, consider the following example, composed of an alphabet of two symbols, A and B, a starting symbol X, and two productions:
P0: X→AX
P1: X→B
Starting from X, AX is obtained by using the production P0. If the production P0 is applied again, AAX is obtained. Then the production P1 is applied, and AAB is obtained. The language generated by this grammar is the set of all strings consisting of zero or more As followed by a single B.
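The derivation above can be reproduced with a short sketch. The function names derive and in_language are hypothetical; the productions P0 and P1 are those of the example.

```python
# Each production maps a left-hand symbol to its replacement.
PRODUCTIONS = {
    "P0": ("X", "AX"),
    "P1": ("X", "B"),
}


def derive(production_names, start="X"):
    """Apply the named productions in order and return every step."""
    current = start
    steps = [current]
    for name in production_names:
        left, right = PRODUCTIONS[name]
        if left not in current:
            raise ValueError(f"{name} cannot be applied to {current!r}")
        # Rewrite the first occurrence of the left-hand symbol.
        current = current.replace(left, right, 1)
        steps.append(current)
    return steps


def in_language(text):
    """True if text is zero or more As followed by exactly one B."""
    return text.endswith("B") and set(text[:-1]) <= {"A"}


if __name__ == "__main__":
    print(derive(["P0", "P0", "P1"]))
```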
Thus productions and grammars constitute a set of rules for constructing a document of hierarchical data, here an XML file.
The use of grammars is advantageous for certain operations performed on XML data, two examples of which will be provided below.
The first example relates to the validation of XML data, in which it is checked that the content of an XML document corresponds to a model. This model is described in what is termed a schema. Thus, according to one example, a schema may define an element book by specifying that a book contains a reference, a title, an author and a publication date; in addition, the element author can be defined as containing a last name, a first name and a date of birth.
Once a schema is known, it is therefore possible to evaluate the conformity of an XML document with respect to this schema. A conventional method for performing this evaluation is based on the use of grammars: from the schema, grammars are constructed, each grammar containing the productions corresponding to the events that are possible according to the schema. If no order is specified between the reference, the title, the author and the publication date, the content of an element book is described by a grammar containing a production for each of these elements (where there is a constraint, in particular a constraint of order, it is necessary to provide a more complex grammar that allows only the sequences satisfying this constraint). If data is encountered that corresponds to none of these productions, then the data is not in conformity and the validation fails.
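A minimal sketch of this validation, for the unordered case only, is given below. The element names (such as publication-date or last-name) and the function is_valid are assumptions made for the sketch; it checks only that each child corresponds to a production of its parent's grammar, and does not check that required elements are present or that the root element itself is expected.

```python
import xml.etree.ElementTree as ET

# One grammar per element: the set of child elements it has productions for.
GRAMMARS = {
    "book": {"reference", "title", "author", "publication-date"},
    "author": {"last-name", "first-name", "birth-date"},
}


def is_valid(element):
    """Return False as soon as a child matches no production of its parent."""
    allowed = GRAMMARS.get(element.tag, set())  # no grammar: no child elements
    for child in element:
        if child.tag not in allowed:
            return False
        if not is_valid(child):
            return False
    return True


VALID = "<book><title>T</title><author><last-name>L</last-name></author></book>"
INVALID = "<book><price>10</price></book>"

if __name__ == "__main__":
    print(is_valid(ET.fromstring(VALID)))
    print(is_valid(ET.fromstring(INVALID)))
```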
The second example is that of the compression of XML data. XML data is text data that is generally redundant, and thus it is useful to compress it. The most conventional method for compressing XML data consists of replacing the XML tags with codes. Thus it can be decided that, for a start of element book, the hexadecimal code 0x01 is used, while for a start of element author, 0x02 is used, and the code 0xFF for its part designates the end of an element. With this type of encoding, however, compression is not optimal since a code always has the same meaning whatever the current context, whereas only certain events can occur in a given context. Let us take the example of the element book and the element author: if the XML document presents a list of books, then it is logical for the element author to appear only within an element book, the author being seen in this case as a characteristic of the book. Consequently it can be considered that it is not useful to reserve the code 0x02 for author outside an element book, since it is assumed that in this context the element author would not occur.
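The fixed-code scheme described above can be sketched as follows, using the three codes given in the text. The event representation as (kind, name) pairs and the function encode are assumptions made for the sketch.

```python
# Context-independent table: a code always has the same meaning.
FIXED_CODES = {
    ("start", "book"): 0x01,
    ("start", "author"): 0x02,
    ("end", None): 0xFF,
}


def encode(events):
    """Replace each event by its one-byte code."""
    return bytes(FIXED_CODES[event] for event in events)


EVENTS = [
    ("start", "book"),
    ("start", "author"),
    ("end", None),  # closes author
    ("end", None),  # closes book
]

if __name__ == "__main__":
    print(encode(EVENTS).hex())
```

Every event costs a full byte here, including in contexts where only one or two events are actually possible.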
One solution for improving this technique is based on the use of the grammars presented above. This is particularly the case with “efficient XML”, the format used as a basis for the standardization of a binary XML format by the EXI (the acronym for “Efficient XML Interchange”) working group of the W3C (the acronym for “World Wide Web Consortium”, the organization producing standards for the Web) disclosed in the publication “EXI Format 1.0—W3C working draft of Dec. 19, 2007” by J. Schneider. The EXI format takes account of the order of appearance of the various items within a document in order to construct a grammar that makes it possible to encode the most frequent items in a small number of bits. With each element, a grammar is associated that describes the data liable to be encountered within this element. For each type of data, that is to say for each type of event within this context element, a production is inserted in the grammar and a code is associated with this production. In this way, the codes used for coding the data depend on the context, which makes it possible to use codes of smaller size. In the case of book and author the grammar associated with the element book comprises a production describing the start of the element author, and the other grammars do not contain such a production.
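The benefit of context-dependent codes can be illustrated with a simplified sketch. It is not the event-code scheme of the EXI format itself: the grammar contents, the event labels and the function code_for are assumptions, and each code is simply the index of the production within the grammar of the current element, written on the minimum number of bits.

```python
# One grammar per context element; list order determines the code.
GRAMMARS = {
    "document": ["start:book", "end"],
    "book": ["start:author", "start:title", "end"],
    "author": ["characters", "end"],
}


def code_for(context, event):
    """Return (code, width in bits) of an event in the given context."""
    productions = GRAMMARS[context]
    code = productions.index(event)  # ValueError if the event is not allowed
    # Minimum number of bits needed to distinguish the productions.
    width = (len(productions) - 1).bit_length()
    return code, width


if __name__ == "__main__":
    print(code_for("book", "start:author"))
    print(code_for("document", "start:book"))
```

Only the grammar of book contains a production for the start of author, so no code is reserved for that event in the other contexts, and each context needs only one or two bits per event instead of a full byte.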
As specified in the publication, the events or items dealt with by a production are defined by an event type (for example a start of an element, an attribute, a character, etc.) and, depending on the event type, by an item of content (value, text, etc.) or a qualified name.
When data thus structured is processed using grammars, the production describing a data item, corresponding to an event in the context element, is conventionally looked up using a hash table. This table associates keys, corresponding to the data, with objects, corresponding to the productions. Thus, from a data item in the element, a number is calculated using a hash function. The production describing the data item is then found in the table at the index corresponding to the number obtained. The access times to a hash table are on average satisfactory.
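The conventional lookup described above can be sketched with an explicit hash table. The class ProductionTable, the function simple_hash and the production labels are hypothetical; the hash function is a basic polynomial hash chosen for illustration, not one prescribed by any format.

```python
def simple_hash(name, size):
    """Polynomial hash: visits every character of the key."""
    value = 0
    for char in name:
        value = (value * 31 + ord(char)) % size
    return value


class ProductionTable:
    """Hash table associating data items (keys) with productions."""

    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]

    def add(self, key, production):
        index = simple_hash(key, len(self.buckets))
        self.buckets[index].append((key, production))

    def find(self, key):
        # The hash is recomputed over the whole key on every lookup.
        index = simple_hash(key, len(self.buckets))
        for stored_key, production in self.buckets[index]:
            if stored_key == key:
                return production
        return None


table = ProductionTable()
table.add("author", "production for start of author")
table.add("title", "production for start of title")

if __name__ == "__main__":
    print(table.find("author"))
    print(table.find("price"))
```

Finding the bucket is fast on average, but each call to find first processes every character of the key in order to compute the index.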
Nevertheless, the drawback of this solution lies in the calculation of the hash value using a hash function, which is expensive in terms of time and resources.
Alternative solutions have been proposed for accessing the productions describing the data in an element. In particular, probabilistic grammars are known, in which a probability is associated with each production and from which the production most likely to be used is determined.
These probabilistic grammars do however have the drawback of requiring learning by means of a certain number of prior structured documents, as illustrated by the patent application US 2007/0022373. This learning also gives rise to significant costs prior to the processing of a structured document and does not allow the effective processing of documents with different origins, in particular if the document to be processed does not have a structure close to those of the learning documents.
The problem thus arises of how to obtain more efficiently a production for an event to be processed.