The present invention concerns access to documents of the XML (eXtensible Markup Language) type and more particularly methods and devices for optimizing the processing of XML documents.
The XML format is a standard for representing data in text form. These data are organized in a hierarchical manner in the form of trees. The XML processing entities, or parsers, give access to the data of the XML document via this tree structure.
There exist various types of XML parsers. A DOM (Document Object Model) parser constructs the entire tree in memory and enables a user to navigate this tree of XML nodes. The drawbacks of this model are the amount of memory it requires and the need to receive the whole document before processing can begin.
To resolve these problems, other parser models have been developed, such as the SAX (Simple API for XML) and PULL models, in which no tree is constructed in memory. Such parsers make it possible to navigate the XML tree from XML node to XML node, using a depth-first traversal. Only the current node of the XML tree is kept in memory. In this context, an XML node can in particular correspond to an opening XML element, a closing XML element or a text element. In the following example, the XML fragment contains three nodes: an opening element, a text node and a closing element.
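By way of illustration only, the DOM model can be sketched with the `xml.dom.minidom` module of the Python standard library. The namespace declaration `xmlns:ns="urn:example"` and the function name `dom_summary` are assumptions introduced for this sketch; they do not come from the present description.

```python
from xml.dom import minidom

# Hypothetical document; the xmlns:ns declaration is an assumption added so
# that a namespace-aware DOM parser accepts the "ns" prefix.
XML = '<ns:example xmlns:ns="urn:example" attribute="value">Textnode</ns:example>'


def dom_summary(xml_text):
    """Build the whole tree in memory, then navigate it."""
    # parseString needs the complete document before it returns anything.
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    # Once the tree exists, any node can be reached by navigation.
    text = root.firstChild.data
    return root.tagName, root.getAttribute("attribute"), text


print(dom_summary(XML))
```

The entire document is passed to the parser in a single call, which reflects the two drawbacks mentioned above: the tree is held in memory as a whole, and nothing can be processed until the document is complete.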
<ns:example attribute='value'>
Textnode
</ns:example>
The XML parser breaks down each node into a set of items, each represented as a character string; the exact set of items depends on the implementation of the parser. Taking the previous example, the first node (the opening element) can be separated into four items: “ns” (or “ns:example”, depending on the implementation of the parser), “example”, “attribute” and “value”. The second node is represented as a single item: “Textnode”. The third node is represented by two items: “ns” (or “ns:example”, depending on the implementation of the parser) and “example”. Each item has a particular function and is made accessible by the parser via a particular API (Application Programming Interface).
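As a non-limiting sketch of a PULL-type parser, the `XMLPullParser` class of Python's `xml.etree.ElementTree` module can be fed the document piece by piece. The namespace URI `urn:example`, the chunk boundaries and the function name `pull_events` are assumptions made for this sketch. This particular implementation exposes the namespace URI instead of the prefix “ns”, which illustrates that the set of items depends on the parser.

```python
import xml.etree.ElementTree as ET


def pull_events(chunks):
    """Feed the document in pieces and collect the items the parser exposes."""
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []

    def drain():
        for event, elem in parser.read_events():
            if event == "start":
                # Attributes are available as soon as the opening tag is read.
                seen.append((event, elem.tag, dict(elem.attrib)))
            else:
                # The text content is complete once the closing tag is read.
                seen.append((event, elem.tag, elem.text))

    for chunk in chunks:
        parser.feed(chunk)   # the parser never needs the whole document at once
        drain()
    parser.close()
    drain()
    return seen


# Hypothetical chunking of the example fragment (with an assumed xmlns).
CHUNKS = ['<ns:example xmlns:ns="urn:example" attri',
          'bute="value">Text',
          'node</ns:example>']
print(pull_events(CHUNKS))
```

The application pulls events when it chooses to, and the document may arrive in arbitrary pieces; the exact moment at which each event becomes available depends on the underlying parser.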
In the case of SAX, for example, the parser calls functions implemented by the application, each specialized for one type of node. Taking the previous example again, the SAX parser makes the following calls, in order:
1. a function of the “STARTTAG” type with, as parameters, the local name of the element (here “example”), its qualified name (here “ns:example”), and a list of attributes (here a single attribute whose name is “attribute” and whose value is “value”);
2. a function of the “TEXTNODE” type with, as a parameter, the value of the text node (here “Textnode”); and,
3. a function of the “ENDTAG” type with, as parameters, the local name of the element (here “example”) and its qualified name (here “ns:example”).
The application can then use each item passed by the parser as a parameter of the functions for processing the data.
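The sequence of calls described above can be sketched, purely as an example, with the `xml.sax` module of the Python standard library. The class name `RecordingHandler`, the function name `sax_calls` and the labels “STARTTAG”, “TEXTNODE” and “ENDTAG” recorded by the handler are choices made for this sketch; the Python API itself names its callbacks `startElement`, `characters` and `endElement`. Namespace processing is left disabled, so the parser passes the qualified name and the local name is derived by the application.

```python
import xml.sax


class RecordingHandler(xml.sax.ContentHandler):
    """Records each callback made by the SAX parser, in order."""

    def __init__(self):
        super().__init__()
        self.calls = []
        self._text = []

    def _flush_text(self):
        # A parser may deliver one text node in several pieces; join them.
        if self._text:
            self.calls.append(("TEXTNODE", "".join(self._text)))
            self._text = []

    def startElement(self, name, attrs):
        self._flush_text()
        local = name.split(":")[-1]  # local name derived from the qualified name
        self.calls.append(("STARTTAG", local, name, dict(attrs.items())))

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.calls.append(("ENDTAG", name.split(":")[-1], name))


def sax_calls(xml_text):
    handler = RecordingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.calls


FRAGMENT = "<ns:example attribute='value'>Textnode</ns:example>"
for call in sax_calls(FRAGMENT):
    print(call)
```

The handler never sees a tree: it receives one node at a time, together with the items of that node, and must itself retain whatever state it needs.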
The XML language serves as a basis for certain languages, such as WSDL, XML Schema or RELAX NG, which describe components. These languages define various types of component, and the components are described as XML elements within an XML document. The identifier of a component, called its QName, combines the name of the component with an identifier global to all the components of the document. A component is identified uniquely by its QName and its type. These identifiers are used in particular to connect two components: during the XML processing of the components, it is necessary to connect them together by following the links expressed as references by QName. These links can point either to components already defined (“backward” references) or to components not yet defined (“forward” references), as shown in FIG. 7.
According to FIG. 7, the identifier “msg” in the line ‘input message="msg"’ refers to the line ‘message name="msg"’, and the identifier “pt” in the line ‘binding name="bd" type="pt"’ refers to the line ‘portType name="pt"’. Thus the identifier “msg” points to a component already defined, whilst the identifier “pt” points to a component that is defined only later in the document.
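The distinction between the two kinds of reference can be sketched as follows. The list `FIG7_LINES` is a simplified reconstruction from the four lines quoted above, not a copy of FIG. 7: the relative order of the two pairs of lines and the flat tuple representation are assumptions, as is the function name `classify_references`.

```python
def classify_references(components):
    """Label each reference 'backward' or 'forward' in document order.

    components: list of (kind, name, ref) tuples, where name is the name
    defined by the line (or None) and ref is a (kind, name) pair identifying
    the referenced component (or None).
    """
    defined = set()
    labels = []
    for kind, name, ref in components:
        if ref is not None:
            # A reference is "backward" if its target was already seen.
            direction = "backward" if ref in defined else "forward"
            labels.append((ref[1], direction))
        if name is not None:
            # A component is identified by its type together with its name.
            defined.add((kind, name))
    return labels


# Assumed reconstruction of the lines quoted from FIG. 7.
FIG7_LINES = [
    ("message", "msg", None),
    ("input", None, ("message", "msg")),
    ("binding", "bd", ("portType", "pt")),
    ("portType", "pt", None),
]
print(classify_references(FIG7_LINES))
# [('msg', 'backward'), ('pt', 'forward')]
```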
Circular references can also be used. For example, a first component references a second component, which itself references a third component, which in turn references the first component. If circular references exist, at least one component is of the mixed type, that is to say, it has both “backward” and “forward” references. It should however be noted that an unsequenced initial document may contain components of the mixed type which, after the resequencing of the document, are of the “forward” or “backward” type and therefore do not correspond to circular references.
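One possible way of detecting such circular references, given here only as a sketch, is a depth-first search over the reference graph. The dictionary representation (component name mapped to the names it references) and the function name `find_cycle` are assumptions made for this example.

```python
def find_cycle(refs):
    """Return one circular chain of references, or None if there is none.

    refs maps each component name to the list of names it references.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for target in refs.get(node, ()):
            if state.get(target) == IN_PROGRESS:
                # The target is on the current path: the chain loops back.
                return path[path.index(target):] + [target]
            if target not in state:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in refs:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


CIRCULAR = {"first": ["second"], "second": ["third"], "third": ["first"]}
print(find_cycle(CIRCULAR))
# ['first', 'second', 'third', 'first']
```

When the function returns None, the components can be resequenced so that no component remains of the mixed type; when it returns a chain, no ordering of the document can place every component of that chain after the components it references.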
During progressive processing of an XML document, these references must be managed whether or not the referenced components have already been defined. This management requires significant memory and computing resources.
Many documents are written in “backward” mode, in which every reference points to a component already defined. This is in particular the standard writing mode for WSDL documents. This mode makes it possible to resolve a reference as soon as it is detected.
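The benefit of the “backward” mode during progressive processing can be sketched as follows. The function name `resolve_stream`, the table of pending references and the representation of a document as a list of (name, referenced names) pairs are assumptions made for this illustration; the names “msg”, “pt” and “bd” are reused from FIG. 7, but the reference from “pt” to “msg” is a simplification.

```python
def resolve_stream(components):
    """Resolve references while reading components one at a time.

    components: iterable of (name, [referenced names]) in document order.
    Returns the resolved links and the peak number of unresolved references.
    """
    defined = set()
    pending = {}      # target name -> components still waiting for it
    links = []
    peak_pending = 0
    for name, refs in components:
        defined.add(name)
        # Forward references to this component can now be resolved.
        for referrer in pending.pop(name, []):
            links.append((referrer, name))
        for target in refs:
            if target in defined:
                links.append((name, target))      # resolved on detection
            else:
                pending.setdefault(target, []).append(name)
        peak_pending = max(peak_pending,
                           sum(len(waiting) for waiting in pending.values()))
    return links, peak_pending


BACKWARD_DOC = [("msg", []), ("pt", ["msg"]), ("bd", ["pt"])]
FORWARD_DOC = [("bd", ["pt"]), ("pt", ["msg"]), ("msg", [])]
print(resolve_stream(BACKWARD_DOC))
print(resolve_stream(FORWARD_DOC))
```

With the document in “backward” mode the table of pending references stays empty, whereas the same components in the opposite order force the processing unit to keep unresolved references in memory.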
The principle of converting an unordered document into a document in “backward” mode is described in U.S. patent application US 20050193135. According to that application, a server receiving a request for a document modifies the structure of the document in order to put it in “backward” mode and then transmits it to the client. In this context, the client can only process documents in “backward” mode. This solution does not take circular references into account. Nor does it take into account situations in which the document processing unit would have benefited from receiving the document with its references sequenced according to the “forward” mode.
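Resequencing into “backward” mode can be sketched with a topological sort, for example using the `graphlib` module of the Python standard library (Python 3.9 or later). This sketch is not the method of the cited application, whose implementation is not reproduced here; the function name `to_backward_order` and the dictionary representation of the references are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def to_backward_order(refs):
    """Order components so that each one follows the components it references.

    refs maps each component name to the names it references.
    Returns None when circular references make such an order impossible.
    """
    try:
        # Each referenced component is treated as a predecessor.
        return list(TopologicalSorter(refs).static_order())
    except CycleError:
        return None


print(to_backward_order({"bd": ["pt"], "pt": ["msg"], "msg": []}))
# ['msg', 'pt', 'bd']
print(to_backward_order({"a": ["b"], "b": ["a"]}))
# None
```

The second call shows the limitation noted above: when references are circular, no pure “backward” ordering exists, so a scheme relying on resequencing alone cannot handle such documents.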