1. Field of the Invention
The invention relates in general to the field of computer systems, and more particularly to a method and system for the compression of structured documents using document descriptions that conform to a generalized markup language, such as SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language). Such documents may contain multimedia information.
2. Description of the Related Art
Within a few years, computer networks have become the main medium for communication. Computers can now be plugged into a shared network, operating systems allow applications to exchange messages easily, the Internet infrastructure allows computers to find their interlocutors, and applications use complex algorithms to synchronize themselves.
In such a context of interoperability, generalized markup languages provide solutions for document processing. Indeed, the structure of a document plays a major role in how the document is used: formatting, printing or indexing a document is essentially performed in accordance with its structure. SGML was initially designed to make it easy to dissociate document presentation from document structure and content. Because of its ability to encode structures, XML attracted attention from different communities interested in non-document applications. The XML audience widened to include (among others) the electronic commerce, database and knowledge representation communities.
XML and, more generally, markup languages are now widely used to describe and structure documents (metadata). A structured document comprises several information elements which may be nested in each other. The information elements are identified and separated from each other by tags, which identify the element types of the information elements. A structured document generally comprises a first information element or base element which represents the entire document and which is identified by tags marking the start and end of the document. This first element comprises information sub-elements, for instance paragraphs of text, each information sub-element being identified by tags marking the start and end of the element. Tags may be associated with tag attributes that specify one or more characteristics of the information element.
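The nesting described above can be made concrete with a small illustration. The sketch below parses a hypothetical XML document (the element names `document`, `section`, `title` and `paragraph` and the attributes `lang` and `id` are invented for this example) and lists each information element with its nesting depth and tag attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document: a base element with nested sub-elements.
DOC = """<document lang="en">
  <section id="s1">
    <title>Overview</title>
    <paragraph>First paragraph.</paragraph>
  </section>
</document>"""

def outline(element, depth=0):
    """Return (depth, tag, attributes) for an element and all nested sub-elements."""
    rows = [(depth, element.tag, dict(element.attrib))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
rows = outline(root)
```

Here `rows` starts with the base element at depth 0, followed by its sub-elements in document order.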
Tag content represents information that is generally intended to be displayed or manipulated by a user. Tag content may be optional or required according to the type of tag, and may contain other nested information sub-elements which in turn are delimited by tags and have content and attributes.
A structured document may be associated with a schema which reflects the rules that the structured document should satisfy in order to be considered “valid”. The schema also contains information about default values, element and attribute types, and type hierarchies. Validity ensures that a received document conforms to the schema and thus has the intended meaning. Moreover, it determines the nature, i.e. the type, of each description item (information element or attribute). The XML standard includes an XML Schema Language which is designed to specify a grammar for a class of XML documents having similar structures.
However, XML is a verbose language and is thus inefficient to process and costly to transmit. For this reason, ISO/IEC 15938-1, and more particularly MPEG-7 (Moving Picture Experts Group), proposes a method and a binary format for encoding (compressing) the description of a structured document and for decoding such a binary format. This standard is more particularly designed to deal with highly structured data, such as multimedia data.
In order to gain compression efficiency, this method relies upon a schema analysis phase. During this phase, internal tables are computed to associate a binary code with each XML element, type and attribute. This method requires that an encoder and the corresponding decoder have full knowledge of the same schema.
When a schema used to encode structured documents needs to be extended, the best solution is to make the extended schema available to the decoder. However, in specific cases, it is not possible to easily update the decoders in order to give them access to the extended schema.
An object of the invention is to provide a method for encoding a structured document in such a manner that the document can be partially decoded even if not all of the needed schemas are known to the decoder.
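To illustrate why the encoder and decoder must share the same schema, the following minimal sketch builds a code table from a list of element names. It is not the code assignment defined by the standard: the function name `build_code_table`, the alphabetical ordering and the fixed code width are assumptions made only for this example.

```python
def build_code_table(names):
    """Assign a fixed-width binary code to each distinct name.

    Assumption for this sketch: names are ordered alphabetically and all
    codes have the same width, just large enough to distinguish them.
    """
    ordered = sorted(set(names))
    width = max(1, (len(ordered) - 1).bit_length()) if ordered else 0
    return {name: format(index, f"0{width}b") for index, name in enumerate(ordered)}

# Hypothetical element names taken from a schema.
table = build_code_table(["title", "author", "paragraph", "section", "figure"])
```

Because the table is derived deterministically from the schema, an encoder and a decoder that analyze the same schema obtain the same codes. A decoder holding a different version of the schema would derive a different table, which is the difficulty addressed below.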
Another object of the invention is to provide such an encoding method ensuring a backward and forward compatibility, i.e. enabling a decoder to at least partially decode a structured document having a structure defined in at least a first schema not accessible to the decoder and resulting from a change of at least a second schema accessible to the decoder, a structured document comprising information elements nested in each other, the information elements of the document being associated in at least a first and a second schemas with respective element types each defining the respective element structures of the information elements, the first schema being not accessible to a decoder and the second schema being accessible to the decoder, the first schema defining at least one derived information element which is derived from a corresponding element defined in the second schema.
According to the present invention, the encoding method comprises the steps of:
encoding the document using said first and second schemas into a binary stream comprising for each information element of the document a binary sequence encoding the information element, and
inserting in the binary sequence encoding the derived information element a reference designating the first schema in which the structure of the derived element is defined, said reference designating the first schema being defined in a schema reference list containing references to all schemas used for encoding the document, the schema reference list being made accessible to the decoder.
According to an aspect of the present invention, the binary sequence encoding each element of the document comprises a content field containing an encoded value of the element and a length field placed before the content field and containing an encoded value of a length of the content field.
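As an illustration of a length field placed before a content field, the sketch below encodes each element as a length followed by its content. The variable-length integer format used for the length (7 bits per byte with a continuation bit) and the function names are assumptions for this example; the text above does not specify how the length value is encoded.

```python
def encode_varint(value):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode an integer written by encode_varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_element(content):
    """Length field first, then the content field."""
    return encode_varint(len(content)) + content

def read_element(buf, pos):
    """Return (content, next_pos); the length field locates the element's end."""
    length, pos = decode_varint(buf, pos)
    return buf[pos:pos + length], pos + length

stream = encode_element(b"abc") + encode_element(b"x" * 200)
first, after_first = read_element(stream, 0)
second, after_second = read_element(stream, after_first)
```

The length field lets a reader find the end of an element without interpreting its content.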
According to another aspect of the present invention, the derived information element is associated in the first schema to a structure type which is restricted with respect to the structure type of the corresponding information element in the second schema, the binary sequence encoding the derived element comprising a content field and appended to the content field, a reference to the first schema and a reference to the structure type of the derived element, defined in the second schema.
According to another aspect of the present invention, the derived information element is associated in the first schema to a structure type which is extended with respect to the structure type of the corresponding information element in the second schema, the structure type of the derived information element comprising a first part having the structure type of the corresponding information element defined in the second schema and a second part specific to the derived information element and having a structure type defined in the first schema, the binary sequence encoding the derived element comprising a content field comprising:
a field containing the reference to the second schema,
a field containing a structure type reference to the structure type of the corresponding element in the second schema,
a field containing an encoded value of the first part,
a field containing the reference to the first schema,
a field containing a structure type reference to the structure type of the second part, and
a field containing an encoded value of said second part.
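The six fields listed above can be laid out as in the following sketch. The one-byte width of every reference and the function name `encode_extended_content` are assumptions for illustration only; the actual field widths are not given in this passage.

```python
def encode_extended_content(base_schema_ref, base_type_ref, base_part,
                            ext_schema_ref, ext_type_ref, ext_part):
    """Concatenate the six fields of an extended element's content field.

    Assumption for this sketch: every schema reference and structure type
    reference fits in one byte.
    """
    return (
        bytes([base_schema_ref, base_type_ref])  # refs to the second schema
        + base_part                              # encoded value of the first part
        + bytes([ext_schema_ref, ext_type_ref])  # refs to the first schema
        + ext_part                               # encoded value of the second part
    )

# Hypothetical values: schema 0 is the known schema, schema 1 the extension.
content = encode_extended_content(0, 5, b"abc", 1, 9, b"xy")
```

With this ordering, everything a decoder needs in order to decode the first part precedes the first field that refers to the unknown schema.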
According to another aspect of the present invention, the binary sequence encoding an information element comprises a substitution field including a substitution flag indicating whether or not the name of the information element is changed, and if the substitution flag indicates a change, an element name reference field containing a reference designating a new name of the information element, and a schema reference field containing a reference to a schema where the new name reference is defined.
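The substitution field can be sketched as follows. The sketch stores the flag in a whole byte and each reference in one byte, and it places the element name reference before the schema reference. These are all assumptions made for readability, and the function names are hypothetical.

```python
def encode_substitution(new_name_ref=None, schema_ref=None):
    """Return the substitution field: a flag, then two references if set."""
    if new_name_ref is None:
        return bytes([0])  # flag cleared: the element name is unchanged
    return bytes([1, new_name_ref, schema_ref])

def decode_substitution(buf, pos=0):
    """Return ((name_ref, schema_ref) or None, next_pos)."""
    if buf[pos] == 0:
        return None, pos + 1
    return (buf[pos + 1], buf[pos + 2]), pos + 3

unchanged = encode_substitution()
renamed = encode_substitution(new_name_ref=7, schema_ref=1)
```

An unchanged element costs only the flag, and the two reference fields are present only when the flag indicates a change.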
According to another aspect of the present invention, the binary sequence encoding at least one information element in the encoded document comprises a schema status mode field having:
a first state indicating that the information element is not changed in the first schema with respect to a corresponding element in the second schema,
a second state indicating that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, and
a third state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema,
the encoded information element comprising no schema reference and no other change information when the schema status mode field is in the first state, and none of sub-elements of the information element comprising a schema reference and any other change information when the schema status mode field is in the second state.
According to another aspect of the present invention, the binary sequence encoding at least one information element in the encoded document comprises a schema status mode field having a first state indicating that the information element is not changed in the first schema with respect to a corresponding element in the second schema, a second state indicating that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, a third state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema, and a fourth state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema and that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, the encoded information element comprising no schema reference and no other change information when the schema status mode field is in the first state, and none of sub-elements of the information element comprising a schema reference and any other change information when the schema status mode field is in the second state or fourth state.
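The four-state variant of the schema status mode field can be summarized in a short sketch. The numeric values, the state names and the two helper functions are assumptions for illustration. In particular, the sketch reads the second state as meaning that the element itself is also unchanged, which is one interpretation of the text.

```python
from enum import IntEnum

class SchemaStatusMode(IntEnum):
    """Hypothetical two-bit codes for the four states."""
    UNCHANGED = 0                     # first state
    SUBELEMENTS_UNCHANGED = 1         # second state
    CHANGED = 2                       # third state
    CHANGED_SUBELEMENTS_UNCHANGED = 3 # fourth state

def element_has_change_info(mode):
    """Whether the element itself may carry schema references (assumed reading)."""
    return mode in (SchemaStatusMode.CHANGED,
                    SchemaStatusMode.CHANGED_SUBELEMENTS_UNCHANGED)

def subelements_have_change_info(mode):
    """Whether sub-elements may carry schema references or other change information."""
    return mode not in (SchemaStatusMode.SUBELEMENTS_UNCHANGED,
                        SchemaStatusMode.CHANGED_SUBELEMENTS_UNCHANGED)
```

In the second and fourth states, an entire subtree can be encoded without per-element schema references or change information.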
According to another aspect of the present invention, the schema reference list comprising references to all schemas used for encoding the structured document is inserted in a header associated to the binary stream encoding the structured document.
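A header carrying the schema reference list might look like the sketch below, in which a schema reference is simply the position of a schema identifier in the list. The byte layout (a one-byte count, then length-prefixed UTF-8 identifiers), the example URNs and the function names are all hypothetical.

```python
def build_header(schema_uris):
    """Serialize the schema reference list: count, then length-prefixed URIs.

    Assumption for this sketch: fewer than 256 schemas, each identifier
    shorter than 256 bytes once encoded as UTF-8.
    """
    out = bytearray([len(schema_uris)])
    for uri in schema_uris:
        raw = uri.encode("utf-8")
        out.append(len(raw))
        out += raw
    return bytes(out)

def parse_header(buf):
    """Return (list of schema URIs, position of the first byte after the header)."""
    count = buf[0]
    pos = 1
    uris = []
    for _ in range(count):
        size = buf[pos]
        pos += 1
        uris.append(buf[pos:pos + size].decode("utf-8"))
        pos += size
    return uris, pos

schemas = ["urn:example:schema:base", "urn:example:schema:extended"]
header = build_header(schemas)
uris, body_start = parse_header(header)

# A decoder that only has the base schema can tell which references to skip.
known = {"urn:example:schema:base"}
unknown_refs = [ref for ref, uri in enumerate(uris) if uri not in known]
```

After reading the header, the decoder knows which schema references in the stream designate schemas it cannot access.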
Another object of the present invention is to provide a method for at least partially decoding a binary stream encoding a structured document having a structure defined in at least a first schema not accessible to the decoder and resulting from a change of at least a second schema accessible to the decoder, a structured document comprising information elements nested in each other, the information elements of the document being associated in at least a first and a second schemas with respective element types each defining the respective element structures of the information elements, the first schema being not accessible to a decoder and the second schema being accessible to the decoder, the first schema defining at least one derived information element which is derived from a corresponding element defined in the second schema.
According to the present invention, the decoding method comprises the steps of:
sequentially reading a binary stream encoding the structured document using the second schema so as to detect in the binary stream binary sequences encoding each information element of the document,
detecting in each detected binary sequence of an encoded information element a reference to the first schema, as defined in a schema reference list known by the decoder,
if such a reference to the first schema is not detected in the detected binary sequence, decoding said detected binary sequence according to the corresponding element in the second schema,
if such a reference to the first schema is detected in the detected binary sequence, skipping the binary data relative to said first schema during the sequential reading and decoding of said binary stream.
According to another aspect of the present invention, the binary sequence encoding each element of the document comprises a content field containing an encoded value of the element and a length field placed before the content field and containing an encoded value of a length of the content field, the encoded length value being used by the decoder for determining the end of the binary sequence encoding an element.
According to another aspect of the present invention, the decoding method further comprises the steps of:
reading and decoding a length coded value in the binary sequence containing a reference to the first schema, and
determining a length of binary data to skip as a function of the decoded length value and the position in the binary sequence of the reference to the first schema.
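The skip computation can be illustrated with a toy decoder. The byte layout below is an assumption for this sketch: a one-byte length field, then a content field made of parts, each part being a one-byte schema reference, a one-byte structure type reference and an encoded value whose size the decoder knows for accessible schemas. The function `decode_element` and the table `known_types` are hypothetical.

```python
def decode_element(buf, pos, known_types):
    """Decode the parts the decoder understands and skip the rest.

    known_types maps (schema_ref, type_ref) to the size in bytes of the
    encoded part, for schemas accessible to the decoder.
    Returns (decoded_parts, skipped_byte_count, next_pos).
    """
    length = buf[pos]            # length field placed before the content field
    pos += 1
    end = pos + length           # end of this element's binary sequence
    parts = []
    skipped = 0
    while pos < end:
        schema_ref, type_ref = buf[pos], buf[pos + 1]
        size = known_types.get((schema_ref, type_ref))
        if size is None:
            # Reference to an inaccessible schema: the amount to skip follows
            # from the decoded length and the position of the reference.
            skipped = end - pos
            break
        pos += 2
        parts.append(buf[pos:pos + size])
        pos += size
    return parts, skipped, end

known_types = {(0, 5): 3}  # schema 0 is accessible, type 5 is 3 bytes long
stream = (bytes([9, 0, 5]) + b"abc" + bytes([1, 9]) + b"xy"   # extended element
          + bytes([5, 0, 5]) + b"def")                       # plain element
first_parts, first_skipped, pos = decode_element(stream, 0, known_types)
second_parts, second_skipped, pos = decode_element(stream, pos, known_types)
```

The first element is partially decoded: its base part is recovered and the four bytes belonging to the unknown schema are skipped, after which decoding resumes at the next element.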
According to another aspect of the present invention, the derived information element is associated in the first schema to a structure type which is restricted with respect to the structure type of the corresponding information element in the second schema, the binary sequence encoding the derived element comprising a content field and appended to the content field, a reference to the first schema and a reference to the structure type of the derived element, defined in the second schema.
According to another aspect of the present invention, the derived information element is associated in the first schema to a structure type which is extended with respect to the structure type of the corresponding information element in the second schema, the structure type of the derived information element comprising a first part having the structure type of the corresponding information element defined in the second schema and a second part specific to the derived information element and having a structure type defined in the first schema, the binary sequence encoding the derived element comprising a content field comprising:
a field containing the reference to the second schema,
a field containing a structure type reference to the structure type of the corresponding element in the second schema,
a field containing an encoded value of the first part,
a field containing the reference to the first schema,
a field containing a structure type reference to the structure type of the second part, and
a field containing an encoded value of said second part.
According to another aspect of the present invention, the derived information element has in the first schema a name which is changed with respect to the name of the corresponding information element in the second schema, the binary sequence encoding the derived element including a substitution field comprising a substitution flag indicating whether or not the name of the derived information element is changed, and if the substitution flag indicates a change, a schema reference field containing a reference to the first schema and an element name reference designating the name of the derived information element in the first schema.
According to another aspect of the present invention, the binary sequence encoding at least one information element in the encoded document comprises a schema status mode field having:
a first state indicating that the information element is not changed in the first schema with respect to a corresponding element in the second schema,
a second state indicating that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, and
a third state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema,
the encoded information element comprising no schema reference and no other change information when the schema status mode field is in the first state, and none of sub-elements of the information element comprising a schema reference and any other change information when the schema status mode field is in the second state.
According to another aspect of the present invention, the binary sequence encoding at least one information element in the encoded document comprises a schema status mode field having a first state indicating that the information element is not changed in the first schema with respect to a corresponding element in the second schema, a second state indicating that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, a third state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema, and a fourth state indicating that the information element is changed in the first schema with respect to the corresponding element in the second schema and that none of sub-elements of the information element are changed in the first schema with respect to the corresponding element in the second schema, the encoded information element comprising no schema reference and no other change information when the schema status mode field is in the first state, and none of sub-elements of the information element comprising a schema reference and any other change information when the schema status mode field is in the second state or fourth state.
According to another aspect of the present invention, the schema reference list comprising references to all schemas used for encoding the structured document is read in a header associated to the binary stream encoding the structured document.