The invention relates in general to the field of computer systems, and more particularly to a method and system for the compression of structured documents using document descriptions that conform to a generalized markup language, such as SGML (Standard Generalized Markup Language) or XML (Extensible Markup Language). The invention applies more particularly to metadata describing digital video programs and to mobile services.
Within a few years, computer networks have become the main medium for communications. Computers can now be connected to a shared network, operating systems allow applications to exchange messages easily, the Internet infrastructure allows computers to locate their counterparts, and applications use complex algorithms to synchronize with one another.
In such a context of interoperability, generalized markup languages provide solutions for document processing. Indeed, the structure of a document plays a major role in how the document is used: formatting, printing or indexing a document is essentially performed in accordance with its structure. SGML was initially designed to separate document presentation from document structure and content. Because of its ability to encode structures, XML has attracted attention from communities interested in non-document applications. The XML audience has widened to include (among others) the electronic commerce, database and knowledge representation communities.
XML, and more generally markup languages, are now widely used to describe and structure documents (metadata). A structured document comprises several information elements which may be nested within each other in a tree-like structure. The information elements are identified and separated from each other by tags, which identify the element types of the information elements. A structured document generally comprises a first information element, or base element, which represents the entire document and which is identified by tags marking the start and end of the document. This first element comprises information sub-elements, for instance paragraphs of text, each information sub-element being identified by tags marking the start and end of the element. Tags may be associated with tag attributes that specify one or more characteristics of the information element.
Tag content represents information that is generally intended to be displayed or manipulated by a user. Tag content may be optional or required according to the type of tag, and may contain other nested information sub-elements which in turn are delimited by tags and have content and attributes.
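By way of illustration only, the tree-like structure described above can be sketched in Python using the standard `xml.etree.ElementTree` module. The sample document, its element names and its attribute names are hypothetical and are not taken from any particular schema; the `walk` helper is likewise introduced only for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document, invented for illustration only.
DOC = """<program id="p1">
  <title lang="en">Evening News</title>
  <synopsis>
    <paragraph>Daily headlines.</paragraph>
  </synopsis>
</program>"""

def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) for each element in document order."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:
        # Nested information sub-elements are visited one level deeper.
        yield from walk(child, depth + 1)

root = ET.fromstring(DOC)   # base element representing the entire document
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    print("  " * depth + f"{tag} {attrs} {text!r}")
```

In this sketch the base element `program` carries one attribute and no text of its own, while its nested sub-elements carry the content intended for the user.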
A structured document may be associated with a schema which specifies the rules that the structured document must satisfy in order to be considered “valid”. The schema also contains information about default values, and defines element and attribute types and type hierarchies. Validity ensures that a received document conforms to the schema and thus has the intended meaning. Moreover, it determines the nature, i.e., the type, of each description item (information element or attribute). The XML family of standards includes an XML Schema language, which is designed to specify a grammar for a class of XML documents having similar structures. Each element type and attribute has a respective name which belongs to an XML namespace.
However, XML is a verbose language, and is therefore inefficient to process and costly to transmit. For this reason, MPEG-7 (Moving Picture Experts Group), and more particularly ISO/IEC 15938-1, proposes a method and a binary format for encoding (compressing) the description of a structured document, and for decoding such a binary format. This standard is more particularly designed to deal with highly structured data, such as multimedia metadata.
As disclosed in U.S. Patent Application Nos. 2004/0013307 and 2004/0054692 filed by the Applicant, the contents of which are incorporated by reference herein, this method relies upon a schema analysis phase in order to achieve compression efficiency. During this phase, internal tables are computed to associate a binary code with each XML element, type and attribute. This method requires that an encoder and the corresponding decoder have full knowledge of the same schema.
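The role of such internal tables may be illustrated by the following simplified Python sketch. It is not the encoding defined by ISO/IEC 15938-1 nor the method of the cited applications; it merely assumes a hypothetical flat vocabulary of names (`SCHEMA_NAMES`) standing in for the result of a schema analysis, and assigns each name a fixed-width binary code. The functions `build_code_table`, `encode` and `decode` are names introduced for this illustration only.

```python
# Hypothetical vocabulary; a real schema analysis would derive the elements,
# types and attributes from the schema itself. "@" marks attributes here.
SCHEMA_NAMES = ["program", "title", "synopsis", "paragraph", "@id", "@lang"]

def build_code_table(names):
    """Assign each name a fixed-width binary code.

    Names are sorted first, so any party starting from the same schema
    vocabulary obtains the same table regardless of input order.
    """
    ordered = sorted(set(names))
    width = max(1, (len(ordered) - 1).bit_length())
    return {name: format(i, f"0{width}b") for i, name in enumerate(ordered)}

def encode(sequence, table):
    """Replace each name in the sequence by its binary code."""
    return "".join(table[name] for name in sequence)

def decode(bits, table):
    """Split the bit string into fixed-width codes and map them back to names."""
    width = len(next(iter(table.values())))
    reverse = {code: name for name, code in table.items()}
    return [reverse[bits[i:i + width]] for i in range(0, len(bits), width)]

# Encoder and decoder each build their table independently from the same schema.
encoder_table = build_code_table(SCHEMA_NAMES)
decoder_table = build_code_table(list(reversed(SCHEMA_NAMES)))

sequence = ["program", "@id", "title", "@lang", "synopsis", "paragraph"]
bits = encode(sequence, encoder_table)
decoded = decode(bits, decoder_table)
print(bits, decoded)
```

The sketch shows why both sides must know the same schema: the codes carry no names, so a decoder whose table was built from a different vocabulary would recover different or meaningless names.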
In some applications, such as digital video broadcast, metadata are transmitted in the video stream in the form of containers grouping together data fragments which are likely to be rather small. This implies limited redundancy, notably in the string data, and therefore string compression algorithms that exploit string redundancy, such as ZLIB, are in some cases not as efficient as expected.
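The effect of fragment size on redundancy-based compression may be observed with the following Python sketch, which uses the standard `zlib` module. The fragments are hypothetical and were invented for this illustration; the exact byte counts depend on the data and on the zlib version, so no particular figures are asserted here.

```python
import zlib

# Hypothetical small metadata fragments, each of a few tens of bytes.
fragments = [
    f'<title lang="en">Programme {i}</title>'.encode("utf-8")
    for i in range(20)
]

# Case 1: each fragment is compressed on its own, as when fragments
# are carried and decoded independently.
separate = sum(len(zlib.compress(fragment)) for fragment in fragments)

# Case 2: all fragments are compressed as a single block, so that the
# redundancy shared between fragments can be exploited.
together = len(zlib.compress(b"".join(fragments)))

raw = sum(len(fragment) for fragment in fragments)
print(f"raw: {raw} bytes, separate: {separate} bytes, together: {together} bytes")
```

When each fragment is compressed alone, the compressor sees almost no repeated strings and adds a fixed per-stream overhead, so the total remains close to the uncompressed size, whereas the single block benefits from the repetition across fragments.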