The present invention relates to a method for compressing/decompressing structured documents.
It applies particularly, but not exclusively, to the transmission of documents such as images or image sequences, video or sound data, via digital data transmission networks, and to the storage of such documents.
A number of digital document compression algorithms currently exist. Some are designed to process the document's binary data directly, without taking account of the data type. These algorithms have the advantage of being able to process any document, but achieve only a low compression ratio when processing bulky documents, which are generally of the sound or image type.
Other compression algorithms are known which are more efficient, but which are specially adapted to one data type, for example image or sound. As a result, they cannot be used, or are ineffective, when applied to documents that do not exclusively contain the data for which they are designed.
Increasingly however, the documents being used and circulating on data transmission networks contain several information types integrated in one structure.
A structured document is a collection of data sets, each associated with a type and arranged together according to mainly hierarchical relationships. These documents employ a structuring language such as SGML, HTML or XML, which makes it possible in particular to distinguish the different data sets composing the document. In contrast, in a so-called linear document, the document's content information is mixed with the presentation and typing information.
A structured document thus includes locators or markers separating the different data sets of the document. In the case of SGML, XML or HTML formats, these locators, known as “tags”, are of the form “<XXXX>” and “</XXXX>”, the first marker indicating the start of the data set “XXXX” and the second the end of this set. A data set may be composed of several lower level data sets. In this way, a structured document has a hierarchical or tree structure schema, each node representing a data set and being connected to a higher hierarchical level node representing the data set which contains it. The nodes located at the end of a branch of this tree structure represent data sets containing data of a predefined type, which cannot be broken down into data subsets.
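By way of illustration only, the tree structure described above can be sketched in Python using the standard xml.etree.ElementTree module. The sample document and its element names ("report", "title", "body", "text", "image") are hypothetical and are not taken from any particular schema; the helper functions walk and leaves are likewise assumptions introduced for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; the element names are illustrative only.
DOC = (
    "<report>"
    "<title>Survey</title>"
    "<body><text>Hello</text><image>iVBOR</image></body>"
    "</report>"
)


def walk(node, depth=0):
    """Yield (depth, tag, is_leaf) for every data set, in document order."""
    children = list(node)
    yield depth, node.tag, not children
    for child in children:
        yield from walk(child, depth + 1)


def leaves(root):
    """Return (tag, content) for the data sets at the end of each branch."""
    return [(n.tag, n.text) for n in root.iter() if len(n) == 0]


root = ET.fromstring(DOC)
tree = list(walk(root))

for depth, tag, is_leaf in tree:
    print("  " * depth + tag + (" (leaf)" if is_leaf else ""))
```

Each start tag opens a node and the matching end tag closes it, so nesting depth in the text corresponds directly to depth in the tree. Only the leaf nodes carry typed data that cannot be broken down further.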
A structured document is generally associated with what is called a structure schema setting out in rule form the structure and type of information of each data set of the document. A schema is constituted by nested groups of data set structures, these groups being for example ordered sequences, alternative element groups or necessary element groups, sequenced or non-sequenced.
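As a minimal sketch, and assuming a much simplified rule model, a structure schema made up of nested groups could be represented as follows in Python. The class names Leaf, Sequence and Choice, the function conforms and the sample schema are hypothetical; they stand for a typed data set, an ordered sequence and an alternative element group respectively, and do not reproduce the syntax of any actual schema language.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    name: str
    data_type: str  # declared type of the data, e.g. "text" or "binary"


@dataclass
class Sequence:
    name: str
    items: List["Rule"]  # child data sets must appear in exactly this order


@dataclass
class Choice:
    name: str
    options: List["Rule"]  # exactly one of the alternatives must appear


Rule = Union[Leaf, Sequence, Choice]


def conforms(node, rule):
    """Check a (name, content) node against a rule.

    Content is a list of child nodes for a group and any other value for a
    leaf. The leaf's data_type is recorded but not verified in this sketch.
    """
    name, content = node
    if name != rule.name:
        return False
    if isinstance(rule, Leaf):
        return not isinstance(content, list)
    if not isinstance(content, list):
        return False
    if isinstance(rule, Sequence):
        return len(content) == len(rule.items) and all(
            conforms(child, item) for child, item in zip(content, rule.items)
        )
    # Choice: a single child matching one of the alternatives.
    return len(content) == 1 and any(
        conforms(content[0], option) for option in rule.options
    )


# Hypothetical schema: a report is a title followed by a body, and the body
# holds either a text data set or an image data set.
SCHEMA = Sequence("report", [
    Leaf("title", "text"),
    Choice("body", [Leaf("text", "text"), Leaf("image", "binary")]),
])

doc_ok = ("report", [("title", "Survey"), ("body", [("image", b"\x89PNG")])])
doc_bad = ("report", [("body", [("text", "hi")]), ("title", "Survey")])

print(conforms(doc_ok, SCHEMA), conforms(doc_bad, SCHEMA))
```

The second document is rejected only because its data sets appear in the wrong order, which shows how a sequenced group constrains structure independently of content.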
A structured document is thus associated with a structure schema and contains separation markers represented in the form of textual or binary data, these markers delimiting data sets which are themselves able to contain other data sets delimited by markers. As a result, a document structured in this way is able to include not only textual data, but also any other type of information (for example sound data, images, etc.). Consequently, compression algorithms specific to one particular type of data are ineffective and ill-suited to processing this type of document.