Presently, data is often stored and transmitted in a structured document that contains a plurality of different types of data. A structured document is a set of elements, each associated with a type and at least one attribute, and interconnected by relationships that are mainly hierarchical. A typical example of a structured document is an Extensible Markup Language (XML) document.
The structured document includes markers (also called “tags”) for separating different elements. An element may itself comprise a plurality of attributes and lower-level elements, which are also called sub-elements. Thus, the structured document presents a tree or hierarchical structure in which each node represents an element and is connected to a node at a higher hierarchical level, which represents the element that contains the lower-level elements. The nodes located at the ends of branches in such a tree structure represent elements containing data that cannot be divided into information sub-elements. Herein, the data of a node located at the end of a branch is regarded as an attribute value of a certain type.
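The tree view described above can be sketched with a short Python example. This is only an illustration: the sample document and the helper name walk are hypothetical and are not taken from any particular compression method. The sketch parses a small document and marks which nodes lie at the ends of branches and therefore carry indivisible data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<library>
  <book id="b1">
    <title>XML Basics</title>
    <year>2004</year>
  </book>
</library>"""


def walk(element, depth=0, rows=None):
    """Collect (depth, tag, is_leaf, text) for every node, top down."""
    if rows is None:
        rows = []
    children = list(element)
    # A node with no sub-elements sits at the end of a branch.
    is_leaf = not children
    text = (element.text or "").strip() if is_leaf else None
    rows.append((depth, element.tag, is_leaf, text))
    for child in children:
        walk(child, depth + 1, rows)
    return rows


rows = walk(ET.fromstring(DOC))
for depth, tag, is_leaf, text in rows:
    label = f" = {text!r}" if is_leaf else ""
    print("  " * depth + tag + label)
```

Here library and book are interior nodes, while title and year are leaf nodes whose text is the data value.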
There are several compression methods for encoding structured documents, one of which is the schema-based compression method. The schema that defines a structured document is itself a structured document. A typical example of a schema is the XML schema. Generally, an XML schema is a set of schema components that define the structure of an XML instance. “Schema component” is a generic term for the building blocks that make up the data model template of the schema; a schema component is itself an element. In the process of compressing an instance of a structured document using a schema-based compression method, a Finite State Automaton (FSA) is derived from the definition of a schema, and an instance of the schema, or a portion of such an instance, can then be converted to a bit stream with the aid of the corresponding FSA. Some schema components may have an occurrence constraint, which is defined by the minOccurs and maxOccurs attributes. Schema components of this kind are usually called occurrence nodes.
Below is an example of an XML schema containing an occurrence node with maxOccurs attribute set to 100.
<?xml version="1.0" encoding="ISO-8859-1"?>
<schema targetNamespace="urn:thomson:SchemaExample"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:s="urn:thomson:SchemaExample"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <element name="testSchema">
    <complexType>
      <choice maxOccurs="100">
        <element name="e1" type="xs:string"/>
        <element name="e2" type="xs:string"/>
        <element name="e3" type="xs:string"/>
        <element name="e4" type="xs:string"/>
        <element name="e5" type="xs:string"/>
      </choice>
    </complexType>
  </element>
</schema>
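As a rough illustration of how the occurrence node in the schema above might be located programmatically, the following Python sketch scans a copy of the schema for components that carry an explicit minOccurs or maxOccurs attribute. The function name occurrence_nodes is hypothetical, and the embedded schema omits the XML declaration for simplicity. Under the XML Schema rules, both attributes default to 1 when omitted, and maxOccurs may also take the value unbounded.

```python
import xml.etree.ElementTree as ET

# Copy of the example schema (XML declaration omitted for simplicity).
SCHEMA = """<schema targetNamespace="urn:thomson:SchemaExample"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:s="urn:thomson:SchemaExample"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <element name="testSchema">
    <complexType>
      <choice maxOccurs="100">
        <element name="e1" type="xs:string"/>
        <element name="e2" type="xs:string"/>
        <element name="e3" type="xs:string"/>
        <element name="e4" type="xs:string"/>
        <element name="e5" type="xs:string"/>
      </choice>
    </complexType>
  </element>
</schema>"""


def occurrence_nodes(schema_text):
    """Return (component, minOccurs, maxOccurs) for every schema component
    that carries an explicit occurrence constraint (hypothetical helper)."""
    root = ET.fromstring(schema_text)
    found = []
    for node in root.iter():
        if "minOccurs" in node.attrib or "maxOccurs" in node.attrib:
            # ElementTree tags look like "{namespace}local"; keep the local part.
            local_name = node.tag.split("}")[-1]
            # Both attributes default to 1 when they are omitted.
            min_occurs = int(node.get("minOccurs", "1"))
            max_raw = node.get("maxOccurs", "1")
            # None stands for "unbounded" in this sketch.
            max_occurs = None if max_raw == "unbounded" else int(max_raw)
            found.append((local_name, min_occurs, max_occurs))
    return found


nodes = occurrence_nodes(SCHEMA)
print(nodes)
```

For this schema the only occurrence node is the choice component, which may occur between 1 and 100 times.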
Below is an example of an instance according to the above XML schema.
<?xml version="1.0" encoding="ISO-8859-1"?>
<s:testSchema xmlns:s="urn:thomson:SchemaExample"
              xmlns:b="urn:thomson:SchemaB"
              xmlns:a="urn:thomson:SchemaA"
              xmlns:c="urn:thomson:SchemaC"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:thomson:SchemaExample ./SchemaExample.xsd">
  <e1>AAAA</e1>
  <e1>BBBB</e1>
  <e1>CCCC</e1>
  <e1>DDDD</e1>
  <e1>EEEE</e1>
</s:testSchema>
It can be seen that element e1 repeats five times, with different data values, in this XML instance. The conventional schema-based compression method generates the same structure information for element e1 five times in the resulting encoded bit stream, and this repetition is redundant.
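The redundancy can be made concrete with a toy Python sketch. It is not the encoding of any actual schema-based compression method: it assumes, purely for illustration, that each transition out of the FSA state for the repeated choice is written as a fixed-length code over six symbols (e1 to e5 plus a hypothetical end symbol), and it ignores the data values entirely. The names CODES, END and encode_structure are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

CHOICES = ["e1", "e2", "e3", "e4", "e5"]
END = "<end>"  # hypothetical symbol for leaving the repeated choice

# Simplified copy of the example instance (unused namespaces dropped).
INSTANCE = """<s:testSchema xmlns:s="urn:thomson:SchemaExample">
  <e1>AAAA</e1><e1>BBBB</e1><e1>CCCC</e1><e1>DDDD</e1><e1>EEEE</e1>
</s:testSchema>"""


def build_codes(symbols):
    """Assign each symbol a fixed-length binary code (assumed scheme)."""
    width = max(1, (len(symbols) - 1).bit_length())
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}


CODES = build_codes(CHOICES + [END])


def encode_structure(instance_text):
    """Emit one code per child element, then the end code.
    Only structure information is produced; data values are skipped."""
    root = ET.fromstring(instance_text)
    codes = [CODES[child.tag] for child in root]
    codes.append(CODES[END])
    return codes


codes = encode_structure(INSTANCE)
structure_bits = sum(len(c) for c in codes)
print(codes, structure_bits)
```

Under these assumptions the code for e1 is written five times in a row, so 15 of the 18 structure bits repeat the same information.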