Since structured documents such as XML documents and HTML documents are in text format, processing apparatuses that analyze these structured documents have largely read, written, and saved the structured documents in text format. However, since structured documents include redundant data, it takes time for a computer to read or write a structured document as text data. Therefore, a technology called binary XML has been developed in recent years that reduces data size by representing and processing structured documents in binary data format. Note that XML stands for eXtensible Markup Language, while HTML stands for HyperText Markup Language.
With Fast Infoset, developed by Sun Microsystems, for example, vocabularies such as element names and attribute names included in the XML data are encoded by being allocated numbers in the order in which they appear in the XML data. This enables the size of the XML data to be reduced. A table showing the correspondence between the codes and vocabularies is called an encoding table. Note that Fast Infoset is discussed in detail at the page reached by the following link:
http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=41327&scopelist=PROGRAMME
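The allocation of numbers in order of appearance can be sketched as follows. This is a simplified illustration of the idea only and is not the actual Fast Infoset format; the function name build_encoding_table and the sample document are assumptions introduced here, and a single table is used for both element names and attribute names for brevity.

```python
import xml.etree.ElementTree as ET


def build_encoding_table(xml_text):
    """Allocate integer codes to element and attribute names
    in the order in which they first appear in the document."""
    table = {}
    root = ET.fromstring(xml_text)
    # root.iter() visits elements in document order.
    for elem in root.iter():
        for name in [elem.tag, *elem.attrib]:
            if name not in table:
                table[name] = len(table) + 1
    return table


# Hypothetical sample document (not taken from the figures).
sample = '<svg><circle cx="10" cy="20" r="5"/><circle cx="30" cy="20" r="5"/></svg>'
table = build_encoding_table(sample)
```

In this sketch, a name that appears a second time reuses the code allocated at its first appearance, so repeated names cost one code each rather than a full character string.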
Where there is an array of numeric values partitioned by separators such as commas and spaces in the attribute values or element content, XEUS, developed by KDDI, partitions the XML data at the separators and encodes the individual numeric values, rather than encoding the entire data as a character string. This enables XML data to be efficiently compressed. Note that XEUS stands for XML document Encoding with Uniformed Sheet.
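The difference between encoding a numeric array as a character string and encoding its partitioned values can be sketched as follows. This is not the XEUS format itself; the function name encode_numeric_list, the choice of 32-bit little-endian floats, and the sample value are assumptions made for illustration.

```python
import re
import struct


def encode_numeric_list(value):
    """Partition a value at commas and whitespace. If every token is
    numeric, pack the numbers as 32-bit floats; otherwise fall back to
    encoding the whole value as a UTF-8 character string."""
    tokens = [t for t in re.split(r"[,\s]+", value.strip()) if t]
    try:
        numbers = [float(t) for t in tokens]
    except ValueError:
        return value.encode("utf-8")
    return struct.pack(f"<{len(numbers)}f", *numbers)


# Hypothetical attribute value holding four numbers.
text = "10.5, 20.25 30.125,40.0"
packed = encode_numeric_list(text)
as_string = text.encode("utf-8")
```

Here the packed form needs four bytes per number regardless of how many digits each number has, whereas the character-string form needs one byte per character including the separators.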
With the configuration disclosed in Japanese Patent Laid-Open No. 2005-215951 and with BiM, developed by MPEG, type information of data included in schemas such as those of XHTML and SVG, which define the grammar (document structure) of structured documents, is analyzed, and an encoding suited to the data type of each attribute value and element content is applied. This enables XML data to be efficiently compressed. Note that MPEG stands for Moving Picture Experts Group. BiM stands for Binary MPEG format for XML. Technical information on BiM can be acquired from the following link:
http://www.iso.ch/iso/en/prods-services/popstds/mpeg.html
SVG stands for Scalable Vector Graphics. XHTML stands for Extensible HyperText Markup Language.
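Encoding according to data type can be sketched as follows. The type map ATTRIBUTE_TYPES is written by hand here and merely stands in for type information that would be derived from a schema; it, the function name encode_typed, and the chosen binary layouts are assumptions for illustration and do not reproduce BiM or the configuration of the cited publication.

```python
import struct

# Hypothetical type map standing in for schema-derived type information.
ATTRIBUTE_TYPES = {
    "cx": "float",
    "cy": "float",
    "r": "float",
    "stroke-width": "int",
    "fill": "string",
}


def encode_typed(name, value):
    """Encode an attribute value according to its declared data type."""
    kind = ATTRIBUTE_TYPES.get(name, "string")
    if kind == "float":
        return struct.pack("<f", float(value))  # 4 bytes
    if kind == "int":
        return struct.pack("<H", int(value))    # 2 bytes, 0..65535 only
    data = value.encode("utf-8")
    # Length-prefixed string; this sketch supports at most 255 bytes.
    return struct.pack("<B", len(data)) + data


encoded_cx = encode_typed("cx", "100")
encoded_width = encode_typed("stroke-width", "2")
encoded_fill = encode_typed("fill", "red")
```

Values whose type is unknown fall back to the string encoding in this sketch, so the result is never larger than a length-prefixed copy of the original text.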
However, the schemas of XHTML, SVG and the like used with conventional technology define only the generic grammar (document structure) of a structured document. Because conventional technology relies on such schema information, application-specific document structure is not encoded, even when XML data of the same document structure appears repeatedly.
For example, assume there is a structured document written in SVG such as that in FIG. 1A. FIG. 1A illustrates a structured document in which the same document structure appears repeatedly. In FIG. 1A, the portions denoted by reference numerals 9101 to 9103 have the same document structure; only the variables such as the attribute values and character strings are different. In this structured document, an empty element called “circle” has a plurality of attributes cx, cy, r, fill, stroke, and stroke-width. A “text” element appears after this “circle” element. The “text” element has a plurality of attributes x, y, and font-size, and includes a character string as element content. The “circle” elements and “text” elements included in this structured document are assumed to represent buttons, as shown by reference numeral 9104 in FIG. 1B.
With conventional binary XML technology, an encoding table such as in FIG. 2 is generated, and the structured document is encoded as shown in FIGS. 3 and 4 by analyzing the datatypes of the attribute values in the “circle” elements using the SVG schema and performing encoding suited to those datatypes. However, there is a limit to the reduction in data size, since codes are not allocated to application-specific document structure that appears repeatedly, such as the button objects in FIG. 1A.
FIG. 2 illustrates an encoding table generated using conventional binary XML technology. FIGS. 3 and 4 illustrate an encoded document encoded using conventional binary XML technology. With the conventional configuration, a code is allocated for every element name and attribute name, as in FIGS. 2 to 4, even though the same document structure is used repeatedly in the structured document to be encoded, as in FIG. 1A. Therefore, there is still room for further reductions in the data size of an encoded document generated using conventional encoding methods.
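The room for reduction can be made concrete with a small count. The following sketch assumes a document with three buttons whose element and attribute names follow the description of FIG. 1A; the attribute values, the character strings, and the function names are hypothetical and are not taken from the figures. It counts how many name codes a per-name encoding emits and how often each combination of element name and attribute names repeats.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: three buttons, each a "circle" plus a "text".
SAMPLE = """<svg>
  <circle cx="50" cy="50" r="40" fill="blue" stroke="black" stroke-width="2"/>
  <text x="30" y="55" font-size="12">Play</text>
  <circle cx="150" cy="50" r="40" fill="red" stroke="black" stroke-width="2"/>
  <text x="130" y="55" font-size="12">Stop</text>
  <circle cx="250" cy="50" r="40" fill="green" stroke="black" stroke-width="2"/>
  <text x="230" y="55" font-size="12">Next</text>
</svg>"""


def count_name_codes(xml_text):
    """Count the codes emitted when every element name and every
    attribute name occurrence is given its own code."""
    root = ET.fromstring(xml_text)
    return sum(1 + len(elem.attrib) for elem in root.iter())


def structure_signatures(xml_text):
    """Count how often each (element name, attribute names) combination
    occurs among the children of the root element."""
    root = ET.fromstring(xml_text)
    return Counter((elem.tag, tuple(elem.attrib)) for elem in root)


name_codes = count_name_codes(SAMPLE)
repeats = structure_signatures(SAMPLE)
```

Under these assumptions, each button contributes eleven name codes (seven for the “circle” element and its attributes, four for the “text” element and its attributes), and the same two combinations of names recur for every button.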