1. Field of the Invention
The present invention relates to an information processing apparatus, a control method thereof, a computer program, and a storage medium, and in particular to a technique for processing structured documents.
2. Description of the Related Art
Heretofore, under the XML specifications set forth by W3C, data written in the XML language has typically been encoded with a character encoding scheme such as UTF-8 or UTF-16. Increased data size is a problem in this case, since data structures, integers, decimal values and the like are all written as characters.
In contrast, binary XML techniques such as the Fast Infoset (ISO/IEC 24824-1) specification set forth by ISO are known. Binary XML techniques involve encoding integers and decimal values using the original data type, and replacing data structures and values that are described repeatedly with tokens of short data length. Data size can thus be reduced.
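By way of illustration only, the token replacement described above can be sketched as follows. This is a simplified sketch and not the actual Fast Infoset encoding; the function name `tokenize`, the `("lit", ...)` and `("ref", ...)` output tuples, and the sample element names are all assumptions introduced for this example.

```python
def tokenize(strings):
    """Replace repeated strings with short integer tokens.

    The first occurrence of a string is emitted literally and registered
    in the table; every later occurrence is emitted as its table index.
    """
    table = {}   # string -> token index (hypothetical in-memory vocabulary)
    output = []
    for s in strings:
        if s in table:
            output.append(("ref", table[s]))   # short token for a repeat
        else:
            table[s] = len(table)              # register the new string
            output.append(("lit", s))          # emit the literal once
    return output, table


# Hypothetical element names as they might appear in one document.
names = ["paragraph", "run", "text", "run", "text", "paragraph", "run"]
encoded, table = tokenize(names)
```

In this sketch each distinct name is stored once as a literal, and the four repeated occurrences are reduced to index references.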
However, the following problem arises with the foregoing method when combining a plurality of structured documents such as word processing documents. With binary XML, description that appears repeatedly within a single structured document is replaced with a common token. Description that is repeated across a plurality of structured documents, however, cannot be made common in this way, because the repeated description is tokenized separately within each of the structured documents. Thus, even if binary XML is applied to word processing documents, the description repeated across the documents remains redundant, and the data size cannot be adequately reduced despite there being a considerable amount of repeated description as a whole.
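The redundancy described above can be illustrated with a minimal sketch, assuming that each structured document keeps its own independent token table. The function name `literals_stored` and the three sample documents are hypothetical and serve only to make the effect visible.

```python
from collections import Counter


def literals_stored(documents):
    """Count how often each string is stored as a literal when every
    document builds its own token table from scratch."""
    counts = Counter()
    for names in documents:
        table = set()              # fresh, per-document table
        for s in names:
            if s not in table:
                table.add(s)
                counts[s] += 1     # this document stores the literal again
    return counts


# Three hypothetical word processing documents with similar element names.
docs = [
    ["paragraph", "run", "text", "run", "text"],
    ["paragraph", "run", "text", "paragraph"],
    ["paragraph", "table", "run", "text"],
]
stored = literals_stored(docs)

# Bytes spent on literals beyond the first stored copy of each string.
redundant_bytes = sum(
    (n - 1) * len(s.encode("utf-8")) for s, n in stored.items()
)
```

Under these assumptions, a name that occurs in all three documents is stored as a literal three times, even though repetition inside each document has already been removed.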
One conceivable method of addressing this problem involves providing common tokens in advance based on schema information, and using these common tokens throughout the plurality of structured documents. However, this method cannot be applied when a schema cannot be defined in advance, or when integrating a plurality of structured documents. Moreover, because the common tokens are provided in advance based on schema information, tokens have to be provided even for data that is rarely used in actual structured documents. Further, providing common tokens for element values and attribute values based on schema information is not easy.