(1) Field of the Invention
The present invention relates to a technique of compressing and decompressing data, particularly, to an apparatus, a method and a recording medium suitable for use when a document (a tag document) structured and described according to control characters (strings) called tags defining a document structure is compressed and decompressed.
(2) Description of the Related Art
A recent trend is to unify the formats of documents handled by computers, so that documents whose formats have differed from computer to computer, or from application to application, can be handled across different computer environments.
For example, there is an international standard (ISO 8879) for a document format called SGML (Standard Generalized Markup Language), established by ISO in 1986. An SGML document consists of, as schematically shown in FIG. 31, three portions, that is, an SGML declaration 301, a document type definition (DTD) 302 and a document instance 303.
The SGML declaration 301 is a portion declaring a character set and the like necessary to process the SGML document in another system. The DTD 302 is a portion defining the structure of a document, such as chapter, paragraph, title, etc., and is described in a format as shown in FIG. 32, for example. The DTD 302 shown in FIG. 32 is a portion of the DTD of HTML (Hyper Text Markup Language), which is a kind of SGML that has spread as the description format of the WWW (World Wide Web) on the Internet.
The document instance 303 is the body of the SGML document, which is made by a writer (user) using an editor on the computer while referring to the DTD 302. The document instance 303 is described using control characters (strings), generally called tags, that mark elements. Each of the tags is defined in the above DTD 302 and indicates what kind of element a part of the document instance 303 is (for example, whether the element is a title, a chapter, or the like).
FIG. 33 is a diagram showing an example of description of the document instance 303. In FIG. 33, a character string (&lt;TITLE&gt;, &lt;/TITLE&gt;, &lt;SECTION&gt;, &lt;/SECTION&gt;, etc.) sandwiched between "&lt;" and "&gt;", or "&lt;/" and "&gt;" is a tag. As shown in FIG. 33, a portion described as:
&lt;TITLE&gt; {character pullout} {character pullout} &lt;/TITLE&gt; represents that the characters (string) sandwiched between &lt;TITLE&gt;, which is a start-tag, and &lt;/TITLE&gt;, which is an end-tag, are an element (here, a title).
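The distinction between tags and the text they enclose can be illustrated with a short sketch. The following Python fragment is not part of the related art described here; it assumes, as a simplification, that a tag is any string beginning with "<" or "</" and ending with ">" with no angle brackets inside it. The names TAG_PATTERN and split_tags_and_text, and the sample string, are hypothetical.

```python
import re

# Assumed simplification: a tag is "<...>" or "</...>" containing no
# further angle brackets. Real SGML permits more (attributes with quoted
# ">" characters, omitted tags, and so on), which this sketch ignores.
TAG_PATTERN = re.compile(r"</?[^<>]+>")


def split_tags_and_text(instance):
    """Split a document instance into ('start' | 'end' | 'text', value) pairs."""
    tokens = []
    position = 0
    for match in TAG_PATTERN.finditer(instance):
        # Anything between the previous tag and this one is element content.
        if match.start() > position:
            tokens.append(("text", instance[position:match.start()]))
        kind = "end" if match.group().startswith("</") else "start"
        tokens.append((kind, match.group()))
        position = match.end()
    # Trailing text after the last tag, if any.
    if position < len(instance):
        tokens.append(("text", instance[position:]))
    return tokens


# Hypothetical document instance, used only for illustration.
sample = "<TITLE>Data Compression</TITLE><SECTION>Body text</SECTION>"
for kind, value in split_tags_and_text(sample):
    print(kind, value)
```

In this sketch the start-tag, the enclosed characters and the end-tag are reported as three separate tokens, which mirrors the way an element is delimited in FIG. 33.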
There is now a strong movement to employ SGML. In particular, the National Military Establishment of U.S.A. requires that documents submitted to it be described in SGML. In Japan, the Patent Office has decided to employ SGML for CD-ROM publications.
Meanwhile, various types of data such as character codes, vector information, image information, etc. are handled in computers, and the quantity of data has been rapidly increasing in recent years. For this reason, when handling a large quantity of data, a computer generally eliminates redundant portions of the data to compress it, so as to decrease the storage capacity required for the data or to enable high-speed data transmission.
Data compression is applied in several ways. An archiver and a compressing drive are described here as examples of applications of data compression used in computers.
The archiver compresses one or a plurality of data files and collects them into one file. By using the archiver on a rarely used file or an old file, it is possible to decrease the size of the file. When a server supplies files (data, applications or the like) through personal computer communication or the Internet, it is possible to save communication cost and to reduce the labor required for the transfer by collecting all the files into one, using the archiver.
On the other hand, the compressing drive compresses data in units of a disk drive, such as a hard disk (HD), a floppy disk (FD) or the like of a computer. By designating an arbitrary disk drive, all files in the designated drive are compressed and held. In the compressing drive, the compressing/decompressing process is generally performed in the background of the computer, so that compression/decompression (decompression at the time of reading, and compression at the time of writing) is automatically performed during ordinary operations (read/write) by the user. Therefore, since the user is not at all conscious of the compression/decompression of data, it appears to the user that the capacity of the designated disk drive has increased.
As the coding system used in these examples of application, a universal coding system, in which the efficiency of compression does not depend much upon the characteristics of the data, is often used, since various data such as characters, machine language, images, voice, etc. are handled in the computer.
Universal coding is classified into LZ-coding, which utilizes the repetition of characters (strings), and statistical coding, which codes a character according to its probability of occurrence. The LZ-coding stores characters (strings) that occurred in the past in a buffer, and, when the same character (string) occurs again, outputs the start position in the buffer and the coinciding length as coded data. The statistical coding calculates the probability (frequency) of occurrence of each character from the characters that have occurred in the past, and outputs a code according to that probability of occurrence. The LZ-coding can accomplish high-speed processing, whereas the statistical coding can accomplish a high compression rate.
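The two families of universal coding can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the coding system of any particular product: the LZ part is a greedy LZ77-style coder in which the position in the buffer is expressed as a distance back from the current position, and the statistical part merely computes the ideal code length (minus the base-2 logarithm of the probability) from the frequencies of a whole string, rather than adaptively from past characters. The function names and the parameters window and min_match are hypothetical.

```python
import math
from collections import Counter


def lz_encode(data, window=4096, min_match=3):
    """Greedy LZ77-style coding: emit ('lit', char) or ('ref', distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the buffer of past characters for the longest coinciding string.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz_decode(tokens):
    """Rebuild the original string from the tokens produced by lz_encode."""
    out = []
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, dist, length = token
            # Copy one character at a time so overlapping references work.
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)


def ideal_code_lengths(data):
    """Ideal code length in bits per symbol, -log2(p), from symbol frequencies."""
    counts = Counter(data)
    total = len(data)
    return {symbol: -math.log2(count / total) for symbol, count in counts.items()}


# Hypothetical input, used only for illustration.
text = "<TITLE>abc</TITLE><TITLE>abc</TITLE>"
encoded = lz_encode(text)
print(len(text), "characters ->", len(encoded), "tokens")
print(lz_decode(encoded) == text)
print(ideal_code_lengths("aaab"))
```

The exhaustive buffer search is chosen for readability; practical LZ coders use hash tables or similar structures to find matches quickly, and practical statistical coders pair the probabilities with, for example, Huffman or arithmetic coding to produce the actual bits.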
The data compressing techniques are ordinarily used to decrease a data amount in the computer or a communication cost. As to a document file, it is possible to compress the whole document so as to manage a large volume of documents.
In the document instance 303 of the SGML document, the quantity of data of the document is increased since tags defining elements in the document are added to the document itself. A study on an SGML document revealed that the proportion of tags in the document exceeds forty percent. Recently, not only documents submitted to public agencies but also manuals attached to products are more and more being changed to SGML documents. Such manuals run from several tens to, sometimes, several hundred pages, and are frequently revised. If a history of the revisions is included, the quantity of data of a manual is enormous.
If the SGML document is compressed using the above universal coding or another coding system, in the same way as ordinary documents or documents in another format, it is possible to decrease the quantity of the data to some extent. However, such an approach is quite inefficient, since a coding system used heretofore is merely applied to the SGML document as it is, and no consideration is given in the compression to the tags, which occupy a large portion of the document.
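How large a share of a document instance is taken up by tags can be estimated with a few lines of Python. This is a rough sketch and not the method of the study mentioned above: it assumes the same simplified notion of a tag as any "<...>" or "</...>" string without inner angle brackets, counts characters rather than bytes, and uses a hypothetical sample string. The names TAG_PATTERN and tag_proportion are hypothetical.

```python
import re

# Assumed simplification: a tag is "<...>" or "</...>" with no inner brackets.
TAG_PATTERN = re.compile(r"</?[^<>]+>")


def tag_proportion(instance):
    """Return the fraction of characters in the instance that belong to tags."""
    if not instance:
        return 0.0
    tag_chars = sum(len(match.group()) for match in TAG_PATTERN.finditer(instance))
    return tag_chars / len(instance)


# Hypothetical document instance, used only for illustration.
sample = "<TITLE>Data</TITLE>"
print(f"{tag_proportion(sample):.2f}")
```

The value obtained for such a tiny sample says nothing about real manuals; the sketch only shows how the proportion would be measured.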