1. Field of the Invention
The present invention relates to a compressor, a decompressor, and a data management system for electronic data.
2. Related Background Art
In recent years, the spread of the WWW (World Wide Web) has led to an increase in data exchange using structured documents, such as HTML (HyperText Markup Language) and XML (Extensible Markup Language). In particular, XML is drawing attention as a next-generation language that supplements HTML, and it is expected to become the most widespread language for information exchange on the Internet.
XML is a language whose data representation expresses a hierarchical structure of elements. A document written in XML (an XML document) is described, for example, as shown in FIG. 18. FIG. 18 is a diagram showing an XML document 10. As shown in FIG. 18, the content of an XML document is generally classified into markup and text information. In the XML document 10 shown in FIG. 18, the markup consists of element start marks (start tags) Ma, element end marks (end tags) Mb, and an empty element mark (empty element tag) Mc. In FIG. 18, <book>, <title>, <authors>, <author>, <contents>, and <chapter> are the element start marks Ma. In addition, </book>, </title>, </authors>, </author>, </contents>, and </chapter> are the element end marks Mb, and <misc/> is the empty element mark Mc. Each region from an element start mark Ma to the corresponding element end mark Mb, and the empty element mark Mc by itself, indicates an element (the basic information unit of XML).
Between an element start mark Ma and the corresponding element end mark Mb, other elements and/or text information may be described. In the XML document 10 shown in FIG. 18, for example, the text information includes the character string “Fundamentals of XML” in the element <title>, and the character string “YAMADA TARO” in the first element <author> appearing in the element <authors>.
Parent-child relations and sibling relations are defined among the elements and text information. In the case of the XML document 10 shown in FIG. 18, the element that starts with the element start mark Ma <book> and ends with the element end mark Mb </book> (i.e., the element <book>) contains the element that starts with the element start mark Ma <title> and ends with the element end mark Mb </title> (i.e., the element <title>). In this case, the element <book> is called a parent element of the element <title>, and the element <title> is called a child element of the element <book>. This is the parent-child relation between elements.
The element <title> and the element <authors> have the same parent element <book>, and are consecutive. In this situation, the element <title> and the element <authors> are called siblings; the element <title> is called a previous sibling for the element <authors>, and the element <authors> a next sibling for the element <title>. This is the sibling relation between elements.
In general, XML is expressed in text format, like the XML document 10 shown in FIG. 18, when it is communicated between computers or stored in a hard disk apparatus or a flash memory. On the other hand, for search and correction inside a computer, the XML document is parsed and transformed into a data structure suitable for use inside the computer.
FIG. 19 is a diagram showing a data structure 11 obtained by parsing the XML document 10 shown in FIG. 18 to transform it into a format suitable for use inside the computer. In FIG. 19, the elements and text information are described as vertices 301 to 317 with their respective types and values. A type is described on the left side of each vertex 301-317: “E” indicates an element; “T” indicates text information. For example, the vertex 301 has the type 301a of “E”. A value is described on the right side of each vertex: for example, the vertex 301 has the value 301b of “book”. Where a vertex indicates an element, a name of the element (element name) is described in its value; where a vertex indicates a text information item, a character string is described in its value. For example, the vertex 302 indicates the element name <title>, and the vertex 306 indicates the text information “Fundamentals of XML”.
Each vertex 301-317 has reference information chosen from among four references: parent reference, child reference, next sibling reference, and previous sibling reference, in order to express the parent-child relations and sibling relations of the original (non-transformed) XML document 10. In the foregoing XML document 10, the element <title> is the child element of the element <book>, and the element <book> is the parent element of the element <title>. Accordingly, in the data structure 11 shown in FIG. 19, the vertices 301 and 302 have the child reference P1 from <book> to <title> and the parent reference P2 from <title> to <book>, each of which is expressed by an arrow. The element <book> also has the element <authors> as the child element following <title>. In this case, the vertices 302 and 303 hold the next sibling reference P3 from the element <title> to the element <authors> and the previous sibling reference P4 from the element <authors> to the element <title>. It is defined that, among elements in sibling relations, only the head child element (e.g., the element <title>) has a direct parent reference; the other siblings have none.
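As an illustration only, the vertex structure with the four references described above might be sketched in Python as follows. The class name `Vertex`, the function `build_vertices`, the use of list indices as references, and the sample document are assumptions introduced for this sketch; the vertex numbering here follows document order and does not reproduce the numbering 301 to 317 of FIG. 19.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vertex:
    kind: str                           # "E" for an element, "T" for text information
    value: str                          # element name or character string
    parent: Optional[int] = None        # set only on the head child
    child: Optional[int] = None         # points to the head child only
    next_sibling: Optional[int] = None
    prev_sibling: Optional[int] = None


def build_vertices(xml_text: str) -> List[Vertex]:
    """Parse xml_text and return vertices linked by index references."""
    vertices: List[Vertex] = []

    def add(kind: str, value: str) -> int:
        vertices.append(Vertex(kind, value))
        return len(vertices) - 1

    def visit(elem: ET.Element) -> int:
        me = add("E", elem.tag)
        kids: List[int] = []
        if elem.text and elem.text.strip():
            kids.append(add("T", elem.text.strip()))
        for sub in elem:
            kids.append(visit(sub))
            if sub.tail and sub.tail.strip():
                kids.append(add("T", sub.tail.strip()))
        if kids:
            # Only the head child carries a direct parent reference.
            vertices[me].child = kids[0]
            vertices[kids[0]].parent = me
        for left, right in zip(kids, kids[1:]):
            vertices[left].next_sibling = right
            vertices[right].prev_sibling = left
        return me

    visit(ET.fromstring(xml_text))
    return vertices


if __name__ == "__main__":
    # Hypothetical fragment; only part of FIG. 18 is known from the text.
    sample = ("<book><title>Fundamentals of XML</title>"
              "<authors><author>YAMADA TARO</author></authors></book>")
    for i, v in enumerate(build_vertices(sample)):
        print(i, v)
```

In this sketch the element <authors> is reachable from <book> only through <title> and its next sibling reference, which mirrors the rule that siblings other than the head child hold no direct parent reference.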
In the data structure, the reference information between vertices can be managed separately from the element names and text information; for example, the two can be expressed as shown in FIG. 20(a) and in FIG. 20(b), respectively. FIG. 20(a) is a diagram showing cross reference data 400 holding the reference information between vertices, and FIG. 20(b) is a diagram showing a table 450 that lists an assembly of vertices (also called a vertex group), each with its type (element or text information) and its value.
However, since the capacity of storage devices such as memories is limited, the data structure needs to be compressed efficiently before it is stored. In this regard, the document “Mathias Neumuller and John N. Wilson: ‘Compact In-Memory Representation of XML’, Internal Report of University of Strathclyde” (hereinafter referred to as Document 1) discloses a method of compressing the element names and text information shown in FIG. 20(b). In the compression method of Document 1, the element names and the text information held at the respective vertices are stored separately in the form of a dictionary, and each vertex is given an index into the dictionary, so that an identical character string is not stored redundantly.
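The dictionary approach can be illustrated by a minimal Python sketch. It is not taken from Document 1; the function name `dictionary_encode` and the sample list of element names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Store each distinct string once and give every vertex an index to it."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


if __name__ == "__main__":
    # Hypothetical element names of six vertices; "author" occurs three times.
    names = ["book", "title", "authors", "author", "author", "author"]
    dictionary, indices = dictionary_encode(names)
    print(dictionary)  # ['book', 'title', 'authors', 'author']
    print(indices)     # [0, 1, 2, 3, 3, 3]
```

The same function could be applied separately to the text information, so that element names and character strings each have their own dictionary.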
On the other hand, the document “Hartmut Liefke and Dan Suciu: ‘XMill: An Efficient Compressor for XML Data’, In Proceedings of ACM SIGMOD International Conference on Management of Data, 2000” (hereinafter referred to as Document 2) discloses a method of compressing an XML document by reusing partial structures in the XML document. This method separates an original XML document into three parts, namely structure, element name information, and text information, and compresses each of them by an ordinary compression algorithm such as LZ77 (for the details of LZ77, reference should be made to “Jacob Ziv, Abraham Lempel: A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 23(3): 337-343 (1977)”). The compression method disclosed in Document 2 will be described below. In this compression method, first, each element start mark and each empty element mark is replaced with a short element name such as “#1”, “#2”, and so on, and each element end mark is replaced with “/”. The text information is replaced with “C”. When this separation is applied to the XML document 10, the resulting data structure 12, element name information 13, and text information 14 are expressed as shown in FIG. 21, in FIG. 22, and in FIG. 23, respectively.
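The separation step can be sketched in Python as shown below. This is a simplified illustration and not the implementation of Document 2: the function name `separate`, the assignment of short names in order of first appearance, and the sample document (including the placeholder text “SECOND AUTHOR”) are assumptions. Because the parser used here does not distinguish an empty element tag from a start tag followed by an end tag, an empty element also produces a “/” token in this sketch.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def separate(xml_text: str) -> Tuple[List[str], List[str], List[str]]:
    """Split a document into structure tokens, element names, and text items."""
    structure: List[str] = []
    names: List[str] = []   # names[i] is the element name behind "#<i+1>"
    texts: List[str] = []

    def visit(elem: ET.Element) -> None:
        if elem.tag not in names:
            names.append(elem.tag)
        structure.append("#%d" % (names.index(elem.tag) + 1))
        if elem.text and elem.text.strip():
            texts.append(elem.text.strip())
            structure.append("C")          # text is replaced with "C"
        for sub in elem:
            visit(sub)
            if sub.tail and sub.tail.strip():
                texts.append(sub.tail.strip())
                structure.append("C")
        structure.append("/")              # end mark is replaced with "/"

    visit(ET.fromstring(xml_text))
    return structure, names, texts


if __name__ == "__main__":
    sample = ("<book><title>Fundamentals of XML</title><authors>"
              "<author>YAMADA TARO</author><author>SECOND AUTHOR</author>"
              "</authors><misc/></book>")
    structure, names, texts = separate(sample)
    print("".join(structure))
    print(names)
    print(texts)
```

Each of the three outputs could then be passed independently to a general-purpose compressor.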
In the compression method described in Document 2, the structure, the element name information, and the text information are compressed independently of each other by a compression algorithm such as LZ77. The compression algorithm is outlined below. A compression algorithm such as LZ77 discovers partial patterns included in the original input information and repeatedly reuses them as templates, thereby effecting compression. Take, for example, the compression of the data structure 12 shown in FIG. 21. Five templates X, Y, Z, W, and V are assigned as follows: X=“#1 #2C/#3”; Y=“#4C/”; Z=“/#5”; W=“#6 C/”; V=“/#7//”. The data structure 12 shown in FIG. 21 can then be expressed as “XYYYZWWV”, in which Y and W are each used more than once as templates indicating partial document structures. If templates can be reused in this manner so that the original document is expressed by a small number of templates, the volume of information indicating the original XML document is reduced, which achieves the compression.
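The template example can be checked with a short Python sketch that expands “XYYYZWWV” back into the token string. This sketch only illustrates template substitution and is not an implementation of LZ77 itself; the function name `expand` is hypothetical, and the spaces that appear inside the templates X and W in the text are dropped here on the assumption that they are not significant.

```python
from typing import Dict

# Templates as given in the text, with internal spaces removed (assumption).
TEMPLATES: Dict[str, str] = {
    "X": "#1#2C/#3",
    "Y": "#4C/",
    "Z": "/#5",
    "W": "#6C/",
    "V": "/#7//",
}


def expand(encoded: str, templates: Dict[str, str]) -> str:
    """Replace every template symbol with the partial structure it stands for."""
    return "".join(templates[symbol] for symbol in encoded)


if __name__ == "__main__":
    structure = expand("XYYYZWWV", TEMPLATES)
    print(structure)  # #1#2C/#3#4C/#4C/#4C//#5#6C/#6C//#7//
    # Eight template symbols stand in for the full token string.
    print(len("XYYYZWWV"), len(structure))
```

The saving comes from Y and W: each is defined once and referenced three times and twice, respectively.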