1. Field of the Invention
The present invention relates to a data compressing apparatus, a data reconstructing apparatus, and methods therefor for forming code data from a character train stream constituting a structured document that includes tags. More particularly, the invention relates to a data compressing apparatus, a data reconstructing apparatus, and methods therefor in which tag information is separated from the character train stream of a structured document and coding and reconstruction are then performed.
2. Description of the Related Arts
In recent years, various kinds of data such as character codes, image data, and the like have come to be handled by computers. Further, with the spread of the Internet and intranets, the volume of E-mail and electronic documents is increasing. When such a large amount of data is compressed by omitting its redundant portions, the required storage capacity can be reduced and the compressed data can be sent to a remote place in a shorter time.
The field of the invention is not limited to the compression of character codes; the invention can be applied to various kinds of data. Hereinbelow, following the terminology used in information theory, one word unit of data is called a character, and data in which an arbitrary number of words are connected is called a character train.
Recently, there is a trend toward unifying the formats of documents handled on computers. Within this trend, a method of forming documents efficiently is being tried in which the contents of a document are partially distinguished by using tags, a plurality of document parts such as titles, paragraphs, and the like are formed in advance, the relations among the document parts are determined, and the document is structured and edited. Examples of structured documents, in which the concept of structure is incorporated into the document, include documents conforming to the international standards ODA (ISO 8613: Open Document Architecture) and SGML (ISO 8879: Standard Generalized Markup Language). A document processing method using such a structured document is disclosed in, for example, JP-A-5-135054. The structured document according to SGML is highly compatible with conventional text processing systems, and it has spread, mainly from the U.S.A., and been put into practical use. In the structured document according to SGML, a template of the document structure is given in advance, and the document structure is restricted to that template.
FIG. 1 shows an SGML structured document constructed by three portions: an SGML declaration 200, a document type definition (DTD) 202, and a document realization value 204. The template which defines the structure of the document is the document type definition 202, in which, as shown in FIG. 2, the document structure such as chapter, paragraph, title, and the like is defined. In the structured document of SGML, the document text is divided by identifiers called tags in order to express the document structure.
FIG. 3 shows a specific example of the structured document of SGML. For example, the title of a document is expressed as “<TITLE> Specification of the Invention (Device)</TITLE>”. That is, the characters sandwiched between the start tag “<TITLE>” and the end tag “</TITLE>” constitute an element; in this case, the characters give the title contents “Specification of the Invention (Device)”. At present, the use of SGML is increasing, mainly in public organizations. In the U.S.A. in particular, the Department of Defense requires that documents be submitted in SGML. In Japan as well, such a structured document is adopted for the CD-ROM Official Gazette of the Patent Office. HTML (Hyper Text Markup Language), which has spread as the description format of the WWW (World Wide Web) used on the Internet, is one form of SGML.
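The relation between a start tag, an end tag, and the enclosed element can be illustrated with a minimal Python sketch. It is an illustration only and not part of the described apparatus: the function name `extract_elements` and the regular expression are hypothetical, and the sketch assumes simple, non-nested tags without attributes, whereas real SGML permits more than this pattern covers.

```python
import re
from typing import List, Tuple

# Assumption: a start tag <NAME>, arbitrary characters, and the matching
# end tag </NAME>, with no nesting and no attributes.
ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9]*)>(.*?)</\1>", re.DOTALL)

def extract_elements(document: str) -> List[Tuple[str, str]]:
    """Return (tag name, element content) pairs found in the document."""
    return [(m.group(1), m.group(2).strip()) for m in ELEMENT.finditer(document)]

sample = "<TITLE> Specification of the Invention (Device)</TITLE>"
elements = extract_elements(sample)
print(elements)
```

For the title example above, the sketch yields the tag name TITLE paired with the title contents.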
As a method of compressing a structured document of SGML or the like, the applicant of the present invention has proposed the method disclosed in Japanese Patent Application Laid-Open No. (JP-A) 9-261072. According to that method, when document data of a structured document having tag information is inputted, the tag information defined by the document type definition (DTD) or the like is detected. The detected tag information is outputted as it is, without conversion. Further, upon detection of the tag information, the operating mode is shifted to a mode for coding the input character train other than the tag information.
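The pass-through of tag information and the coding of the remaining character train can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation of JP-A 9-261072: tags are detected here with a simple regular expression rather than from the document type definition, and `placeholder_coder` is a hypothetical stand-in for the actual coding.

```python
import re
from typing import Callable, List, Tuple

# Assumption: anything of the form <...> or </...> is treated as tag information.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def compress_with_tags(document: str,
                       encode_text: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Emit ("tag", raw) for each tag and ("code", encoded) for text between tags."""
    output = []
    position = 0
    for match in TAG.finditer(document):
        if match.start() > position:
            # Character train other than tag information: coded.
            output.append(("code", encode_text(document[position:match.start()])))
        # Tag information: outputted as it is, without conversion.
        output.append(("tag", match.group(0)))
        position = match.end()
    if position < len(document):
        output.append(("code", encode_text(document[position:])))
    return output

def placeholder_coder(text: str) -> str:
    """Hypothetical stand-in for the coder; hex encoding compresses nothing."""
    return text.encode("utf-8").hex()

sample = "<TITLE>Device</TITLE><P>Text</P>"
result = compress_with_tags(sample, placeholder_coder)
print(result)
```

The resulting sequence interleaves unconverted tags with coded text, which corresponds to the mixed layout of the output described above.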
The basic algorithm of the coding is shown in FIG. 4. First, in step S1, the dictionary is searched to determine whether the input character or character train is identical to a character or character train registered in advance. If it is, the input data is encoded as the registration number of the dictionary entry in step S2, and the code is outputted in step S3. If no identical registered character or character train is found in step S1, the original input character or character train is outputted as it is in step S5. These processes are repeated until no input character train remains in step S4. When the SGML document file of FIG. 3 is subjected to the encoding of FIG. 4, the compression data file of FIG. 5 is obtained. In the compression data file, the uncompressed tag information and the compressed document text coexist in a single file.
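The flow of steps S1 to S5 can be sketched in Python as follows. The sketch rests on assumptions that the text does not specify: candidate character trains are chosen by greedy longest match, the dictionary contents are invented for illustration, and codes are represented as integers mixed with literal characters. The names `encode` and `decode` are hypothetical.

```python
from typing import Dict, List, Union

Code = Union[int, str]  # registration number, or a literal character passed through

def encode(text: str, dictionary: Dict[str, int]) -> List[Code]:
    """Replace registered character trains with registration numbers."""
    max_len = max((len(entry) for entry in dictionary), default=0)
    output: List[Code] = []
    i = 0
    while i < len(text):  # step S4: repeat until no input remains
        # Step S1: look for the longest registered train starting at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                output.append(dictionary[candidate])  # steps S2 and S3
                i += length
                break
        else:
            output.append(text[i])  # step S5: output the input as it is
            i += 1
    return output

def decode(codes: List[Code], dictionary: Dict[str, int]) -> str:
    """Reconstruct the text by mapping registration numbers back to entries."""
    reverse = {number: entry for entry, number in dictionary.items()}
    return "".join(reverse[c] if isinstance(c, int) else c for c in codes)

# Hypothetical dictionary contents, for illustration only.
dictionary = {"the ": 0, "Invention": 1, "Specification": 2}
codes = encode("Specification of the Invention", dictionary)
print(codes)
```

With this dictionary the sketch produces `[2, ' ', 'o', 'f', ' ', 0, 1]`: the three registered trains become registration numbers, and the unregistered characters are outputted unchanged.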
Since this method can compress a document text having an enormous data amount to a data amount that is manageable in practice, it is a very useful technique for realizing electronic document texts. In the compression data file of the structured document shown in FIG. 5, however, the tag information exists as uncompressed portions mixed into the compressed document data. To retrieve tag information from the file, therefore, the whole file has to be expanded into memory and searched for the necessary tag information. Likewise, when the user wants to retrieve a keyword from the text in the compressed portion, the whole file has to be expanded into memory and processed. Consequently, in order to retrieve or obtain a necessary document from the compression data file of the structured document, portions that are not needed must also be read; the amount of data to be transmitted increases, reading the data takes time, and a large memory area and a large disk capacity must be secured.