1. Field of the Invention
The present invention relates generally to a data processing apparatus and a data processing method, and more particularly to a data processing method and apparatus in which pieces of input data are compared with pieces of dictionary data, a piece of dictionary data agreeing with a piece of input data is coded, and a piece of coded dictionary data agreeing with a piece of input data is decoded. The present invention also relates to a data processing method and apparatus in which a piece of input data is compressed according to a Huffman coding method and a piece of compressed data is decoded according to the Huffman coding method.
2. Description of the Prior Art
Information processing apparatuses utilizing a storing apparatus, such as a magnetic disk apparatus, in which a large volume of data is stored, and transmission apparatuses for transmitting a large volume of data through a communication line have recently come into widespread use as information processing apparatuses have gained higher functionality and have come to be used for various purposes. In the information processing field, a data compressing apparatus is therefore required to substantially increase the storage capacity of the storing apparatus in data storage and to substantially shorten the data transmission time in data transmission, so that a large volume of data is processed at a high efficiency and the information processing cost borne by a user is reduced.
A data compressing theory was first proposed by Mr. Claude Shannon of the Bell Telephone Laboratories in the USA, who disclosed the concept of "entropy" in data compressing in 1948. A similar theory was disclosed by Mr. R. M. Fano of the Massachusetts Institute of Technology in the USA in almost the same period. Therefore, the resulting coding is generally called Shannon-Fano coding. In Shannon-Fano coding, the higher the occurrence probability of a character, the smaller the number of bits of the variable-length code allocated to the character. Each piece of data is thereby compressed.
Thereafter, Mr. Huffman disclosed a variable-length code generating method in 1952 in a paper entitled "A Method for the Construction of Minimum-Redundancy Codes", and Huffman coding has been mainly utilized in place of Shannon-Fano coding in the data compressing field. In Huffman coding, data is compressed by exploiting the differences in occurrence frequency among characters.
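The code generating principle described above can be sketched as follows. This is only a minimal illustration in Python and is not taken from the cited literature: the function name `huffman_code` and the tie-breaking rule between equal frequencies are assumptions introduced for the example.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {character: bit string} derived from character frequencies.

    Hypothetical helper for illustration: the two least frequent subtrees
    are merged repeatedly, so frequent characters end up with short codes.
    """
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {character: code so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix one more bit onto every code in each merged subtree.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

For the input "aaaabbcd", this sketch allocates a 1-bit code to "a", a 2-bit code to "b" and 3-bit codes to "c" and "d", so the eight characters occupy 14 bits instead of the 16 bits required by a fixed 2-bit code.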
Thereafter, another data compressing method was proposed in which the concept of a dictionary is used and data compressing is performed by considering the repetition of character strings. This data compressing method is generally called a slide dictionary method or a Lempel-Ziv (LZ77) coding method, and is disclosed in the literature: Jacob Ziv and Abraham Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, 1977. Therefore, the basic principles of conventional compressing algorithms are classified into data compressing represented by the Huffman coding method, in which the occurrence probability of each character is considered, and data compressing represented by the Lempel-Ziv coding method, in which the repetition of character strings is considered.
The Lempel-Ziv coding method was improved into a Lempel-Ziv-Storer-Szymanski (LZSS) coding method by adding two types of alteration to the Lempel-Ziv coding method. The LZSS coding method was disclosed by Mr. Storer and Mr. Szymanski in 1982 in a paper entitled "Data Compression via Textual Substitution". In the LZSS coding method, functions in data search are improved.
2.1. First Previously Proposed Art
FIG. 1(A) is a constitutional view of a conventional data compressing apparatus according to a first previously proposed art, FIG. 1(B) is an explanatory view showing an arrangement of pieces of dictionary data and pieces of input data in a data searching step, and FIG. 1(C) is an explanatory view showing another arrangement of pieces of dictionary data and pieces of input data in a data expelling step. FIGS. 2(A), 2(B) and 2(C) are explanatory views showing a coding processing in which the same character string is undesirably duplicated in a dictionary buffer.
For example, as shown in FIG. 1(A), a data compressing apparatus to which the LZ77 coding method (or the sliding dictionary method) is applied is provided with an original data file 1, a data converting apparatus 2 and a compressed data file 3. The data converting apparatus 2 is provided with an input buffer 2A, a dictionary buffer 2B and a central processing apparatus (CPU) 2C.
As shown in FIG. 1(B), which illustrates a data configuration according to the LZ77 coding method, a coded input data string is formed within a certain memory range and is stored in the dictionary buffer 2B. The dictionary data stored in the dictionary buffer 2B is transferred to the compressed data file 3 and does not remain as dictionary contents after the data compressing. To set the dictionary buffer 2B in an initial condition, a piece of data matching the data configuration is, in some cases, initially stored in the dictionary buffer 2B.
A functional operation of the data compressing apparatus is described.
A piece of original data read out from the original data file 1 is written in the input buffer 2A of the data converting apparatus 2 as an input data string Din. Thereafter, the input data string Din written in the input buffer 2A and a dictionary data string stored in the dictionary buffer 2B are compared with each other under the control of the CPU 2C to perform a data search. The dictionary data string is formed by storing pieces of input data transferred from the original data file 1. The data search is performed from a head position of the dictionary data string stored in the dictionary buffer 2B, and a longest agreement data string agreeing with a piece of input data is found out in the dictionary buffer 2B.
After the longest agreement data string agreeing with a piece of input data is found out as a result of the comparison under the control of the CPU 2C, the longest agreement data string is coded to compress the longest agreement data string and is stored in the compressed data file 3 as a piece of compressed data Dout.
Therefore, because an input data string previously coded is utilized as a dictionary data string, a longest agreement data string included in the dictionary data string and the input data string in common is found out, and the longest agreement data string can be coded according to the LZ77 coding method.
In detail, in the LZ77 coding method, the input data string stored in the input buffer 2A having a certain memory range is searched for a piece of input data which agrees with a piece of dictionary data included in the dictionary data string stored in the dictionary buffer 2B. The agreeing data is called the longest agreement data string, and the input data is compressed by coding it. The longest agreement data string is specified by both an agreement starting position (that is, an agreement starting address or an offset) at which the agreement of the input data and the dictionary data starts in the dictionary buffer 2B and the input buffer 2A, and a largest length (normally indicated by the number of bytes) from the agreement starting position to an agreement ending position at which the agreement of the input data and the dictionary data ends. Thereafter, the input data of the input buffer 2A is transferred to the dictionary buffer 2B just after the input data agrees with the dictionary data.
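The data search described above can be sketched as follows. This is a minimal illustration under stated assumptions and is not the control program of the CPU 2C: the function name `find_longest_match` is hypothetical, the search is a naive scan from the head position of the dictionary, and an agreement is not allowed to run past the end of the dictionary buffer.

```python
def find_longest_match(dictionary, lookahead):
    """Return (agreement starting position, largest length).

    Hypothetical helper: `dictionary` models the contents of the dictionary
    buffer and `lookahead` models the contents of the input buffer.
    A length of 0 means that no agreement was found.
    """
    best_offset, best_length = 0, 0
    # Search from the head position of the dictionary data string.
    for start in range(len(dictionary)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(dictionary)
               and dictionary[start + length] == lookahead[length]):
            length += 1
        # Keep the first occurrence of the longest agreement found so far.
        if length > best_length:
            best_offset, best_length = start, length
    return best_offset, best_length
```

With an assumed dictionary string "anuimadxyz" and an assumed input string "uimadf", the sketch returns the starting position 2 and the length 5, that is, the pair by which the string "uimad" is specified.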
For example, as shown in FIG. 1(B), a longest agreement data string "uimad" is specified by an agreement starting position indicated by "2" in the dictionary buffer 2B and a largest length equal to 5 bytes. In the input buffer 2A, a next character "f" follows the longest agreement data string "uimad". After the coding of the data string "uimad" is finished, as shown in FIG. 1(C), a piece of dictionary data "anuima" having 6 bytes, which corresponds to the sum of the 5 bytes of the longest agreement data string "uimad" and the 1 byte of the next character "f", is expelled from a head portion of the dictionary buffer 2B in a next step. Thereafter, a piece of input data "uimadf" having 6 bytes in the input buffer 2A is transferred to the dictionary buffer 2B as a piece of dictionary data, so that the data "uimadf" is stored in the dictionary buffer 2B in place of the expelled dictionary data "anuima". As a result, a window of the dictionary buffer 2B appears to slide to the right over the data. Therefore, the LZ77 coding method is called the slide dictionary method.
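The expelling and transferring steps of this example can be sketched as follows. The function name `slide_window` and the buffer contents other than "anuima", "uimad" and "f" are assumptions made only for illustration, because the full contents of the buffers in FIG. 1(B) are not reproduced here.

```python
def slide_window(dictionary, lookahead, match_length):
    """Slide the dictionary after one coding step.

    Hypothetical helper: expels (match_length + 1) bytes from the head of
    the dictionary and appends the same number of bytes taken from the
    head of the input buffer. Returns (expelled, new_dictionary, remaining).
    """
    # The matched string plus the next character are consumed together.
    consumed = min(match_length + 1, len(lookahead))
    expelled = dictionary[:consumed]
    new_dictionary = dictionary[consumed:] + lookahead[:consumed]
    remaining = lookahead[consumed:]
    return expelled, new_dictionary, remaining
```

With an assumed dictionary "anuimadxyz", an assumed input "uimadfgh" and a largest length of 5, the sketch expels "anuima" and stores "uimadf" at the tail of the dictionary, which corresponds to the movement of the window described above.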
2.2. Problems To Be Solved By The Invention
However, in the Lempel-Ziv coding method, after a data string is coded, a piece of dictionary data having the same number of bytes as the coded data string is expelled from a head portion of the dictionary buffer 2B in which many pieces of dictionary data are stored. Therefore, the same piece of dictionary data is undesirably stored in duplicate in the dictionary buffer 2B, and a piece of dictionary data which has previously agreed with a piece of input data is inevitably expelled from the dictionary buffer 2B. As a result, the data compressing efficiency is lowered.
Therefore, a data processing method and apparatus are desired in which the sliding method of the dictionary data is modified so as not to merely expel, from a head portion of the dictionary buffer 2B, a piece of dictionary data or a dictionary data string agreeing with a piece of input data or an input data string, in which the number of reference dictionaries is substantially increased, in which pieces of dictionary data are not stored in duplicate in the dictionary buffer 2B, in which the expelled dictionary data or dictionary data string is efficiently utilized, and in which the data compressing efficiency is enhanced.
In detail, as described above, after the coding of the data string "uimad" is finished as shown in FIG. 1(B), a piece of dictionary data "anuima" having 6 bytes, which corresponds to the sum of the 5 bytes of the longest agreement data string "uimad" and the 1 byte of the next character "f", is expelled from a head portion of the dictionary buffer 2B in a next step (the slide dictionary method of the LZ77 method). Therefore, the following problems arise.
(1) There is a possibility that the same piece of dictionary data is undesirably stored in duplicate in the dictionary buffer 2B and the data compressing efficiency is lowered. For example, as shown in FIG. 2(A), in a case where a data string "abc" of the input buffer 2A agrees with that of the dictionary buffer 2B in a data searching condition in which the data string "abc" is not yet coded, a piece of dictionary data "xyz" having 3 bytes, which is equal to the number of bytes of the agreeing data string "abc", is unconditionally expelled from a head portion of the dictionary buffer 2B as shown in FIG. 2(B), according to a modification type (that is, the slide dictionary method) of the LZ77 method. Therefore, as shown in FIG. 2(C), the data string "abc" remains in duplicate in the dictionary buffer 2B after the data string "abc" is coded.
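The duplication described in item (1) can be reproduced with the following minimal sketch. The function name `slide_by_match` and the dictionary contents other than "xyz" and "abc" are assumptions made for illustration. As in the example of FIGS. 2(A) to 2(C), only the bytes of the agreeing data string are expelled.

```python
def slide_by_match(dictionary, matched):
    """Expel len(matched) bytes from the head and append the matched string.

    Hypothetical helper modelling the unconditional expelling step: the
    head of the dictionary is removed regardless of whether the matched
    string is already present elsewhere in the dictionary.
    """
    n = len(matched)
    return dictionary[n:] + matched

# Assumed buffer contents: "xyz" at the head, "abc" already registered.
before = "xyzabcdefg"
after = slide_by_match(before, "abc")
duplicates = after.count("abc")
```

In this sketch the dictionary becomes "abcdefgabc", so 3 of its 10 bytes are spent on a second copy of "abc" while "xyz" is no longer available for later searches.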
(2) To increase the data compressing efficiency, it is conceivable to expand the memory region of the dictionary buffer 2B to widen the searching range. However, when the size of the dictionary buffer 2B is increased, the searching time is, in general, considerably increased. Also, when the size of the dictionary buffer 2B is increased, it is required to lengthen the data length of the positional information for a piece of data to be coded. In addition, even though a piece of dictionary data previously agreed with a piece of input data, the dictionary data is inevitably expelled from the dictionary buffer 2B.
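The trade-off described in item (2) can be estimated with the following minimal sketch. It assumes that the positional information is a plain binary offset and that the search is a naive comparison of every dictionary position against the input data. Both assumptions, and the function names, are introduced only for illustration.

```python
def offset_field_bits(dictionary_size):
    """Bits needed to address any starting position in the dictionary.

    Assumes a plain binary offset in the range 0 .. dictionary_size - 1.
    """
    return max(1, (dictionary_size - 1).bit_length())

def naive_search_comparisons(dictionary_size, lookahead_size):
    """Upper bound on byte comparisons for a naive longest-match search."""
    return dictionary_size * lookahead_size
```

Under these assumptions, enlarging the dictionary from 4096 bytes to 32768 bytes lengthens the offset from 12 bits to 15 bits for every coded string and multiplies the upper bound on comparisons by 8.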
2.3. Second Previously Proposed Art
FIG. 3(A) is a constitutional view of a first conventional data compressing apparatus according to a second previously proposed art, and FIG. 3(B) is a constitutional view of a second conventional data compressing apparatus according to the second previously proposed art. The first conventional data compressing apparatus shown in FIG. 3(A) is disclosed in a Published Unexamined Japanese Patent Application No. 123619 of 1992 (H2-123619), and the second conventional data compressing apparatus shown in FIG. 3(B) is disclosed in a Published Unexamined Japanese Patent Application No. 280517 of 1992 (H2-280517).
As shown in FIG. 3(A), a first conventional data compressing apparatus (hereinafter, called a first apparatus) obtained by modifying the Lempel-Ziv coding method is provided with a measuring means 11 for measuring an occurrence frequency of a piece of input data DIN, a converting means 12 for converting the input data DIN into a piece of converted data DT according to the occurrence frequency, and a coding means 13 for searching pieces of candidate data relating to the converted data DT one after another according to a dictionary searching list and outputting a reference numeral of a piece of candidate data as a piece of coded data DOUT.
An operation in the first apparatus is described. When an occurrence frequency of a piece of input data DIN is initially measured by the measuring means 11, the input data DIN is converted into a piece of converted data DT according to the occurrence frequency in the converting means 12. In this case, the higher the occurrence frequency of the input data DIN, the lower a value of a code indicating the converted data DT. Also, the lower the occurrence frequency of the input data DIN, the higher the value of the code indicating the converted data DT. Thereafter, pieces of candidate data relating to the converted data DT are searched one after another according to a dictionary searching list by the coding means 13, a piece of particular candidate data agreeing with the input data DIN is found out from the candidate data, and a reference numeral of the particular candidate data is output as a piece of coded data DOUT relating to the input data DIN. Therefore, the input data DIN can be coded to the coded data DOUT in the first apparatus.
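The conversion performed by the measuring means 11 and the converting means 12 can be sketched as follows. The sketch assumes that the occurrence frequencies are measured over the whole input before conversion and that ties are broken by character order. Neither detail is stated above, and the function names are hypothetical. The dictionary search performed by the coding means 13 is not modelled.

```python
from collections import Counter

def build_rank_table(data):
    """Map each character to a code value; frequent characters get low values.

    Hypothetical helper modelling the measuring step and the conversion rule.
    """
    counts = Counter(data)
    # Sort by descending frequency, then by character to make ties stable.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: rank for rank, ch in enumerate(ordered)}

def convert(data, table):
    """Convert input data into converted data according to the table."""
    return [table[ch] for ch in data]
```

For the assumed input "aaabbc", the most frequent character "a" is converted to the value 0, "b" to 1 and "c" to 2.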
Also, as shown in FIG. 3(B), a second conventional data compressing apparatus (hereinafter, called a second apparatus) based on arithmetic coding is provided with a self-organization coding section (hereinafter, called an SOR coding section) 14 having a searching and registering section 14A and a dictionary rearranging section 14B, a dictionary 15 for storing pieces of dictionary data (or character strings), a counter 16 for counting an occurrence frequency and an accumulated frequency of each of a plurality of character strings, and an arithmetic coding section 17 for arithmetic-coding an SOR code produced in the SOR coding section 14 and outputting a piece of multi-valued code data.
An operation in the second apparatus is described. The dictionary 15 is referred to by the searching and registering section 14A of the SOR coding section 14 to recognize whether or not a character string to be compressed is registered in the dictionary 15. Thereafter, the character strings stored in the dictionary 15 are renewed according to a rule of self-organization by the dictionary rearranging section 14B. That is, the character strings are rearranged on condition that the registration number of a character string is lowered as the occurrence frequency of the character string is increased. When a character string which is the same as that stored in the dictionary 15 is input to the SOR coding section 14, the registration number of the character string in the dictionary 15 is output to the arithmetic coding section 17 as an SOR code by the searching and registering section 14A. When a character string which is the same as that input to the SOR coding section 14 is not stored in the dictionary 15, the character string input to the SOR coding section 14 is registered in the dictionary 15 and is output to the arithmetic coding section 17 as an SOR code. In the arithmetic coding section 17, the SOR code is arithmetic-coded to produce a piece of multi-valued code data. In this case, the value of a sign bit and the values of upper and lower bits in the multi-valued code data are determined according to count values of the occurrence frequency and the accumulated frequency of each of the character strings in the arithmetic coding section 17. Thereafter, the multi-valued code data obtained by coding the input character string is output.
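The searching, registering and rearranging operations of the SOR coding section 14 can be sketched as follows. This is a minimal illustration under stated assumptions and not the disclosed apparatus: the class name, the tuple form of the SOR code and the choice to output the registration number held before rearrangement are all hypothetical. The arithmetic coding section 17 is omitted.

```python
class SelfOrganizingDictionary:
    """Hypothetical model of the dictionary 15 with frequency-ordered entries."""

    def __init__(self):
        self.entries = []  # index in this list = registration number
        self.counts = {}   # occurrence frequency of each character string

    def encode(self, string):
        """Return an SOR code for one input character string."""
        if string in self.counts:
            number = self.entries.index(string)  # search the dictionary
            self.counts[string] += 1
            # Rearrange: higher frequency -> lower registration number.
            # list.sort is stable, so equal counts keep their earlier order.
            self.entries.sort(key=lambda e: -self.counts[e])
            return ("number", number)
        # Not registered yet: register it and output the string itself.
        self.counts[string] = 1
        self.entries.append(string)
        return ("literal", string)
```

When the assumed strings "ab", "cd", "cd", "cd" are input in this order, the third input is output as the registration number 1 and the fourth as the registration number 0, because "cd" has been moved ahead of "ab" by the rearrangement.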
2.4. Problems To Be Solved By The Invention
However, in the first apparatus, the pieces of candidate data relating to the converted data DT are searched one after another according to the dictionary searching list, the particular candidate data agreeing with the input data DIN is found out from the candidate data, and a reference numeral of the particular candidate data is output as a piece of coded data DOUT relating to the input data DIN. Therefore, even though the order of a plurality of input data strings is predicted to some extent, it is required to search the candidate data registered in a dictionary having a connected-list structure one after another according to the dictionary searching list.
For example, in a case where a piece of input data indicating a sentence in which the word "and" is frequently used is input to the first apparatus, the letter "n" follows the letter "a" at a high probability. Also, the letter "u" follows the letter "q" at a high probability. However, even though the input data indicating the sentence in which the word "and" is frequently used is input to the first apparatus, it is required to search the pieces of candidate data relating to the converted data DT one after another. Therefore, there is a problem that a wasteful dictionary searching time and a wasteful data transmission time are required and the data processing cannot be performed at a high speed.
Also, in the second apparatus, when a character string which is the same as that stored in the dictionary 15 is input to the SOR coding section 14, the registration number of the character string in the dictionary 15 is output to the arithmetic coding section 17 as an SOR code, the SOR code is arithmetic-coded in the arithmetic coding section 17, and the multi-valued code data is output. Therefore, even though the order of a plurality of character strings input to the second apparatus is predicted, it is required to refer to the dictionary 15 to recognize whether or not a character string which is the same as that input to the second apparatus is registered in the dictionary 15, and it is required to output the registration number of the character string to the arithmetic coding section 17 as an SOR code.
Therefore, there is a problem that the data processing cannot be performed at a high speed, in the same manner as in the first apparatus.
Here, in a data compressing apparatus disclosed in a Published Unexamined Japanese Patent Application No. 68219 of 1993 (H3-68219), the compressing is performed according to the occurrence frequency of each piece of input data by applying Huffman coding. In this data compressing apparatus, the occurrence probability of each single character is calculated, and a variable-length code is allocated to each single character. Therefore, because the non-uniformity of the occurrence frequencies of the single characters is used in this data compressing apparatus, even though the order of a plurality of character strings input to the second apparatus is predicted to some extent, it is required to calculate the occurrence probabilities (or the occurrence frequencies) of the single characters. As a result, there is a problem that the compression efficiency cannot be heightened for data strings that occur uniformly.
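The limitation described above can be illustrated numerically with the following minimal sketch. It assumes that "uniformly occurring" means that every single character has the same occurrence frequency, and it compares the entropy of single characters with the entropy of a character given its preceding character. The function names and the test string are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def order0_entropy(text):
    """Bits per character when each character is treated independently."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_conditional_entropy(text):
    """Bits per character when the preceding character is known."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    n = len(text) - 1
    h = 0.0
    for (prev, _cur), c in pair_counts.items():
        # Probability of the pair times the information of cur given prev.
        h -= (c / n) * math.log2(c / context_counts[prev])
    return h
```

For the assumed string "abcd" repeated 8 times, the single-character entropy is 2 bits per character, so a code allocated to single characters gains nothing over a fixed 2-bit code, whereas the conditional entropy is 0 because each character is fully determined by the preceding one.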