1. Technical Field
The present invention is directed to data compression and decompression, and more particularly to a method and system for dot-matrix font data compression and decompression.
2. Description of the Related Art
A method for compressing dot-matrix font data should be lossless, i.e., the compressed result must decompress back to the exact original dot-matrix font image. Some lossless font data compression techniques employ the entropy coding approach. According to the theory of entropy coding, in a set of n image patterns (numbered 1 to n), an image pattern that occurs more often is assigned a shorter bit-length code, while an image pattern that occurs less often is assigned a longer bit-length code. Thus, the average bit-length of the codes assigned to the image patterns in the set is:
$$A = \sum_{i=1}^{n} l_i P_i \qquad \text{(Eq. 1)}$$

where $l$ is the code bit-length and $P$ is the probability of an image pattern occurring in the set, such that $l_i$ means the code bit-length of the $i$th image pattern and $P_i$ means the probability of the $i$th image pattern, where

$$\sum_{i=1}^{n} P_i = 1. \qquad \text{(Eq. 2)}$$
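As a rough illustration of Eq. 1 and Eq. 2, the following Python sketch computes the average code bit-length for a small, made-up set of four image patterns. The function name `average_code_length` and the sample probabilities and code lengths are assumptions chosen for illustration; they do not come from any actual font.

```python
def average_code_length(code_lengths, probabilities):
    """Return A = sum(l_i * P_i) (Eq. 1) after checking sum(P_i) = 1 (Eq. 2)."""
    if len(code_lengths) != len(probabilities):
        raise ValueError("need one code length per probability")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1 (Eq. 2)")
    return sum(l * p for l, p in zip(code_lengths, probabilities))


# Hypothetical set of four image patterns; the most frequent gets the shortest code.
probabilities = [0.5, 0.25, 0.125, 0.125]
code_lengths = [1, 2, 3, 3]
print(average_code_length(code_lengths, probabilities))  # 1.75
```

With these sample values the average is 1.75 bits per pattern, compared with the 2 bits that a fixed-length code for four patterns would need.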
In dot-matrix font data, each character is defined as a collection of (activated) dots in a matrix. For example, a Chinese character may be defined as a collection of dots in a 16×16 matrix, as shown in FIG. 1A. FIG. 1A shows one character 20 in a 16×16 matrix, including an upper straight portion 20a and a lower bifurcated portion 20b. Typically, an Asian font is composed of a large number of characters. For example, the Ming-style Chinese font includes 13,108 characters. For the purpose of applying the entropy coding approach to compressing dot-matrix Asian font data, treating each Asian character as an image pattern may be impractical due to the sheer number of characters in each set. Instead, each dot-matrix character can be divided into the rows (or columns) of the matrix, with each row (or column) treated as an image pattern. For example, in FIG. 1A, the 16×16 matrix can be divided into sixteen rows 1-16, wherein each row includes sixteen pixels. In the illustrated example, the first row (row 1) has its 7th and 8th pixels activated, and thus forms an image pattern that can be expressed as 0000001100000000, where 1 represents an activated pixel and 0 represents a non-activated pixel. The second through sixteenth rows can be expressed in the same manner, and each character can be expressed as a collection of sixteen image patterns.
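The row-wise division described above might be sketched in Python as follows. The glyph used here is a hypothetical two-pixel-wide vertical stroke, not the character 20 of FIG. 1A, and the helper name `row_patterns` is an assumption introduced for illustration.

```python
def row_patterns(matrix):
    """Return one bit-string image pattern per row of a dot-matrix character."""
    return ["".join("1" if dot else "0" for dot in row) for row in matrix]


SIZE = 16
# Hypothetical glyph: a vertical stroke occupying the 7th and 8th pixels of every row
# (indices 6 and 7 when counting from zero).
glyph = [[1 if col in (6, 7) else 0 for col in range(SIZE)] for _ in range(SIZE)]

patterns = row_patterns(glyph)
print(patterns[0])         # 0000001100000000
print(len(set(patterns)))  # 1
```

Because every row of this sample glyph is identical, its sixteen rows contain only one distinct image pattern.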
Two rows that differ in the number and/or distribution of activated pixels (dots) constitute two different image patterns. Thus, in the example of the 16×16 matrix, each character may consist of 1 to 16 image patterns. For example, a character may consist of one image pattern that appears repeatedly through multiple rows, as in the upper straight portion 20a of the character 20 in FIG. 1A. In the following, the phrase “image patterns” means different image patterns, unless specifically characterized as the same image pattern that appears multiple times.
Consider a set of m 16×16 dot-matrix characters that together contain n image patterns. Assume further that all sixteen image patterns in each character are different from each other and from those of every other character, i.e., n=m×16. In this extreme case, every image pattern has the same probability of occurring in the set, i.e., every dot-matrix character in the set is considered randomly generated. Of course, in this extreme case, no compression is possible because the set of compressed characters would require the same amount of storage as the original dot-matrix font data, i.e., 16×16×m bits. However, this extreme case does not happen in reality because dot-matrix font data are not randomly generated. Rather, dot-matrix font data are constructed in an orderly manner such that, for example, some image patterns that constitute basic components (radicals) of Asian characters occur more often than other image patterns. Thus, typically, the occurrences of dot-matrix font image patterns in a character set (some occurring more often than others) can be represented statistically as a normal (or Gaussian) distribution function.
Theoretically, a set of dot-matrix font characters, each sized as a 2k×2k matrix (k: positive integer), can be compressed such that the average bit-length of the codes assigned to the image patterns in the set is shorter than 2k, i.e., 2k×g where g<1. The compression ratio is defined as 1:g, i.e., the ratio of 1 byte of original storage to g bytes of storage after compression.
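To make the factor g concrete, the following Python sketch computes it for the 16×16 example. The 6-bit average code length is a made-up value, the function name `compression_factor` is an assumption, and the calculation ignores the storage needed for any code table.

```python
def compression_factor(num_chars, rows_per_char, row_bits, average_code_bits):
    """Return g = compressed bits / original bits, ignoring code-table overhead."""
    original_bits = num_chars * rows_per_char * row_bits
    compressed_bits = num_chars * rows_per_char * average_code_bits
    return compressed_bits / original_bits


# 13,108 characters of 16 rows x 16 bits each; the 6-bit average is hypothetical.
g = compression_factor(13108, 16, 16, average_code_bits=6)
print(f"compression ratio 1:{g}")  # compression ratio 1:0.375
```

Under these assumed numbers, every 16-bit row is replaced on average by a 6-bit code, so g equals 6/16.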
Usually, entropy coding is implemented as the Huffman coding method known in the art, which includes the steps of: (1) calculating the probability of occurrence of each image pattern in a set; (2) sorting the image patterns according to their occurrence probability from high to low; (3) starting from the low end, merging the two lowest-probability entries into a node and repeating step (2) until all image patterns are merged into a single node, or the root, of a leveled tree, i.e., until a Huffman tree is constructed; (4) tracing paths of the tree from the root to the leaves while assigning shorter bit-length codes to the image patterns with higher occurrence probability and assigning longer bit-length codes to the image patterns with lower occurrence probability; and (5) generating a Huffman table that records the compressed codes (the generated codes having a shorter or longer bit-length) and the corresponding image patterns that the codes represent. The average bit-length of codes generated by the Huffman coding method is close to the theoretical average of entropy coding.
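The steps above can be sketched in Python as follows. This is a minimal illustration of textbook Huffman coding under stated assumptions, not the implementation of any particular font product: it uses a priority queue in place of the repeated sorting of step (2), builds the codes while merging rather than tracing the finished tree, and the name `huffman_codes` and the sample rows are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(patterns):
    """Map each distinct pattern to a prefix-free bit string based on its frequency."""
    frequencies = Counter(patterns)  # step (1): occurrence counts
    if len(frequencies) == 1:
        # Degenerate set with a single distinct pattern: give it a one-bit code.
        return {next(iter(frequencies)): "0"}
    tie_breaker = count()  # keeps heap entries comparable when counts are equal
    heap = [(freq, next(tie_breaker), {pattern: ""})
            for pattern, freq in frequencies.items()]
    heapq.heapify(heap)  # stands in for the sorting of step (2)
    while len(heap) > 1:
        # Step (3): merge the two least frequent nodes into one node.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        # Step (4): prefix 0 to one subtree and 1 to the other.
        merged = {p: "0" + c for p, c in codes_a.items()}
        merged.update({p: "1" + c for p, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), merged))
    return heap[0][2]  # step (5): the table of patterns and their codes


# Hypothetical 16-row sample in which one row pattern dominates.
rows = (["0000001100000000"] * 8 + ["0000000000000000"] * 4
        + ["1111111111111111"] * 2 + ["0000001111000000"] * 2)
table = huffman_codes(rows)
print(sorted(len(code) for code in table.values()))  # [1, 2, 3, 3]
print(sum(len(table[row]) for row in rows))          # 28
```

In this sample the sixteen rows encode to 28 bits instead of 256, not counting the table itself, which a decoder would also need.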
However, the effectiveness of lossless font data compression using the Huffman coding method depends heavily on the size of the tree: the larger the tree, the poorer its performance. In other words, the size of a Huffman tree should be kept reasonably small. Referring to FIG. 1B, a curve 21 represents an ideal distribution function curve, wherein the horizontal axis indicates different image patterns and the vertical axis indicates the number of occurrences of each image pattern in the set, i.e., the occurrence probability of each image pattern. The curve 21 has an ideal shape in that those image patterns that occur more often are plotted on or around the mean (vertical) axis, while other image patterns that occur less often are plotted toward both sides of the curve, in a distributed manner along a generally bell-shaped curve. For the sake of illustration, a triangle 22 is shown in FIG. 1B, which connects the apex 21a of the curve 21 to the leftmost and the rightmost cross-points 21b, 21c of the curve 21 with the horizontal axis. The triangle 22 is used to describe the shape of the normal distribution. Generally, a sharper (pointier) triangle yields a better compression result because higher-probability image patterns are concentrated on or around the center line of the triangle 22 (i.e., the mean of the normal distribution) and thus fewer total image patterns exist to form a Huffman tree. On the other hand, a flatter curve 23 represented by a flatter triangle 24 in FIG. 1B yields a poorer compression result. In the extreme case discussed above, where the occurrence probability of every image pattern in the set is the same, the triangle 24 will degenerate into a horizontal line. In this case, there is no point in trying to compress font data, and the Huffman coding becomes completely ineffective.
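The contrast between a sharp and a flat distribution can be illustrated numerically with the Shannon entropy, a standard lower bound on the average code length that is not itself stated in the text above. The two four-pattern distributions below are made-up examples, and `entropy_bits` is a name assumed for this sketch.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits: a lower bound on the average code length per pattern."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


peaked = [0.5, 0.25, 0.125, 0.125]  # a few patterns dominate (sharper triangle)
flat = [0.25, 0.25, 0.25, 0.25]     # every pattern equally likely (horizontal line)

print(entropy_bits(peaked))  # 1.75
print(entropy_bits(flat))    # 2.0
```

Four patterns need 2 bits each under a fixed-length code, so the flat case leaves nothing to gain, whereas the peaked case allows an average of 1.75 bits.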
One experimental result shows that a direct application of the Huffman coding to the set of 13,108 Ming-style Chinese characters sized 16×16 dot-matrix produces a substantially flat triangle like the triangle 24 of FIG. 1B. This means that a very large number of image patterns exist in the set, where each row (or column) in the dot-matrix is considered as forming an image pattern, and that most of these image patterns have nearly the same probability of occurrence. The large number of image patterns leads to a large Huffman tree and, consequently, the compression result becomes very poor. This experimental result also explains why the font market, in particular the Asian font market, currently does not offer any product for compressing/decompressing dot-matrix font data. Thus, a need exists for a practical compression and decompression mechanism for dot-matrix font data, in particular for Asian dot-matrix font data including a large number of characters.