The methods and systems disclosed herein are related to the art of digital image processing and compression.
By way of background, image compression refers generally to the application of data compression to digital images. In effect, the objective is to reduce redundancy in the image data so that the data can be stored or transmitted in an efficient form.
For example, JBIG2 is an image compression standard for bi-level images, developed by the Joint Bi-level Image Experts Group. It is suitable for both lossless and lossy compression. In its lossless mode JBIG2 typically generates files one third to one fifth the size of Fax Group 4 and one half to one quarter the size of JBIG, the previous bi-level compression standard released by the Group.
Tokenization is a powerful tool that clusters text into groups. It has been used in JBIG2 image compression and in scanned text editing (Adobe). In image compression, it outperforms Fax Group 4 by a factor of 3-5 in lossless mode in terms of compression ratio.
Tokenization typically works in the following manner. First, a dictionary is created, which is empty initially. Next, each symbol found in a scanned document is matched to the symbols in the dictionary. If a match is found, the symbol is clustered to the group specified by the dictionary symbol. Otherwise, a new cluster is created and the symbol is added to the dictionary. Although not required, a symbol typically corresponds to a character of text.
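By way of illustration only, the dictionary-building loop described above can be sketched as follows. The sketch assumes that each symbol is a small bitmap given as a tuple of row strings, that two symbols are compared only when they have identical dimensions, and that a match is declared when the number of differing pixels does not exceed a hypothetical `max_mismatch` threshold. These choices are assumptions made for the example and are not the matching rule of any particular JBIG2 encoder.

```python
def hamming(a, b):
    """Count differing pixels between two equal-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_shape(a, b):
    """True when both bitmaps have the same number of rows and columns."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def tokenize(symbols, max_mismatch=0):
    """Cluster symbols against a dictionary that starts out empty.

    Returns (dictionary, labels), where labels[i] is the index of the
    dictionary entry that symbols[i] was clustered with.
    max_mismatch is a hypothetical tolerance, in pixels.
    """
    dictionary = []
    labels = []
    for sym in symbols:
        for idx, ref in enumerate(dictionary):
            if same_shape(sym, ref) and hamming(sym, ref) <= max_mismatch:
                labels.append(idx)          # match found: join that cluster
                break
        else:
            dictionary.append(sym)          # no match: start a new cluster
            labels.append(len(dictionary) - 1)
    return dictionary, labels


# Three toy 3x3 symbols: A, a one-pixel variant of A, and a different shape B.
A = ("010", "111", "010")
A_VARIANT = ("010", "111", "011")
B = ("111", "101", "111")

_, loose_labels = tokenize([A, B, A_VARIANT, A], max_mismatch=1)
_, strict_labels = tokenize([A, B, A_VARIANT, A], max_mismatch=0)
print(loose_labels)   # [0, 1, 0, 0]
print(strict_labels)  # [0, 1, 2, 0]
```

With a tolerance of one pixel the variant joins the cluster of A, whereas with a tolerance of zero it is entered into the dictionary as a new symbol.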
In JBIG2, tokenization can be either lossless or lossy. In lossless mode, the matching error (the difference between the symbol in the document and the matching symbol in the dictionary) is losslessly coded. In lossy mode, the error may be partially or entirely discarded. Furthermore, additional pre-filtering might be performed before tokenization to smooth out shape variations and encourage matching.
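The distinction between the two modes can be illustrated with a simplified sketch. Here the matching error is modeled as a pixel-wise XOR between the document symbol and the dictionary symbol, both given as equal-sized tuples of 0/1 rows. This XOR model and the function names are assumptions made for the example; an actual JBIG2 coder encodes the difference by other means, and the sketch shows only what is kept and what is discarded.

```python
def residual(sym, ref):
    """Pixel-wise XOR of a symbol with its dictionary match (1 = pixels differ)."""
    return tuple(
        tuple(s ^ r for s, r in zip(srow, rrow))
        for srow, rrow in zip(sym, ref)
    )


def reconstruct(ref, res=None):
    """Rebuild a symbol from its dictionary match.

    With the residual (lossless mode) the original symbol is recovered exactly.
    Without it (lossy mode, error entirely discarded) the dictionary symbol
    is returned as the approximation.
    """
    if res is None:
        return ref
    return tuple(
        tuple(r ^ e for r, e in zip(rrow, erow))
        for rrow, erow in zip(ref, res)
    )


sym = ((1, 0), (1, 1))   # symbol as found in the document
ref = ((1, 1), (1, 1))   # matching dictionary symbol
res = residual(sym, ref)

print(reconstruct(ref, res) == sym)  # True: lossless round trip
print(reconstruct(ref) == ref)       # True: lossy mode yields the dictionary symbol
```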
The matching usually does not need to be perfect. It should tolerate slight variations among characters of the same shape. There is a tradeoff between accuracy and coding efficiency. A matching criterion that is too tight may result in too many clusters. In other words, instances of the same character with slight variations may be classified into different clusters. On the other hand, a matching criterion that is too loose may generate too few clusters and run the risk of clustering two different characters into the same group. Misclustering may also be a consequence of pre-filtering. In some cases, filtering changes the shape only slightly, but the change is sufficiently large to cause misclassification.
Morphological dilation is a commonly used image filtering technique. When applied to text, it will typically make characters thicker and smoother. The degree of thickening and smoothing is controlled by the “structuring element” used in the dilation. A detailed description of dilation and structuring elements can be found in R. Gonzalez and R. Woods, Digital Image Processing, Addison-Wesley Publishing Company, 1992, pp. 518-519, 549. In tokenization, dilation is often applied as pre-filtering to reduce the impact of slight variations. By way of example, FIG. 1 represents an image filtered by a dilation with a 2×2 structuring element. FIG. 2 is the result after JBIG2 compression. Note that the number “385,365” in the third line became “385,385” due to misclustering.
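For illustration, binary dilation with a 2×2 structuring element can be sketched as follows. The sketch assumes a bi-level image stored as a list of 0/1 rows, places the origin of the structuring element at its top-left corner, and clips the output to the input size. These conventions are assumptions made for the example; other origin and border conventions are equally valid.

```python
# Offsets (dy, dx) covered by a 2x2 structuring element, origin at top-left.
SE_2X2 = ((0, 0), (0, 1), (1, 0), (1, 1))


def dilate(img, se_offsets=SE_2X2):
    """Binary dilation: every set pixel is stamped with the structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in se_offsets:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:   # clip at the image border
                        out[yy][xx] = 1
    return out


# A single pixel grows into a 2x2 block (strokes become thicker).
dot = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
for row in dilate(dot):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]

# Two pixels separated by a one-pixel gap merge into a single run.
gap = [
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(dilate(gap)[0])  # [1, 1, 1, 1]
```

The second case shows a one-pixel gap being closed by the filter, which hints at how shapes that differ only by a small opening can become harder to tell apart after dilation.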
The exemplary embodiments disclosed herein contemplate new and improved methods and systems that resolve the above-referenced difficulties and others.