The amount of information available via computers has increased dramatically with the widespread proliferation of computer networks, the Internet, and digital storage. With this increase has come the need to transmit information quickly and to store it efficiently. Data compression is a technology that facilitates the effective transmission and storage of information.
Data compression reduces the amount of space necessary to represent information and can be used for many information types. The demand for compression of digital information, including images, text, audio, and video, has been ever increasing. Typically, data compression is used with standard computer systems; however, other technologies also make use of data compression, such as, but not limited to, digital and satellite television and cellular/digital phones.
As the demand for handling, transmitting, and processing large amounts of information increases, the demand for compression of such data increases as well. Although storage device capacity has increased significantly, the demand for information has outpaced capacity advancements. For example, an uncompressed image can require 5 megabytes of space, whereas the same image, once compressed, can require only 2.5 megabytes. Thus, data compression facilitates transferring larger amounts of information. Even with increased transmission rates, such as broadband, DSL, cable modem Internet, and the like, transmission limits are easily reached with uncompressed information. For example, transmission of an uncompressed image over a DSL line can take ten minutes, whereas the same image can be transmitted in about one minute when compressed, thus providing a ten-fold gain in data throughput.
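The arithmetic behind these two examples can be sketched as follows. This is an illustration only: the function names, the decimal megabyte, the 1 Mbit/s link rate, and the 75 MB file size are assumptions chosen so that the uncompressed transfer takes ten minutes, and the transfer model ignores protocol overhead.

```python
def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """Ratio of original size to compressed size (2.0 means half the space)."""
    return original_bytes / compressed_bytes


def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealized transfer time: payload bits divided by the link rate."""
    return size_bytes * 8 / link_bits_per_second


MB = 1_000_000  # decimal megabyte (an assumption for this sketch)

# Storage example from the text: 5 MB compressed to 2.5 MB.
storage_ratio = compression_ratio(5 * MB, 2.5 * MB)

# Transmission example with hypothetical numbers: a 75 MB file on a
# 1 Mbit/s link takes 600 s uncompressed; a 10:1 compression cuts the
# payload, and therefore the idealized transfer time, by a factor of ten.
link_rate = 1_000_000
uncompressed_size = 75 * MB
compressed_size = uncompressed_size / 10

slow = transfer_seconds(uncompressed_size, link_rate)
fast = transfer_seconds(compressed_size, link_rate)
throughput_gain = slow / fast

print(storage_ratio, slow / 60, fast / 60, throughput_gain)
```

Under a fixed link rate, the gain in effective throughput equals the compression ratio, which is why a ten-to-one reduction in size corresponds to a ten-fold reduction in transfer time.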
In general, there are two types of compression: lossless and lossy. Lossless compression allows the exact original data to be recovered after compression, while lossy compression allows the recovered data to differ from the original data. A tradeoff exists between the two modes in that lossy compression provides a better compression ratio than lossless compression because some compromise of data integrity is tolerated. Lossless compression may be used, for example, when compressing critical text, because failure to reconstruct the data exactly can dramatically affect the quality and readability of the text. Lossy compression can be used with images or non-critical text, where a certain amount of distortion or noise is either acceptable or imperceptible to human senses.
Data compression is especially applicable to digital representations of documents (digital documents). Typically, digital documents include text, images, or both. In addition to using less storage space for current digital data, compact storage without significant degradation of quality would encourage digitization of existing hardcopy documents, making paperless offices more feasible. Striving toward such paperless offices is a goal for many businesses because paperless offices provide benefits such as easy access to information, reduced environmental costs, reduced storage costs, and the like. Furthermore, decreasing the file sizes of digital documents through compression permits more efficient use of Internet bandwidth, thus allowing faster transmission of more information and a reduction in network congestion. Reducing the storage required for information, moving toward efficient paperless offices, and increasing Internet bandwidth efficiency are just some of the many significant benefits associated with compression technology.
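The distinction between lossless and lossy compression can be illustrated with a minimal sketch. It uses Python's standard zlib module for the lossless step; the quantization used for the lossy step is a simple stand-in chosen for illustration, not a method described in this document, and the function names and the step size of 16 are assumptions.

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress; the result is byte-for-byte identical."""
    return zlib.decompress(zlib.compress(data))


def lossy_roundtrip(samples: bytes, step: int = 16) -> bytes:
    """Discard low-order detail before compressing (illustrative only).

    Each 8-bit sample is rounded down to a multiple of `step`, so the
    recovered data can differ from the original by up to step - 1.
    """
    quantized = bytes((s // step) * step for s in samples)
    return zlib.decompress(zlib.compress(quantized))


# Lossless: critical text is recovered exactly.
text = b"critical text must survive exactly " * 20
assert lossless_roundtrip(text) == text

# Lossy: synthetic "image" samples come back close to, but not equal to,
# the original values.
samples = bytes((i * 7) % 256 for i in range(1000))
recovered = lossy_roundtrip(samples)
max_error = max(abs(a - b) for a, b in zip(samples, recovered))
print(recovered != samples, max_error < 16)
```

The quantized samples contain fewer distinct values, which is what gives a lossy scheme room to achieve a better compression ratio at the cost of a bounded error.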
Compression of digital documents should satisfy certain goals in order to make the use of digital documents more attractive. First, the compression should enable compressing and decompressing large amounts of information in a small amount of time. Second, the compression should provide for accurate reproduction of the digital document. Additionally, data compression of digital documents should take into account the intended purpose or ultimate use of a document. Some digital documents are employed for filing or for providing hard copies. Other documents may be revised and/or edited. Many conventional data compression methodologies fail to handle re-flowing of text and/or images when viewed, and fail to provide efficient and effective means for compression technology to recognize characters and re-flow them to word processors, personal digital assistants (PDAs), cellular phones, and the like. Therefore, if hard copy office documents are scanned into digital form, current compression technology can make it difficult, if not impossible, to update, amend, or otherwise change the digitized document.
Digital documents generally include a large amount of textual information. Without any compression, a single 8.5 by 11 inch document at 200 dots per inch (dpi) uses almost 2 MB of storage space. Textual information, however, has properties that lend themselves to compression. One approach to compressing textual information is to perform optical character recognition (OCR) on the text and represent the document as a sequence of character codes in a standard alphabet such as ASCII. However, OCR has drawbacks in that it is not completely reliable, particularly with poor quality documents. Noise in the document, varying typefaces, and unusual characters can all produce OCR errors. Additionally, special fonts, foreign languages, and mathematical formulas create special problems.
Another approach for compressing digital documents is to use clustering. Clustering involves finding the connected components of a document (a connected component is a set of connected pixels of a given color) and then searching and analyzing those components to locate similar connected components, referred to as clusters. Clustering can greatly increase compression and can avoid some of the reliability problems of OCR. For example, a single-page 8.5 by 11 inch document at 200 dpi uses almost 2 MB of storage space uncompressed, but only about 200 KB with clustering. The reason for the sharp reduction in file size is that each connected component can be summarized by a position and a pointer to a shape belonging to a dictionary of shapes. The clustering part of the algorithm determines which shape(s) should belong to the dictionary and which shape is the closest to each connected component. Typically, the dictionary of shapes is a fraction of the size of the original document image and can even be shared across pages. The pointers to the shapes can be characterized by a position in the page (X and Y) and a shape number. The X and Y positions can be compressed using the previous position, while the shape indices are compressed using context or a language model. Thus, clustering can greatly increase compression; however, analyzing connected components to find similar connected components (clusters) is generally a computationally intensive process. A single-page or multi-page document can easily have thousands of connected components or more that are compared in order to find similar connected components.
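The dictionary-and-pointer representation described above can be sketched in a few lines. This is a simplified illustration, not the method of any particular system: it assumes a bilevel page given as strings with "#" for ink, uses 8-connectivity, and treats two components as the same shape only when their pixels match exactly, whereas a practical clusterer would use an approximate similarity measure. The function names are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) ink pixels."""
    rows, cols = len(image), len(image[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or (r, c) in seen:
                continue
            # Breadth-first flood fill over the 8 neighbouring pixels.
            queue = deque([(r, c)])
            seen.add((r, c))
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def encode_page(image):
    """Summarize a page as a shape dictionary plus (x, y, shape) pointers."""
    dictionary = []  # unique shapes, in order of first appearance
    index_of = {}    # shape -> index into `dictionary`
    pointers = []    # one (x, y, shape_index) entry per component
    for pixels in connected_components(image):
        top = min(y for y, _ in pixels)
        left = min(x for _, x in pixels)
        # Normalize to the bounding-box origin so position is factored out.
        shape = frozenset((y - top, x - left) for y, x in pixels)
        if shape not in index_of:  # exact match only (a simplification)
            index_of[shape] = len(dictionary)
            dictionary.append(shape)
        pointers.append((left, top, index_of[shape]))
    return dictionary, pointers


page = [
    "#.#..#.#",
    "###..###",
    "#.#..#.#",
    "........",
    ".##.....",
    ".##.....",
]
dictionary, pointers = encode_page(page)
print(len(dictionary), pointers)
```

In this toy page the two identical glyphs on the first three rows share a single dictionary entry, so the page is described by two shapes and three pointers. The exact-match lookup is cheap, but replacing it with a similarity comparison against every candidate shape is where the computational cost noted above arises.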