Historic advances in computer technology have made it economical for individual users to have their own computing system, leading to the proliferation of the Personal Computer (PC). Continued advances in computer technology have made these personal computers very powerful, but also complex and difficult to manage. For this and other reasons, there is a desire in many workplace environments to separate the user interface devices, including the display and keyboard, from the application processing parts of the computing system. In this preferred configuration, the user interface devices are physically located at the user's location, while the processing and storage components of the computer are placed in a central location. The user interface devices are then connected to the processing and storage components by some method of communication.
Several commercial techniques exist to support the transmission of these user interface signals over standard networks and some are compared in “A Comparison of Thin-Client Computing Architectures”, Technical Report CUCS-022-00, Jason Nieh, S. Jae Yang and Naomi Novik, Network Computing Laboratory, Columbia University, November 2000. One of the challenges facing all of the techniques described in Nieh is transmission of the vast amounts of display data from the processing components to the remote computer across a standard network of relatively low bandwidth. Image compression and transmission differs from other real-time video applications such as broadcast video or offline applications such as optical character recognition (OCR) in a number of ways. Firstly, the image is derived directly from a digital visual interface (DVI) or equivalent noise-free digital source signal as compared with analog video signals or techniques that scan images that inherently include noise. Secondly, the image includes artifacts such as boxes or borders that are common to computer display images and lend themselves to efficient compression. Thirdly, the images include other characteristics associated with computer display images but not associated with motion video or other natural images such as areas of exactly matched color levels and accurately aligned artifacts which also lend themselves to efficient compression.
Text is a common image type. It is desirable to identify text so it can be compressed separately, allowing lossless reproduction. Once text elements are identified and separated, they can be compressed efficiently. One compression technique is to cache the shape and color of the text elements so they can be reused across different images or different parts of an image.
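The caching idea above can be sketched as follows. This is a minimal illustration, not any specific method described in this section: the `GlyphCache` class, its interface and the hash-keyed store are all hypothetical. Each text element's shape (a binary bitmap) is stored once, and repeated occurrences are encoded as a cache reference plus a color.

```python
# Hypothetical sketch of a glyph cache for text compression: the shape
# (bitmap) of each text element is hashed and stored once; repeated
# occurrences are encoded as a cache reference plus a color, so an
# identical glyph never needs to be retransmitted.
import hashlib

class GlyphCache:
    def __init__(self):
        self._store = {}  # hash key -> glyph bitmap

    def encode(self, bitmap, color):
        """Return ((key, color), cache_hit); store the bitmap on first sight."""
        key = hashlib.sha1(bytes(b for row in bitmap for b in row)).hexdigest()
        hit = key in self._store
        if not hit:
            self._store[key] = bitmap
        return (key, color), hit

    def decode(self, key, color):
        """Reconstruct the glyph: colored where the bitmap is set."""
        return [[color if px else 0 for px in row] for row in self._store[key]]

# A 3x3 "T" glyph reused in two colors: the shape is sent only once.
t_glyph = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
cache = GlyphCache()
(ref1, _), hit1 = cache.encode(t_glyph, 255)   # first sight: cache miss
(ref2, _), hit2 = cache.encode(t_glyph, 128)   # same shape: cache hit
```

Because the second occurrence hits the cache, only the reference and the new color need to be transmitted, which is where the compression gain comes from.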
A second type of image that is desirable for lossless reproduction is the background artifact type. These artifacts include window backgrounds and other large geometric areas with few colors. Background image types may be coded as a set of graphic commands, which allows for highly efficient compression in addition to lossless reproduction. Furthermore, a background frequently remains constant in an otherwise continuously changing display. By including a separate background type, a remote display can reuse historic background information rather than requiring the retransmission of static information. This improves the frame-to-frame compression of the display.
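A minimal sketch of coding a background as a graphic command follows. The command format (`"fill_rect"` tuple) is invented purely for illustration: a flat-colored window background collapses to a single command instead of one value per pixel, and a remote display can keep that command across frames.

```python
# Hedged sketch: a window background that is a single flat color can be
# coded as one graphic command instead of w*h pixel values. The command
# format here is an assumption made for illustration only.
import numpy as np

def encode_background(image):
    """Emit a fill command if the region is one color; else fall back to raw."""
    colors = np.unique(image)
    if colors.size == 1:
        h, w = image.shape
        return ("fill_rect", 0, 0, w, h, int(colors[0]))
    return ("raw", image)

def decode_background(cmd):
    if cmd[0] == "fill_rect":
        _, x, y, w, h, color = cmd
        return np.full((h, w), color, dtype=np.uint8)
    return cmd[1]

bg = np.full((64, 80), 200, dtype=np.uint8)   # flat grey window background
cmd = encode_background(bg)                   # 6-element command vs. 5120 pixels
```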
A third image type is the picture type. Pictures or natural images that have texture or a large number of colors may be compressed using lossy compression algorithms with little or no noticeable difference. By using a lossy algorithm, pictures can be compressed efficiently.
A fourth image type is the object type that includes areas of high contrast such as graphics, icons and text or other low contrast artifacts surrounded by picture areas. Object types may be encoded using lossless or high quality lossy compression methods. Object types may also be cached and reused. The identification of different types of objects within an image for the purposes of image or video compression is standard practice. Different existing algorithms define “an object” in different ways, depending on the method in which the object is handled. However, previous definitions of an “object” fail to define a group of pixels in a way that enables more effective compression.
Accuracy of image type identification affects both the quality of the decompressed image and the compression ratio. While it is important to maximize the compression in this application, it is more important to ensure that areas of text and graphics have been correctly identified so they are reproduced accurately.
Layering an image into multiple planes of different image types is a technique in common use. An image format based on this approach is specified in “Mixed Raster Content (MRC),” Draft ITU-T Recommendation T.44, International Telecommunication Union, Study Group 8, Contribution 10/97. The recommended model defines the image as three planes: a text or graphics plane, a background plane containing continuous tone images and a mask plane. While the recommendation defines the interchange format, it does not provide a method for generating the mask.
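The three-plane model can be illustrated with a short sketch: a binary mask selects, pixel by pixel, between the foreground (text/graphics) plane and the background (continuous-tone) plane. The plane contents below are invented test data; only the selection rule reflects the MRC model.

```python
# Sketch of the three-plane MRC model: a binary mask selects, per pixel,
# between a foreground (text/graphics) plane and a background
# (continuous-tone) plane. T.44 specifies this interchange format but,
# as noted above, not how the mask itself is generated.
import numpy as np

def mrc_compose(mask, foreground, background):
    """Reconstruct the image: mask==1 selects foreground, else background."""
    return np.where(mask.astype(bool), foreground, background)

h, w = 4, 8
background = np.full((h, w), 180, dtype=np.uint8)  # continuous-tone plane
foreground = np.zeros((h, w), dtype=np.uint8)      # black text plane
mask = np.zeros((h, w), dtype=np.uint8)
mask[1:3, 2:6] = 1                                 # pixels belonging to text
image = mrc_compose(mask, foreground, background)
```

Each plane can then be compressed with a method suited to its content (lossless for the mask and foreground, lossy for the background), which is the point of the decomposition.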
Some related methods for generating decomposition masks are found in text extraction methods. A survey of text extraction methods is provided by Jung et al. in “Text Information Extraction in Images and Video: a Survey,” Pattern Recognition 37(5): 977-997 (2004).
A method for identifying text in images is described by Sato, T., et al., in “Video OCR for Digital News Archives,” IEEE International Workshop on Content-Based Access of Image and Video Databases (CAIVD '98), pp. 52-60, 1997. Sato et al. describe a text mask that is generated by filtering the image. The image is filtered using four directional filters that highlight the shape contrast of a text image. The results of the four filtered images are summed and quantized to generate a text image or mask. While filtering an image in multiple directions and summing the results produces a reasonable mask, the approach is computationally intensive and does not take advantage of the characteristics of text in a computer display image. The resulting mask can contain missed and false indications that reduce the compression and image quality.
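The filter-sum-threshold idea can be sketched as follows. This is a hedged reimplementation of the general approach, not Sato et al.'s actual filters: the 3x3 kernels and the threshold value are assumptions chosen for illustration.

```python
# Hedged sketch of directional filtering for text-mask generation: filter
# the image with four directional edge kernels (horizontal, vertical and
# the two diagonals), sum the absolute responses, and threshold the sum
# into a binary mask. Kernels and threshold are illustrative only.
import numpy as np

KERNELS = [
    np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),  # horizontal
    np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),  # vertical
    np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),  # diagonal \
    np.array([[-1, -1,  2], [-1,  2, -1], [ 2, -1, -1]]),  # diagonal /
]

def convolve3x3(image, kernel):
    """Valid 3x3 convolution, padded back to the input size with zeros."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out[1:h-1, 1:w-1] += kernel[dy, dx] * image[dy:h-2+dy, dx:w-2+dx]
    return out

def text_mask(image, threshold=200):
    """Sum the absolute directional responses and threshold to a binary mask."""
    response = sum(np.abs(convolve3x3(image, k)) for k in KERNELS)
    return (response > threshold).astype(np.uint8)

demo = np.zeros((9, 9))
demo[:, 4] = 255            # a single vertical stroke on a flat background
mask = text_mask(demo)      # the stroke is marked; flat areas are not
```

The four convolutions per pixel are what makes the approach computationally expensive, as noted above.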
A method for identifying text in pictures is described by V. Wu, et al., in “Finding Text in Images,” Proceedings of the Second ACM International Conference on Digital Libraries, Philadelphia, Pa., pp. 3-12, 1997. In this method, the filtered image is segmented into strokes and chips. Strokes identify the lines that build characters, and chips identify groups of characters or words. This helps distinguish text from other image content more accurately, which is important for OCR but less necessary for image decomposition. Chip segmentation can also remove small areas of text, such as highlighted words, from the mask, reducing the quality of the mask.
Other related methods for mask generation for image decomposition look at separating high-contrast areas from flat areas so they can be compressed differently.
A method for decomposing an image is disclosed by Li et al. in “Text and Picture Segmentation by the Distribution Analysis of Wavelet Coefficients,” Proceedings of the IEEE International Conference on Image Processing (ICIP), Chicago, Ill., October 1999. This method segments the display into blocks of text, pictures or backgrounds using histograms of wavelet coefficients. While this identifies the image layers and the mask layers, it does so only at block resolution. Blocks of multiple pixels cannot define precise boundaries between these image types. As a result, this method does not provide sufficient compression or image quality.
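A rough sketch of block classification in this spirit follows. It is not Li et al.'s algorithm: a one-level Haar transform stands in for their wavelet analysis, and the block size, coefficient threshold and decision fraction are all assumptions. Text-like blocks produce many large detail coefficients; smooth background blocks produce few. Note that the classification is per block, which is exactly the resolution limitation discussed above.

```python
# Rough sketch of wavelet-based block classification: take a one-level
# Haar transform of a block and inspect the distribution of its detail
# (high-frequency) coefficients. Thresholds are illustrative assumptions.
import numpy as np

def haar_detail_fraction(block):
    """Fraction of large one-level Haar detail coefficients in a block."""
    a = block.astype(np.float64)
    lo   = (a[:, 0::2] + a[:, 1::2]) / 2     # horizontal average
    hi_h = (a[:, 0::2] - a[:, 1::2]) / 2     # horizontal detail
    hi_v = (lo[0::2, :] - lo[1::2, :]) / 2   # vertical detail
    details = np.concatenate([hi_h.ravel(), hi_v.ravel()])
    return np.mean(np.abs(details) > 16)

def classify_block(block):
    """Many large detail coefficients -> text-like; few -> background-like."""
    return "text" if haar_detail_fraction(block) > 0.2 else "background"

flat = np.full((8, 8), 120, dtype=np.uint8)      # smooth background block
stripes = np.zeros((8, 8), dtype=np.uint8)
stripes[:, ::2] = 255                            # high-contrast text-like block
```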
In U.S. Pat. No. 5,949,555, “Image Processing Apparatus and Method,” Sakai et al. describe a method for decomposing an image. This method uses the shape of objects to identify image types and partitions the image into rectangular areas of different image types. A shortcoming of this method is that the image type is not defined at pixel resolution and therefore it is not possible to select the best compression mechanism in all cases, resulting in either lossy compression of critical information or inefficient compression for non-critical information. Another shortcoming of this method lies in its inability to trace anti-aliased text or text on a textured background because these text types do not have hard edges that can be traced to identify the shape.
In U.S. Pat. No. 6,633,670, “Mask generation for multi-layer image decomposition,” Matthews describes a simpler method for decomposing an image. Rather than identifying areas of text, this method identifies areas of high contrast, which likely include text. The method uses pixel gradients to identify areas of high contrast and then clusters the pixels to generate a mask. This method is not capable of accurately distinguishing between text and textured images. Additionally, the method uses only one mask to distinguish between foreground and background image types, thus limiting options for dealing with the variety of image types associated with a computer display.
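The gradient-and-cluster approach can be illustrated with a short sketch. This is a hedged approximation of the technique described, not the patented method itself: the gradient operator, threshold and dilation radius are assumptions, and a simple morphological dilation stands in for the clustering step.

```python
# Hedged sketch of gradient-based mask generation: mark pixels whose
# gradient (difference to a neighbour) exceeds a threshold as high
# contrast, then merge nearby marks with a naive morphological dilation
# standing in for clustering. All parameters are illustrative.
import numpy as np

def high_contrast_mask(image, threshold=64, radius=1):
    img = image.astype(np.int32)
    # Per-pixel gradients against the left and upper neighbours.
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    seeds = np.maximum(gx, gy) > threshold
    # Naive dilation: a pixel joins the mask if any seed lies within `radius`.
    h, w = seeds.shape
    padded = np.pad(seeds, radius)
    mask = np.zeros_like(seeds)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            mask |= padded[dy:dy + h, dx:dx + w]
    return mask.astype(np.uint8)

image = np.full((8, 8), 30, dtype=np.uint8)
image[3, 3] = 230                      # one high-contrast pixel on a flat field
mask = high_contrast_mask(image)
```

As the text notes, a purely gradient-based mask like this cannot tell text apart from textured picture content: any sufficiently sharp transition produces a seed.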
In summary, none of the existing methods decompose a computer display image for both efficient compression and accurate reproduction. None of the methods identify text, objects, background and picture images separately and at pixel resolution. Existing methods that provide reasonable accuracy of text identification are too computationally intensive for practical real-time decomposition. None of the methods take advantage of the image characteristics and artifacts of a computer display to simplify and improve the image decomposition. None of the methods decompose the image by identifying backgrounds as graphic commands that compress well. None of the methods identify text on a background surface, which is highly repetitious and lends itself to efficient compression.