1. Field of the Invention
The present invention generally relates to document image analysis and, more particularly, to color clustering and segmentation using sigma filtering and a fast sequential clustering algorithm.
2. Background Description
In many applications, such as document image analysis and the analysis of digital images, an important processing stage is the segmentation of the image into regions of near-uniform color. The results of this stage are used for further analysis, such as determining the number of colors present in the image, identifying regions of a specific color, and analyzing the geometric features of regions of uniform color. One use of such segmentation is to decide which compression algorithm to apply, based on the content of different regions of the image. Regions with two colors can be compressed more efficiently with bi-level algorithms, regions with a few colors can be compressed with palettized colors, and regions with a large number of colors can be compressed with techniques such as the JPEG (Joint Photographic Experts Group) compression algorithm.
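The compression decision described above can be illustrated with a minimal sketch. The function name `choose_compression` and the `palette_limit` cutoff are hypothetical and chosen only for illustration; the text does not specify where "a few colors" ends and "a large number of colors" begins.

```python
def choose_compression(num_colors, palette_limit=16):
    """Pick a compression family from the number of colors in a region.

    `palette_limit` is an assumed cutoff between "a few colors" and
    "a large number of colors"; it is not a value given in the text.
    """
    if num_colors <= 2:
        return "bi-level"      # two colors: bi-level algorithms
    if num_colors <= palette_limit:
        return "palettized"    # a few colors: palettized colors
    return "jpeg"              # many colors: JPEG-style techniques


# Example: one decision per segmented region.
region_color_counts = [2, 5, 300]
decisions = [choose_compression(n) for n in region_color_counts]
```

In practice the color count fed to such a function would come from the segmentation stage, which is why noise that inflates the apparent number of colors leads directly to a poorer choice of algorithm.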
Making such a determination in an adaptive manner is a challenging problem due to the many sources of noise that exist in imaging systems. Due to this noise, regions that should have uniform color are broken up into several smaller regions with varying color. Furthermore, the number of colors present in an image is not known a priori. This makes it difficult to adapt a clustering or classification technique to such a situation.
One approach has been to smooth the image prior to clustering, so that some of the noise is removed. However, such smoothing also blurs valid image features, such as character edges, resulting in a loss of detail.
It is therefore an object of the present invention to provide an improved method of image segmentation which is both fast and retains edge information.
According to the invention, there is provided a technique, based on sigma filtering, to achieve compact color clusters. The idea is to smooth only between the edges, while leaving edge information intact. In addition, a fast sequential clustering algorithm is used, which relies on parameters estimated during the sigma filtering operation. The advantage of the sequential clustering algorithm is that it is much faster than the iterative techniques reported in the literature.
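The two ideas in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed method: the sigma filter shown is the generic form in which a pixel is averaged only with window neighbors of similar color, and the clustering is a generic single-pass scheme. The names `sigma_filter` and `sequential_cluster`, the per-channel tolerance `delta`, and the distance `threshold` are hypothetical; here they are fixed by hand, whereas the text states that the clustering parameters are estimated during sigma filtering without saying how.

```python
def sigma_filter(image, radius=1, delta=30):
    """Average each pixel with the window neighbors whose color is within
    `delta` of it on every channel, so pixels across an edge are left out.
    `image` is a list of rows of (R, G, B) tuples."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            center = image[y][x]
            sums, count = [0, 0, 0], 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    p = image[ny][nx]
                    if all(abs(p[c] - center[c]) <= delta for c in range(3)):
                        for c in range(3):
                            sums[c] += p[c]
                        count += 1
            # count >= 1 because the center pixel always matches itself.
            row.append(tuple(s / count for s in sums))
        out.append(row)
    return out


def sequential_cluster(pixels, threshold=40.0):
    """Single pass over the pixels: join the nearest cluster if it lies
    within `threshold`, otherwise start a new cluster. Centers are kept
    as running means, and no iteration over the data is repeated."""
    centers, counts, labels = [], [], []
    for p in pixels:
        best, best_dist = -1, float("inf")
        for i, c in enumerate(centers):
            d = sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
            if d < best_dist:
                best, best_dist = i, d
        if best >= 0 and best_dist <= threshold:
            counts[best] += 1
            n = counts[best]
            centers[best] = tuple(c + (a - c) / n
                                  for a, c in zip(p, centers[best]))
        else:
            centers.append(tuple(float(a) for a in p))
            counts.append(1)
            best = len(centers) - 1
        labels.append(best)
    return centers, labels


# Example: a noisy dark region next to a noisy bright region.
image = [
    [(10, 10, 10), (14, 10, 10), (200, 200, 200), (204, 200, 200)],
    [(12, 10, 10), (10, 10, 10), (202, 200, 200), (200, 200, 200)],
]
smoothed = sigma_filter(image, radius=1, delta=30)
flat = [p for row in smoothed for p in row]
centers, labels = sequential_cluster(flat, threshold=40.0)
```

Because neighbors on the far side of the edge fail the `delta` test, the dark and bright regions are each smoothed internally while the boundary between them stays sharp. The single pass in `sequential_cluster` also does not need the number of colors in advance, since clusters are created as they are encountered.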
Applying these techniques to color clustering allows images to be segmented easily and precisely. The results can also be used in applications such as optical character recognition, where accurate segmentation of the text from the background is important.