Information is frequently presented as a color image or a grayscale image, even though its content could be presented as a binary image. Examples of such information are printed documents, handwritten notes, barcodes, some of the information displayed on TV screens both on teletext and on the normal screen, some presentations on computer monitors, etc.
In some situations, it may be desirable to convert such a color or grayscale image by means of binarization into a binary-valued image, for example in order to facilitate a document analysis or to reduce the amount of data. In most document imaging systems, a binarization process typically precedes the document analysis procedures.
Usually, binarization comprises comparing an original value of each pixel of an image with a threshold value. A binary value for this pixel may then be set to a first value, for instance to black, in case the threshold is exceeded, and to a second value, for instance to white, otherwise. The pixels of one value, for instance all black pixels, may then represent a recognized object, while the pixels of the other value, for instance all white pixels, may represent the background.
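The per-pixel comparison described above can be sketched as follows. This is a minimal illustration only: the function name `binarize`, the 8-bit value range, and the choice of 0 for black and 255 for white are assumptions made for the example and are not taken from any particular system.

```python
def binarize(pixels, threshold, above=0, below=255):
    """Compare every pixel with one fixed threshold.

    pixels    -- list of rows, each row a list of gray levels (assumed 0..255)
    threshold -- the threshold value to compare against
    above     -- output value when the threshold is exceeded (assumed: 0 = black)
    below     -- output value otherwise (assumed: 255 = white)
    """
    return [
        [above if value > threshold else below for value in row]
        for row in pixels
    ]


# Hypothetical 2x2 grayscale image.
image = [[10, 200],
         [130, 90]]
binary = binarize(image, threshold=128)  # [[255, 0], [0, 255]]
```

Whether exceeding the threshold maps to black or to white is a convention; the `above` and `below` parameters allow either choice.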
The task of implementing binarization efficiently can be quite complex. Frequently, the physical dimension of a printed text that must be binarized for a document processing application varies significantly, even on the same page. The complexity increases further if binarization of handwritten notes is to be enabled.
In addition, most conventional object recognition systems were developed specifically for document images acquired by a scanner. However, the popularity of digital cameras is increasing. Mobile phones and other mobile devices are also increasingly equipped with embedded camera components, or with a facility for connecting an accessory device with camera components, which allow taking pictures and recording movies. Digital cameras or embedded camera components could thus be used as a new kind of input interface, for example for a character scanning and recognition functionality. Therefore, the ability to process camera images as well is becoming increasingly important. Advantageously, the mobile device involved is itself equipped with image binarization facilities.
An object in a camera image is more difficult to recognize than an object in a scanner image. The reason is that with a camera, it is more difficult to control the imaging environment than with a scanner. Even if the user is assumed to take the image carefully, the processing of the obtained camera image may be problematic. Firstly, the brightness of a camera image may not be uniform because of uneven lighting or because of an aberration of the camera lens. Secondly, the color level surface of a camera image is smoother than the color level surface of a scanner image. In other words, the edge of an object in the image, for example of a character, is not as clear as in a scanner image, and therefore, the difference in intensity between the foreground and the background varies in many camera images. Thirdly, a camera image may be distorted by sensor noise added to the captured image, as well as by optical blur and vignetting that are due to the optical system of the camera, namely the camera lenses. These problems are especially significant in camera images that are not well focused.
In order to deal with a camera image, the performance of the binarization is thus very important. However, the performance of conventional local binarization approaches depends on parameters employed in the binarization. Even if a particular set of parameters works nicely for one image, these parameters will most likely not be suitable for other images. Moreover, even if an optimal parameter is computed carefully, the binarization approach may fail to preserve important details of the structure of an object. Consequently, several problems have to be considered when binarizing a camera image.
Various methods have been developed to binarize an image. These methods can be classified into global binarization methods and local binarization methods.
In a global binarization method, the most important step consists in determining a global threshold value. This threshold value is then used as the deciding factor for each pixel of the image. This method is based on the assumption that the histogram of the input image is bi-modal. The advantage of this method is its simplicity and its effectiveness for uniform images. However, if the background or the noise characteristics are non-uniform, this approach may result in large errors.
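As one illustration of how a global threshold might be determined from a bi-modal histogram, the following sketch uses Otsu's method, which picks the gray level that maximizes the between-class variance. Otsu's method is not named in the text above; it is chosen here only as a well-known example, and the function name and the 8-bit range are assumptions.

```python
def otsu_threshold(pixels):
    """Return a global threshold chosen by Otsu's method (illustrative choice).

    pixels -- list of rows of integer gray levels, assumed to lie in 0..255.
    Pixels with a value <= the result form one class, the rest the other.
    """
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are already in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Hypothetical image with a dark cluster and a bright cluster.
threshold = otsu_threshold([[10, 12, 11], [200, 205, 198]])
```

For this small example the returned threshold falls between the two clusters, so a single comparison separates them. With a non-uniform background the two histogram modes overlap and no single value separates them well, which is the failure mode noted above.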
In a local binarization method, a dedicated threshold is determined for every pixel of an image based on some local statistics. Many popular local binarization methods are window-based approaches, in which the local threshold for a pixel (i, j) is computed from the gray level values of the pixels in a window centered at (i, j). Various formulas have been proposed for computing such a local threshold.
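A window-based approach of this kind can be sketched as follows. The formula used here, the mean of the window minus a constant offset, is only one simple possibility chosen for illustration; the names `local_mean_binarize`, `half_window` and `offset`, and the convention that pixels darker than the local threshold become black, are assumptions.

```python
def local_mean_binarize(gray, half_window=1, offset=0):
    """Binarize with a per-pixel threshold taken from a centered window.

    gray        -- list of rows of gray levels
    half_window -- window extends this many pixels on each side of (i, j)
    offset      -- constant subtracted from the window mean (assumed formula)
    """
    height, width = len(gray), len(gray[0])
    out = [[255] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Clip the window at the image borders.
            top, bottom = max(0, i - half_window), min(height, i + half_window + 1)
            left, right = max(0, j - half_window), min(width, j + half_window + 1)
            window = [gray[r][c]
                      for r in range(top, bottom)
                      for c in range(left, right)]
            threshold = sum(window) / len(window) - offset
            if gray[i][j] < threshold:
                out[i][j] = 0  # darker than its neighborhood: black
    return out


# Hypothetical 3x3 image with one dark pixel on a bright background.
result = local_mean_binarize([[200, 200, 200],
                              [200, 50, 200],
                              [200, 200, 200]])
# result == [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
```

This direct form recomputes every window from scratch, so its cost grows with the window area; the window size and the offset are exactly the kind of parameters on which the result depends.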
A local binarization method has also been presented by P. D. Wellner in: “Adaptive thresholding for the Digital Desk”, EuroPARC Technical Report EPC-93-110, 1993. The presented method is specifically designed for images containing printed text and uses moving averages to decide the outcome of the binarization. The idea of this method is to run through the image while calculating a moving sum of the last n pixels seen. When the value of a pixel is significantly lower than the moving average, it is set to black; otherwise it is left white. This method requires only one pass through the image. The image is treated as a one-dimensional stream of pixels, and a moving sum that can be used for determining the moving average can be computed directly, or estimated based on the following equation:
$$M_{i+1} = M_i - \frac{M_i}{n} + p_{i+1},$$

where $M_{i+1}$ is the estimate of the moving sum for pixel $p_{i+1}$. The one-dimensional stream of pixels is taken from the camera image according to a scanning method called “boustrophedon”, meaning “as the ox plows” in Greek language. That is, pixels are taken from one row progressing from the left to the right and from the next row progressing from the right to the left, etc. By using this method, a bias from one side of the image over the other is avoided. The estimated moving sum is then used as a local threshold value in accordance with the following equation:
$$P = \begin{cases} 0, & \text{if } p_i < \dfrac{M_i}{n}\left(1 - \dfrac{\alpha}{100}\right); \\ 255, & \text{otherwise,} \end{cases}$$

where $P$ is the resulting binary value for pixel $p_i$, where $n$ is the number of pixels considered in the moving sum, and where $\alpha$ is a fixed percentage value. The term $M_i/n$ represents the actual moving average. Note, however, that the moving sum $M_i$ itself is also frequently referred to as a moving average. A simple extension of this algorithm averages the current threshold with the one from the row above, in order to take account of illumination changes and to consider the vertical axis of the image as well.
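The moving-sum estimate, the boustrophedon scan and the threshold rule described above can be combined in a short sketch. This is not Wellner's original code: the function names, the default values of `n` and `alpha`, and in particular the way the moving sum is initialized (`initial_average * n`) are assumptions made for illustration. The extension that averages with the row above is not included.

```python
def boustrophedon_indices(height, width):
    """Yield (row, col) left-to-right on even rows, right-to-left on odd rows."""
    for row in range(height):
        cols = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
        for col in cols:
            yield row, col


def moving_average_binarize(gray, n=8, alpha=15, initial_average=127):
    """Single-pass binarization with an estimated moving sum.

    gray            -- list of rows of gray levels (assumed 0..255)
    n               -- number of pixels represented by the moving sum
    alpha           -- fixed percentage below the moving average
    initial_average -- assumed starting value of the moving average
    """
    height, width = len(gray), len(gray[0])
    out = [[255] * width for _ in range(height)]
    moving_sum = initial_average * n  # assumption: start from a flat history
    for row, col in boustrophedon_indices(height, width):
        pixel = gray[row][col]
        # M_{i+1} = M_i - M_i / n + p_{i+1}
        moving_sum = moving_sum - moving_sum / n + pixel
        # Black if p_i < (M_i / n) * (1 - alpha / 100), otherwise white.
        if pixel < (moving_sum / n) * (1 - alpha / 100):
            out[row][col] = 0
    return out


# Hypothetical 2x4 image: bright background with one dark pixel.
result = moving_average_binarize(
    [[200, 200, 200, 200],
     [200, 200, 20, 200]],
    n=4, alpha=15, initial_average=200)
# result == [[255, 255, 255, 255], [255, 255, 0, 255]]
```

Because `alpha` is passed once and applied to every pixel, the sketch also makes the fixed-percentage limitation of the method visible.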
A likely problem with this method is that the percentage value α used to select the threshold from the mean is fixed. Experiments show that a single value is insufficient when a large variety of printed text image types is used and when different capturing conditions occur.