Document processing systems can scan documents and process the resulting digital images. The processing might include image display (e.g., printing), compression, page segmentation and recognition, and optical character recognition (OCR). Compression reduces the size of the digital images, which in turn reduces the cost of storing and transmitting them. Page segmentation and recognition may be performed to separate natural features (e.g., photos) from text and other graphical features in compound documents. OCR may then be performed on the text.
Scanned document images can be distorted with respect to the original documents. Scanning distortion can be caused by scanner smoothing and integration, electronic noise, and inaccurate measurement of white level. These scanning distortions can blur edges and create noise and artifacts in digital images. Perceptible noise and artifacts can degrade image quality. Both perceptible and imperceptible noise and artifacts can reduce compressibility, and reduced compressibility can increase the cost of storing and transmitting the images. The noise and artifacts can also increase the error rate of processing routines such as OCR.
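The effect of low-level noise on compressibility can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic page (dark bars on a white background), the noise amplitude of 3 gray levels, the helper names, and the use of zlib as a stand-in for whatever lossless compressor a document processing system might actually use.

```python
import random
import zlib


def make_page(width=256, height=256):
    """Build a synthetic 8-bit grayscale page (hypothetical test image).

    White background (255) with dark horizontal bars (0) standing in
    for lines of text.
    """
    rows = []
    for y in range(height):
        value = 0 if (y % 16) < 4 else 255
        rows.append(bytes([value] * width))
    return b"".join(rows)


def add_noise(pixels, amplitude=3, seed=0):
    """Add small uniform noise to every pixel, clipped to the range 0..255.

    An amplitude of a few gray levels is assumed here to model noise
    that would be hard to see but still changes the pixel values.
    """
    rng = random.Random(seed)
    return bytes(
        min(255, max(0, p + rng.randint(-amplitude, amplitude)))
        for p in pixels
    )


def compressed_size(pixels):
    """Size in bytes after lossless compression with zlib (level 9)."""
    return len(zlib.compress(pixels, 9))


clean = make_page()
noisy = add_noise(clean)

clean_size = compressed_size(clean)
noisy_size = compressed_size(noisy)

print("raw size:        ", len(clean))
print("clean compressed:", clean_size)
print("noisy compressed:", noisy_size)
```

In this sketch the two images have the same raw size and look nearly identical, but the noisy version compresses to a much larger output because the compressor can no longer exploit long runs of identical pixel values.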
The way in which a document was created can also lead to distortions in the scanned image. For example, a printed document might contain halftone regions. Distortions such as Moiré patterns can arise from interaction between the halftone patterns and the scanner. Moiré patterns and other halftoning artifacts can also degrade image quality and reduce compressibility.
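One common way to picture the halftone and scanner interaction is as aliasing: a periodic halftone sampled at a pitch close to its own period shows up as a much slower pattern. The sketch below is a simplified one-dimensional model under stated assumptions: the halftone is treated as a cosine, the scanner as ideal point sampling at one sample per sensor pitch, and the frequency of 0.9 cycles per pitch and the function names are chosen only for illustration.

```python
import math


def sample_halftone(halftone_freq, num_samples):
    """Sample a cosine 'halftone' at integer sensor positions.

    halftone_freq is in cycles per sensor pitch (hypothetical units).
    """
    return [
        math.cos(2 * math.pi * halftone_freq * n)
        for n in range(num_samples)
    ]


def alias_frequency(halftone_freq):
    """Apparent frequency after sampling, folded into the range 0..0.5."""
    frac = halftone_freq % 1.0
    return min(frac, 1.0 - frac)


halftone_freq = 0.9  # assumed: halftone slightly coarser than the sensor pitch
num_samples = 40

samples = sample_halftone(halftone_freq, num_samples)
moire_freq = alias_frequency(halftone_freq)

# The sampled values match a slow cosine at the alias frequency,
# which is the low-frequency pattern seen in the scan.
low_freq_pattern = [
    math.cos(2 * math.pi * moire_freq * n)
    for n in range(num_samples)
]

max_diff = max(abs(a - b) for a, b in zip(samples, low_freq_pattern))
print("apparent frequency (cycles per sample):", moire_freq)
print("max difference from slow cosine:", max_diff)
```

Real halftones are two-dimensional dot screens and real scanners integrate over a sensor aperture, so this model only indicates where the slow pattern comes from, not its actual appearance.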
Bleed-through artifacts can occur if a document is printed on both sides. When one side of a double-sided document is scanned, features on the opposite side of the document can be captured. These features appear as artifacts in the scanned digital image, manifested as phantoms of text characters and other dark features from the other side. The bleed-through artifacts can also degrade image quality and reduce compressibility.