Automatic processing of documents and forms requires machine printed text to be separated from handwritten text so that the document can be prepared for scanning. The text is separated so that an optical character recognition (OCR) engine or an intelligent character recognition (ICR) engine can correctly capture and interpret it.
Previous methods for separating machine printed text from handwritten text operated either on clearly isolated documents or at a patch level. This makes them difficult to apply to more general document images, where machine printed text and handwritten text may overlap. For such images, a pixel-level separation would be necessary for reasonable OCR performance.
Another method uses Markov random fields (MRF) to separate text in targeted documents that are highly imbalanced in terms of machine printed text versus handwritten text. The major heuristic is that the handwritten text appears as annotations, so the MRF can effectively smooth large regions of machine printed text unless the evidence of handwritten text is very strong. This assumption is restrictive, since not all documents contain handwritten text only as annotations.
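For illustration only, the smoothing behavior described above can be sketched with a small Ising-style MRF optimized by iterated conditional modes (ICM). This is not the implementation of the method discussed; the function names, the per-pixel handwriting scores (imagined here as the output of some upstream classifier), and the parameters `beta` and `hand_penalty` are all assumptions made for this sketch.

```python
import math

MACHINE, HANDWRITTEN = 0, 1


def unary_cost(p_hand, label, eps=1e-6):
    """Negative log-likelihood of a label given a hypothetical
    per-pixel handwriting score p_hand in [0, 1]."""
    p = min(max(p_hand, eps), 1.0 - eps)
    return -math.log(p if label == HANDWRITTEN else 1.0 - p)


def icm_smooth(p_hand, beta=1.0, hand_penalty=0.5, sweeps=5):
    """Label each pixel MACHINE or HANDWRITTEN with a simple MRF.

    beta         -- cost for each 4-connected neighbor with a different label
    hand_penalty -- extra cost for the HANDWRITTEN label, encoding the
                    assumption that handwriting is rare (annotations only)
    """
    rows, cols = len(p_hand), len(p_hand[0])
    # Start from the per-pixel decision without any smoothing.
    labels = [[HANDWRITTEN if p > 0.5 else MACHINE for p in row]
              for row in p_hand]

    for _ in range(sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbors = [
                    labels[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols
                ]
                best_label, best_cost = labels[r][c], float("inf")
                for label in (MACHINE, HANDWRITTEN):
                    cost = unary_cost(p_hand[r][c], label)
                    cost += beta * sum(1 for n in neighbors if n != label)
                    if label == HANDWRITTEN:
                        cost += hand_penalty
                    if cost < best_cost:
                        best_label, best_cost = label, cost
                if best_label != labels[r][c]:
                    labels[r][c] = best_label
                    changed = True
        if not changed:  # converged to a local minimum of the energy
            break
    return labels


if __name__ == "__main__":
    weak = [[0.1] * 3 for _ in range(3)]
    weak[1][1] = 0.6      # weak handwriting evidence at the center pixel
    strong = [[0.1] * 3 for _ in range(3)]
    strong[1][1] = 0.999  # very strong handwriting evidence
    print(icm_smooth(weak))
    print(icm_smooth(strong))
```

In this toy setting, an isolated pixel with weak handwriting evidence is relabeled as machine printed text by its neighbors, while a pixel with very strong evidence keeps its handwritten label. The same bias is what makes the approach unsuitable for documents in which handwriting is not a sparse annotation.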