1. Field of the Invention
This invention relates to document image processing, and in particular, it relates to a method for removing graphics from a document image while preserving text.
2. Description of Related Art
Document images typically refer to digital images that contain a significant amount of text. Such images may also contain graphics, such as pictures or other types of graphics. Graphics in a document image can make it harder to process the text of the document, for example when the goal is to extract the text of the document to perform OCR (optical character recognition). Therefore, one important step in processing document images is removal of graphics. The goal of graphics removal is to obtain a document with only text for further analysis such as OCR, document authentication, etc.
Many graphics removal methods are known. Some exemplary publications include: C. Xu, Z. Tang, X. Tao, and C. Shi, “Graphic composite segmentation for PDF documents with complex layout”, Document Recognition and Retrieval XX, Feb. 4, 2013; R. Garg, A. Bansal, S. Chaudhury, S. D. Roy, “Text graphic separation in Indian newspapers”, 4th International Workshop on Multilingual OCR, 2013; U.S. Pat. No. 8,634,644 “System and method for identifying pictures in documents,” issued Jan. 21, 2014. Common strategies in known methods for separating text and graphics include using OCR for text recognition, or examining size and/or color of graphics. Some methods use the geometric properties of text characters to separate them from graphics.