1. Field of the Invention
This invention relates generally to information processing. More particularly, the invention relates to methods for discriminating add-on information from a scanned document image having the add-on information and original text.
2. Discussion
With the fast growth of computer-based systems in the past few decades, office workers now commonly use computerized word processing and office business systems to produce, edit and revise documents of all varieties, including printed text documents, spreadsheets, business presentations and the like. While these computerized systems include powerful document editing tools, there are times when it is more expedient to edit or annotate printed documents by simply writing revisions above the text or in the margins of a printed copy of the document. Sometimes, for example, the person making edits or annotations may not have access to an electronic copy of the document and may therefore be unable to use computerized document editing tools.
Moreover, there are also times when the person editing or annotating a printed document may make handwritten changes or additions to the document and then later need a copy of the document in its original, unedited and unannotated form. Unless an extra copy of the document was previously saved in its original form, the original must be reconstituted from the marked-up copy. Anyone who has ever tried to reconstitute a heavily edited document by manually erasing or covering up the edits and annotations understands how tedious and time consuming the process is. Automated methods of separating handwritten annotations from printed text, if developed, could potentially relieve much of the tedium.
The document reconstitution issue aside, hand-drawn annotations present other processing challenges, namely, how to identify and use the hand-drawn annotation to code a document for storage and retrieval in a computerized database. It would be quite useful, for example, if scanned images of paper documents could be categorized, stored and retrieved based on handwritten designations placed on the document prior to scanning. That would allow the user to quickly code a document by hand, leaving the imaging system with the task of identifying and reading the coded instructions and storing the document appropriately.
In general, detecting and using add-on information from a scanned document image can be very important because, once the add-on contents of the document are obtained, they may provide richer information than a static scanned document image. First, the printed text and possibly graphics reflect the original state of the document, while the add-on contents such as handwritten annotations, stamps, etc. reflect the alterations that have been made to the original document. Second, being able to differentiate the post-alteration made to the document can be beneficial to a document management system in several ways. For example, separating the post-alteration may allow the contents of the add-on information to be recovered via OCR/ICR or other pattern recognition/matching techniques. The history of a document may be recorded by restoring the original content from a document containing the post-alteration. Additionally, secure transmission of original document content without leaking add-on information, and efficient compression and storage schemes, may also be achieved. In the case where the original document is already stored in the database, the copy with add-on information need not be stored in its entirety; only the add-on information needs to be stored.
Several attempts have been made to address the need to separate handwritten annotations from printed text. One of them is a method for compressing images of bank checks that separates the handwritten annotations from the static check form. Such a method depends entirely on a document identifier, such as a magnetic ink character recognition (MICR) line, in order to separate the handwritten text from the static check form. However, the requirement of the document identifier limits such attempts to very specialized fields such as x-rays and NMR images, thereby increasing the cost and reducing the availability.
Other limited applications appear in the field of form processing. For example, in form processing, handwritten entries on a form can be extracted using a standard template. This method is useful in processing large numbers of forms having the same format, such as magazine subscription forms, account forms, etc. However, the template has to be replaced when different types of documents are to be processed, because a template can only handle a limited number of document types. In reality, a document management system needs to handle various types of documents such as business letters or forms, images, fax documents, etc. Thus, the form processing method has limited use, and may be very time consuming and ineffective.
While the above-described information processing methods have proven to be effective for their intended use, there remains a need for a new automatic separation technique that truly benefits from the separation of add-on information. Additionally, it would be highly desirable for the new method not to be limited to specific fields or formats, yet to provide highly efficient separation of the add-on information from the original text.