Even as organizations across the world move rapidly towards electronic transactions, sectors such as financial services, insurance, and government continue to deal with enormous amounts of paper-based transactions. Some of these enterprises use paper-based forms such as applications, invoices, examination papers, and other documents as a medium for capturing data. The captured data is submitted for electronic processing, and the content is fed into a business system. Information on such documents may be handwritten, machine printed, or a combination of both, and can be captured and passed into the system for data extraction.
Data extraction may be done either manually or through electronic processing. Manual data entry requires human involvement and may therefore be time-consuming and error-prone. Electronic processing, by contrast, reduces cost while improving processing speed and accuracy. In recent years, organizations have increasingly relied on automatic recognition technology to capture information from paper documents. Various technologies, such as Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), and pattern recognition, are applied to document images to capture and process the data they contain.
Currently, automatic data recognition technology aims to automatically identify and capture document data and to feed appropriate image data to a data extraction engine. Such recognition technology may identify the content of documents, such as the presence of characters, patterns, and the like. More recently, intelligent document recognition techniques such as automatic form recognition have been developed. Such techniques are used to identify both handwritten and machine-entered content. However, more intelligence needs to be added to such solutions when dealing with handwritten data. In some cases, certain handwritten marks that are not part of the actual content need to be identified and removed for more efficient data extraction.
For example, in some cases, the data in documents used in a business flow is validated by a user or an examiner who applies correction marks, such as tick marks and/or cross marks, to the documents. The presence of such user-entered marks poses a significant challenge for automatic extraction of data from the document images. The marks may lead to incorrect decoding, resulting in the extraction of unwanted data from the document images. In document processing systems such as invoice processing, pattern recognition is essential for line-item data extraction. When user-entered marks are present on the line items, blocks may be identified incorrectly, which reduces pattern recognition accuracy.
Thus, there is a need to effectively process document images and remove unwanted data and marks from the images to enable accurate data extraction.