Technologies for searching for an informative portion of a particular document are becoming more and more advanced. In day-to-day life, as storing data in electronic form becomes a common practice, related technology has also developed to support data storage in almost every format. For example, nowadays, handwritten documents, images, or receipts can be easily scanned and stored. While data storage technologies are rapidly developing, searching for and retrieving required information from the stored documents is still challenging. There are constraints when one needs to reach a particular portion of the stored data. For example, the stored data may be in an image format that is not searchable.
Many service- and product-providing sectors, such as BPOs (business process outsourcing), call centers, and government offices (e.g., passport offices and license offices), generate and store handwritten and scanned copies of documents (e.g., mortgage applications, insurance claims, and tax returns). For these sectors, such documents are an important part of daily operations. These documents may be obtained from different sources, such as customers, business partners, vendors, governments, and semi-government agencies. Oftentimes, these documents are unstructured and their formats depend on the source from which the documents are obtained. Moreover, this type of data (e.g., the above-described documents) may be stored in massive quantities. As a result, when there is a need to search for certain information in the stored data, one may be required to locate and extract the information manually because the data may include non-searchable images. For example, one may be required to scroll down a page and manually locate the requested information to find or validate it. Therefore, searching for information in a large amount of data can be time-consuming and challenging.
Optical character recognition or interactive voice response (OCR/IVR) techniques are commonly used for converting scanned images of handwritten, typed, or printed data into documents having electronic or mechanical forms so that automated data entry or data review can be enabled with respect to the documents. Accuracy of the documents generated by the OCR/IVR techniques, however, may be difficult to achieve because the underlying data for generating the documents may not be structured or because noise may be introduced during the OCR/IVR conversion process.