Optical Character Recognition (OCR) is a technique that converts different types of documents, such as scanned documents, photos, and PDF documents, into an editable and searchable form. Typically, an OCR engine receives a scanned image as input and generates an image layer corresponding to the scanned image. The image layer is then processed by first creating a two-dimensional digital representation of the scanned image and then converting that representation into a series of characters in order to generate an OCR layer. In the final output of the OCR engine, the single OCR layer is superimposed as an invisible text layer over the image layer to generate the OCR document. The primary purpose of the OCR layer is to enable a user to copy text content from the OCR document and paste it into another document.
Most existing OCR techniques generate a single OCR layer. Further, the OCR layer may not follow the actual format of the document and may instead be ordered left to right and top to bottom. Thus, the single OCR layer concept does not work well with documents that present information in multiple columns and rows. A few examples of such documents include passports, invoices, bank statements, and computerized receipts. As an example, an invoice may include an address field, a date field, a consumer number field, a product description field, etc. When a user tries to copy certain text content from such a document, some undesired text content may also get copied. For instance, if a user wishes to copy ten lines from column one alone, the selection may automatically extend to the other columns as well. In other words, when a user tries to select and copy only the text of the address field from an OCR invoice, text content lying along the same ‘X’ axis, that is, on the same horizontal lines as the address field, may also get selected automatically. In view of this, there is a need for methods and systems that enable a user to select and copy the text fields of their choice, i.e., the desired text, from an OCR file without the selection extending to undesired content.
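The selection problem described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular OCR engine: the `TextRun` type, the `single_layer_order` and `select_range` helpers, and the sample invoice coordinates are all hypothetical, and the sketch assumes that a text selection simply covers every run between its start and end points in the layer's text order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """One recognized line of text with its top-left position (hypothetical type)."""
    text: str
    x: int       # horizontal position of the left edge
    y: int       # vertical position of the line
    column: int  # layout column the run belongs to (used only for checking)


def single_layer_order(runs):
    """Order runs top to bottom, then left to right, ignoring the column layout."""
    return sorted(runs, key=lambda r: (r.y, r.x))


def select_range(layer, first, last):
    """Return every run from `first` to `last` (inclusive) in the layer's text order."""
    start = layer.index(first)
    end = layer.index(last)
    return layer[start:end + 1]


# Hypothetical two-column invoice header: address on the left, other fields on the right.
address = [
    TextRun("John Smith", 10, 10, 1),
    TextRun("12 Main Street", 10, 30, 1),
    TextRun("Springfield", 10, 50, 1),
]
other_fields = [
    TextRun("Date: 2020-01-15", 300, 10, 2),
    TextRun("Consumer No: 4521", 300, 30, 2),
    TextRun("Invoice", 300, 50, 2),
]

layer = single_layer_order(address + other_fields)

# The user drags from the first address line to the last address line.
selected = select_range(layer, address[0], address[-1])
unwanted = [r.text for r in selected if r.column != 1]

print([r.text for r in selected])
print("Unwanted:", unwanted)
```

In this sketch the selection interleaves the date and consumer number lines with the address lines, because those runs sit between the address lines in the single layer's left-to-right, top-to-bottom order.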